SCN: Searches

Thu Jan 25 08:25:47 PST 2001

x-no-archive: yes

=======================

Mining the "Deep Web" With Specialized Drills   

(Lisa Guernsey, NY Times)---Two weeks ago, online newspapers 
and magazines were buzzing with news about Linda Chavez, 
President Bush's first choice for labor secretary.   

But from the results coming up in most popular search engines, you 
would never have known it. Instead of retrieving articles about an 
illegal immigrant who had lived in Ms. Chavez's home, a Google 
search on "chavez" led to several encyclopedia entries on Cesar 
Chavez, the American labor leader and advocate of farmworkers' 
rights.   

Lycos turned up several Web sites with information about Eric 
Chavez, an Oakland A's third baseman. On Alta Vista, some of the 
first results linked to Ms. Chavez's old columns for an online 
magazine, but none of the links provided even a hint of the fact that 
she had become front-page news.   

"I don't see anything that anyone would feel is relevant to her given 
the context of this past week," said Danny Sullivan, the editor of 
SearchEngineWatch.com, as he typed "chavez" into other search 
engines.   

His demonstration illustrated a problem that has long been apparent 
to anyone casting about for online news reports: search engines can 
be pitifully inadequate, partly because 
they rely on Web-page indexes that were compiled weeks before. It 
is not just timely material that seems to escape their reach. Pages 
deep within Web sites are also often missed, as are multimedia 
files, bibliographies, the bits of information in databases and pages 
that come in P.D.F., Adobe's portable document format.   

In fact, traditional search engines have access to only a fraction of 1 
percent of what exists on the Web. As many as 500 billion pieces of 
content are hidden from the view of those search engines, according 
to BrightPlanet.com, a search company that has tried to tally them. 
To many search experts, this is the "invisible Web." BrightPlanet 
prefers the term "deep Web," an online frontier that it estimates may 
be 500 times larger than the surface Web that search engines try to 
cover. And that uncharted territory does not include Web pages that 
are behind firewalls or part of intranets.   

To dig deeper into the Web, a new breed of search engine has 
cropped up that takes a different approach to Web page retrieval. 
Instead of broadly scanning the Web by indexing pages from any 
links they can find, these search engines are devoted to drilling 
further into specialty areas - medical sites, legal documents, even 
Web pages dedicated to jokes and parody. Looking for timely 
financial data? Try FinancialFind.com. Seeking sketches of 
molecular structures or even scientific humor? Biolinks.com may 
help.   

"Instead of grabbing everything on the Web and then trying to deal 
with this big mess," Mr. Sullivan said, these boutique search 
engines have decided to do some filtering. "They may say, we'll 
pick 40 sites that we know are related to this topic," he said. "And 
that means you won't get these irrelevant search results."   

Some search engines go even further, sending out finely tuned 
software agents, or bots, that learn not only which pages to search, 
but also what information to grab from those pages. Either way, the 
theory is the same: The smaller the haystack, the better chance of 
finding the needle.   

Finding those smaller haystacks can be a challenge in itself. It is 
the same problem faced by patrons who walk into a library, said 
Gary Price, a librarian at George Washington University and co-
author of the forthcoming book "The Invisible Web" (CyberAge 
Books). People may know to come to the library, but they probably 
do not know which reference books to pull off the shelf. Of course, in 
such cases, patrons can at least consult a reference librarian. On 
the Web, people are usually fending for themselves.   

"The end user should have a better idea of all the different options 
that exist," Mr. Price said. "But this is easier said than done."   

Lately, however, a few specialty search engines have been popping 
up on lists of most-visited Web sites - evidence that people are 
learning to find them. MySimon, a service that specializes in culling 
product prices and information across 2,500 shopping sites, is one 
of the most popular. In December, the site attracted 5 million unique 
visitors, a huge increase from its 1.9 million visitors a year before, 
according to Jupiter Media Metrix, an Internet research firm. 
FindLaw.com, a search engine and Web-based directory of legal 
information, has as many as 900,000 visitors a month.   

Moreover.com, a site that opened in 1999 with a search engine that 
gathers headlines from 1,800 online news sources, has also 
appeared on Jupiter Media Metrix's reports of Web use, which track 
only sites with at least 200,000 visitors a month. Last month, about 
340,000 people visited Moreover.com's pages - and that is without 
any consumer marketing from the company, which offers the search 
engine free as a teaser for businesses that might buy its search 
software.   

Like most specialty search engines, Moreover manages to find 
those news stories because its bots have been designed to hunt for 
only specific pages within a specific realm of the Web. They are like 
sniffing dogs that have been given a whiff of a scent and are taught 
to disregard everything else. Font tags in the source code 
underlying the Web page, for example, are a giveaway. Between 6 
and 18 words in large type near the top of a Web page look a lot like 
headlines. In most cases they are, and the site's bots retrieve them, 
using the headline as the link in the list of search results.   
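
The headline heuristic described above can be sketched in a few lines 
of Python. This is only an illustration, not Moreover's actual code: 
the 6-to-18-word range comes from the article, while the font-size 
threshold, the class name HeadlineFinder and the function 
find_headline are assumptions made for the example. "Near the top of 
the page" is approximated by taking the first match.

```python
from html.parser import HTMLParser

MIN_WORDS, MAX_WORDS = 6, 18   # word range quoted in the article
LARGE_FONT_SIZE = 5            # assumed threshold, not from the article


def is_large(size):
    """Assumed rule: size="5" and up, or size="+2" and up, is large type."""
    if not size:
        return False
    if size.startswith("+"):
        return size[1:].isdigit() and int(size[1:]) >= 2
    return size.isdigit() and int(size) >= LARGE_FONT_SIZE


class HeadlineFinder(HTMLParser):
    """Collect headline-length text found inside large <font> tags."""

    def __init__(self):
        super().__init__()
        self.open_fonts = []   # one flag per open <font> tag: is it large?
        self.buffer = []       # text seen inside the current large tag
        self.candidates = []   # headline candidates, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self.open_fonts.append(is_large(dict(attrs).get("size")))

    def handle_endtag(self, tag):
        if tag != "font" or not self.open_fonts:
            return
        was_large = self.open_fonts.pop()
        # Only judge the text once the outermost large tag has closed.
        if was_large and not any(self.open_fonts):
            text = " ".join(" ".join(self.buffer).split())
            self.buffer = []
            if MIN_WORDS <= len(text.split()) <= MAX_WORDS:
                self.candidates.append(text)

    def handle_data(self, data):
        if any(self.open_fonts):
            self.buffer.append(data)


def find_headline(html):
    """Return the first headline candidate on the page, or None."""
    finder = HeadlineFinder()
    finder.feed(html)
    return finder.candidates[0] if finder.candidates else None
```

A one-word label in large type, such as a section name, falls outside 
the word range and is ignored, which is the point of the rule.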

Once in a while, however, those supposed headlines turn out to be 
something else, like a copyright disclaimer page. So to filter further, 
Moreover's spiderlike bots learn the structure of the Web address, 
noting which words and numbers show up between the slashes. If 
an address ends with the word "copyright," a bot may decide to 
disregard that page. Similar rules are used to categorize the news 
articles so that people can narrow their searches before even 
entering a search term. "Our spiders are very good readers," said 
Nick Denton, Moreover's chief executive.   
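
A rough sketch of that address-based filtering, again in Python and 
again only an illustration: the word "copyright" is the article's 
example, but the category table, the function names and the 
example.com addresses are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

SKIP_WORDS = {"copyright"}     # the article's example of a page to skip
CATEGORY_WORDS = {             # hypothetical mapping, for illustration only
    "politics": "Politics",
    "sports": "Sports",
    "business": "Business",
}


def path_words(url):
    """Return the lowercased words between the slashes, minus extensions."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return [posixpath.splitext(s)[0].lower() for s in segments]


def should_skip(url):
    """Skip a page whose address ends with a word on the skip list."""
    words = path_words(url)
    return bool(words) and words[-1] in SKIP_WORDS


def categorize(url):
    """Return a category based on the first matching path word, or None."""
    for word in path_words(url):
        if word in CATEGORY_WORDS:
            return CATEGORY_WORDS[word]
    return None
```

Under these assumptions, an address such as 
http://news.example.com/info/copyright.html would be skipped, while 
http://news.example.com/politics/2001/01/chavez.html would be kept 
and filed under "Politics".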

MySimon also employs bots that are designed to hunt for very 
specific information. But first the bots must watch the click-through 
routines of MySimon employees who have learned the ins and outs 
of particular online shops - like exactly which pages typically 
provide prices, sizes or shipping fees. Once trained, the bots follow 
those paths themselves, prowling shops for information to put into 
databases and then display online. For example, one bot is 
assigned to Amazon.com's bookshelves; another is assigned to its 
electronics merchandise.   

"What we're doing is teaching our agents to shop on behalf of 
consumers," said Josh Goldman, president of MySimon.   

Meanwhile, general search engines have also decided to offer 
smaller fields for foraging. Northern Light has a news search 
service that searches a two-week archive of articles on 56 news 
wires. It also offers a "geosearch" service that allows people to look 
for businesses based within a few miles of a given address. Google 
recently opened an "Uncle Sam" area, where people can search for 
governmental material.   

Services that limit searches to audio or video files - typically found 
under the heading "multimedia search" - are now offered on sites 
like Alta Vista, Excite and Lycos. And shopping search engines are 
linked from almost all of the major search sites.   

But again, many Web users do not know that the narrow searching 
tools exist. So reference librarians and library Web sites are now 
directing their patrons to those areas on the Web. Mr. Sullivan, Mr. 
Price and Chris Sherman, a search guide on About.com who is 
working with Mr. Price on the "Invisible Web" book, are among the 
several information-retrieval experts who have built online 
directories of specific search sites. Another tool is the LexiBot, a 
downloadable program designed by BrightPlanet to demonstrate the 
search technology it sells to businesses. The LexiBot, which costs 
$89.95 but is free for the first 30 days, gathers information 
simultaneously from 600 search sites and databases - including the 
databases that form the basis of specialty search engines.   

The harder part may be to change people's behavior. All the 
boutique search engines in the world will not alter the fact that the 
majority of Web surfers are still inclined to type a single keyword 
into a huge, general search engine and hope for the best. The 
thought of narrowing a search - by either going to a specialty search 
page or clicking through a menu of choices on a general search site - 
does not seem to occur to most users, Mr. Sullivan said.   

He poses this challenge to the major search sites: Wouldn't search 
engines be more helpful if they would automatically narrow a search 
without requiring their users to make that realization on their own?   

"Can you automatically detect what database to search," he asked in 
posing his challenge, "based on what people have typed in?" During 
the second week of January, for example, perhaps a search engine 
could have been directed to steer people to news sites whenever 
they typed in words that made headlines, like "chavez."   
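
One simple way to picture the routing Mr. Sullivan describes is shown 
below. It is a sketch of the idea only, not how any search engine 
mentioned here works: the function names, the stopword list, the 
"news" and "web" labels and the min_count threshold are all 
assumptions, and the sample headlines in the usage note are invented.

```python
import re
from collections import Counter

# Small, assumed stopword list; a real system would need a fuller one.
STOPWORDS = {"a", "an", "and", "as", "for", "in", "of", "on", "the", "to"}


def trending_terms(headlines, min_count=2):
    """Return words that appear in at least min_count recent headlines."""
    counts = Counter()
    for headline in headlines:
        words = set(re.findall(r"[a-z]+", headline.lower())) - STOPWORDS
        counts.update(words)   # each headline counts a word at most once
    return {word for word, n in counts.items() if n >= min_count}


def choose_database(query, terms):
    """Send the query to news if any query word is in the headlines."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "news" if words & terms else "web"
```

With invented headlines such as "Chavez Faces Questions From Senate" 
and "Labor Groups React to Chavez Choice," the word "chavez" would 
count as trending, so a search on it would be routed to the news 
index, while a search on "molecular structures" would go to the 
general Web index.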

A few search engines have tried to take that step, with mixed 
results. For example, when Mr. Sullivan typed "chavez" into the 
search box at Ask Jeeves earlier this month, the site pointed to a 
recent news story - a link provided by Ask Jeeves' editors who were 
assembling information about potential members of a Bush cabinet. 
When he ran the same search a few weeks later, the news reports were 
nowhere to be found. (Paul Stroube, the company's vice president 
for Web production, said that the news link disappeared because 
Ms. Chavez was taken off Ask Jeeves' list of President Bush's 
nominees.)   

Unless the big search engines get better at delivering timely 
information, searchers might be better off with Moreover.com and 
other news-oriented search services. With those, Mr. Sullivan has 
found success. Two weeks ago, in a Moreover search using the 
word "chavez," more than 30 relevant stories appeared, at least half 
of which had been posted that day.   

Copyright 2001 The New York Times Company   
