SCN: Searches
Steve
steve at advocate.net
Thu Jan 25 08:25:47 PST 2001
x-no-archive: yes
=======================
Mining the "Deep Web" With Specialized Drills
(Lisa Guernsey, NY Times)---Two weeks ago, online newspapers
and magazines were buzzing with news about Linda Chavez,
President Bush's first choice for labor secretary.
But from the results coming up in most popular search engines, you
would never have known it. Instead of retrieving articles about an
illegal immigrant who had lived in Ms. Chavez's home, a Google
search on "chavez" led to several encyclopedia entries on Cesar
Chavez, the American labor leader and advocate of farmworkers'
rights.
Lycos turned up several Web sites with information about Eric
Chavez, an Oakland A's third baseman. On Alta Vista, some of the
first results linked to Ms. Chavez's old columns for an online
magazine, but none of the links provided even a hint of the fact that
she had become front-page news.
"I don't see anything that anyone would feel is relevant to her given
the context of this past week," said Danny Sullivan, the editor of
SearchEngineWatch.com, as he typed "chavez" into other search
engines.
His demonstration illustrated a problem that has long been apparent
to anyone casting about for online news
reports: search engines can be pitifully inadequate, partly because
they rely on Web-page indexes that were compiled weeks before. It
is not just timely material that seems to escape their reach. Pages
deep within Web sites are also often missed, as are multimedia
files, bibliographies, the bits of information in databases and pages
that come in P.D.F., Adobe's portable document format.
In fact, traditional search engines have access to only a fraction of 1
percent of what exists on the Web. As many as 500 billion pieces of
content are hidden from the view of those search engines, according
to BrightPlanet.com, a search company that has tried to tally them.
To many search experts, this is the "invisible Web." BrightPlanet
prefers the term "deep Web," an online frontier that it estimates may
be 500 times larger than the surface Web that search engines try to
cover. And that uncharted territory does not include Web pages that
are behind firewalls or part of intranets.
To dig deeper into the Web, a new breed of search engine has
cropped up that takes a different approach to Web page retrieval.
Instead of broadly scanning the Web by indexing pages from any
links they can find, these search engines are devoted to drilling
further into specialty areas - medical sites, legal documents, even
Web pages dedicated to jokes and parody. Looking for timely
financial data? Try FinancialFind.com. Seeking sketches of
molecular structures or even scientific humor? Biolinks.com may
help.
"Instead of grabbing everything on the Web and then trying to deal
with this big mess," Mr. Sullivan said, these boutique search
engines have decided to do some filtering. "They may say, we'll
pick 40 sites that we know are related to this topic," he said. "And
that means you won't get these irrelevant search results."
Some search engines go even further, sending out finely tuned
software agents, or bots, that learn not only which pages to search,
but also what information to grab from those pages. Either way, the
theory is the same: The smaller the haystack, the better chance of
finding the needle.
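The curated-site approach Mr. Sullivan describes can be sketched in a
few lines. This is only an illustration, not any company's actual
code: the site names, the TOPIC_SITES set and both function names are
hypothetical, and a real service would maintain its list by hand.

```python
from urllib.parse import urlparse

# Hypothetical hand-picked list; the article mentions "40 sites" as an example size.
TOPIC_SITES = {"medline.example.org", "clinic.example.com"}


def in_scope(url, sites=TOPIC_SITES):
    """Return True if the URL's host is a listed site or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in sites)


def filter_links(links, sites=TOPIC_SITES):
    """Keep only the links a topic-specific crawler should follow."""
    return [url for url in links if in_scope(url, sites)]
```

A crawler built this way never indexes a page outside its list, which
is how it avoids irrelevant results at the cost of coverage.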
Finding those smaller haystacks can be a challenge in itself. It is
the same problem faced by patrons who walk into a library, said
Gary Price, a librarian at George Washington University and co-
author of the forthcoming book "The Invisible Web" (CyberAge
Books). People may know to come to the library, but they probably
do not know which reference books to pull off the shelf. Of course, in
such cases, patrons can at least consult a reference librarian. On
the Web, people are usually fending for themselves.
"The end user should have a better idea of all the different options
that exist," Mr. Price said. "But this is easier said than done."
Lately, however, a few specialty search engines have been popping
up on lists of most-visited Web sites - evidence that people are
learning to find them. MySimon, a service that specializes in culling
product prices and information across 2,500 shopping sites, is one
of the most popular. In December, the site attracted 5 million unique
visitors, a huge increase from its 1.9 million visitors a year before,
according to Jupiter Media Metrix, an Internet research firm.
FindLaw.com, a search engine and Web-based directory of legal
information, has as many as 900,000 visitors a month.
Moreover.com, a site that opened in 1999 with a search engine that
gathers headlines from 1,800 online news sources, has also
appeared on Jupiter Media Metrix's reports of Web use, which track
only sites with at least 200,000 visitors a month. Last month, about
340,000 people visited Moreover.com's pages - and that is without
any consumer marketing from the company, which offers the search
engine free as a teaser for businesses that might buy its search
software.
Like most specialty search engines, Moreover manages to find
those news stories because its bots have been designed to hunt for
only specific pages within a specific realm of the Web. They are like
sniffing dogs that have been given a whiff of a scent and are taught
to disregard everything else. Font tags in the source code
underlying the Web page, for example, are a giveaway. Between 6
and 18 words in large type near the top of a Web page look a lot like
headlines. In most cases they are, and the site's bots retrieve them,
using the headline as the link in the list of search results.
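A rough sketch of that headline heuristic, using Python's standard
HTML parser, might look like the following. Only the 6-to-18-word
range comes from the article; the font-size threshold, the meaning of
"near the top" and all names here are assumptions made for
illustration, not Moreover's actual rules.

```python
from html.parser import HTMLParser

MIN_WORDS, MAX_WORDS = 6, 18  # word range quoted in the article
LARGE_FONT_SIZE = 5           # assumption: <font size> of 5 or more is "large type"
TOP_OF_PAGE_CHARS = 2000      # assumption: "near the top" = first 2,000 characters of text


def parse_font_size(value):
    """Convert a <font size="..."> value to a number; "+2" or "-1" offset the default of 3."""
    try:
        if value.startswith(("+", "-")):
            return 3 + int(value)
        return int(value)
    except ValueError:
        return 3


class HeadlineFinder(HTMLParser):
    """Collect runs of text set in large <font> tags near the top of a page."""

    def __init__(self):
        super().__init__()
        self.font_sizes = []  # stack of sizes for currently open <font> tags
        self.chars_seen = 0   # amount of visible text read so far
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self.font_sizes.append(parse_font_size(dict(attrs).get("size") or "3"))

    def handle_endtag(self, tag):
        if tag == "font" and self.font_sizes:
            self.font_sizes.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        in_large_type = bool(self.font_sizes) and self.font_sizes[-1] >= LARGE_FONT_SIZE
        near_top = self.chars_seen < TOP_OF_PAGE_CHARS
        if in_large_type and near_top and MIN_WORDS <= len(text.split()) <= MAX_WORDS:
            self.candidates.append(text)
        self.chars_seen += len(text)


def find_headlines(html):
    """Return the text runs on a page that look like headlines."""
    finder = HeadlineFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

The sketch treats each unbroken run of text as one candidate, so a
headline split by bold or link tags would be missed; a real bot would
need to stitch such pieces together.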
Once in a while, however, those supposed headlines turn out to be
something else, like a copyright disclaimer page. So to filter further,
Moreover's spiderlike bots learn the structure of the Web address,
noting which words and numbers show up between the slashes. If
an address ends with the word "copyright," a bot may decide to
disregard that page. Similar rules are used to categorize the news
articles so that people can narrow their searches before even
entering a search term. "Our spiders are very good readers," said
Nick Denton, Moreover's chief executive.
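The address-based filtering and categorizing described above could be
approximated as follows. The "copyright" rule is the article's own
example; the category table, the example addresses and the function
names are hypothetical, and the real rules are presumably far more
elaborate.

```python
from urllib.parse import urlparse

SKIP_WORDS = {"copyright"}  # the article's example of a page to disregard
# Hypothetical mapping from words between the slashes to news categories.
CATEGORY_WORDS = {"business": "Business", "sports": "Sports", "tech": "Technology"}


def path_segments(url):
    """Return the lowercased pieces of the address that sit between the slashes."""
    return [part.lower() for part in urlparse(url).path.split("/") if part]


def should_skip(url):
    """Return True if the address ends with a word that marks a non-article page."""
    segments = path_segments(url)
    if not segments:
        return False
    last_word = segments[-1].rsplit(".", 1)[0]  # drop an extension such as ".html"
    return last_word in SKIP_WORDS


def categorize(url):
    """Return the first category suggested by the address, or None if there is none."""
    for segment in path_segments(url):
        if segment in CATEGORY_WORDS:
            return CATEGORY_WORDS[segment]
    return None
```

Because the category is read from the address alone, articles can be
sorted before their text is ever examined, which is what lets users
narrow a search before entering a term.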
MySimon also employs bots that are designed to hunt for very
specific information. But first the bots must watch the click-through
routines of MySimon employees who have learned the ins and outs
of particular online shops - like exactly which pages typically
provide prices, sizes or shipping fees. Once trained, the bots follow
those paths themselves, prowling shops for information to put into
databases and then display online. For example, one bot is
assigned to Amazon.com's bookshelves; another is assigned to its
electronics merchandise.
"What we're doing is teaching our agents to shop on behalf of
consumers," said Josh Goldman, president of MySimon.
Meanwhile, general search engines have also decided to offer
smaller fields for foraging. Northern Light has a news search
service that searches a two-week archive of articles on 56 news
wires. It also offers a "geosearch" service that allows people to look
for businesses based within a few miles of a given address. Google
recently opened an "Uncle Sam" area, where people can search for
governmental material.
Services that limit searches to audio or video files - typically found
under the heading "multimedia search" - are now offered on sites
like Alta Vista, Excite and Lycos. And shopping search engines are
linked from almost all of the major search sites.
But again, many Web users do not know that the narrow searching
tools exist. So reference librarians and library Web sites are now
directing their patrons to those areas on the Web. Mr. Sullivan, Mr.
Price and Chris Sherman, a search guide on About.com who is
working with Mr. Price on the "Invisible Web" book, are among the
several information-retrieval experts who have built online
directories of specific search sites. Another tool is the LexiBot, a
downloadable program designed by BrightPlanet to demonstrate the
search technology it sells to businesses. The LexiBot, which costs
$89.95 but is free for the first 30 days, gathers information
simultaneously from 600 search sites and databases - including the
databases that form the basis of specialty search engines.
The harder part may be to change people's behavior. All the
boutique search engines in the world will not alter the fact that the
majority of Web surfers are still inclined to type a single keyword
into a huge, general search engine and hope for the best. The
thought of narrowing a search - by either going to a specialty search
page or clicking through a menu of choices on a general search site -
does not seem to occur to most users, Mr. Sullivan said.
He poses this challenge to the major search sites: Wouldn't search
engines be more helpful if they would automatically narrow a search
without requiring their users to make that realization on their own?
"Can you automatically detect what database to search," he asked in
posing his challenge, "based on what people have typed in?" During
the second week of January, for example, perhaps a search engine
could have been directed to steer people to news sites whenever
they typed in words that made headlines, like "chavez."
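One naive way to attempt what Mr. Sullivan asks for is to compare the
query with words that recur in current headlines. The sketch below is
only an illustration under stated assumptions: the sample headlines
are made up, the stopword list and the "appears in at least two
headlines" threshold are arbitrary, and no search engine named in this
article is known to work this way.

```python
import re
from collections import Counter

STOPWORDS = {"a", "and", "for", "in", "of", "on", "the", "to"}  # assumed, minimal


def words_in(text):
    """Return the set of lowercase alphabetic words in a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def headline_terms(headlines, min_count=2):
    """Return words that appear in at least min_count of the given headlines."""
    counts = Counter()
    for headline in headlines:
        counts.update(words_in(headline) - STOPWORDS)
    return {word for word, n in counts.items() if n >= min_count}


def choose_index(query, hot_terms):
    """Route a query to the news index if it shares a word with current headlines."""
    return "news" if words_in(query) & hot_terms else "web"
```

A rule this simple would also send a search for "cesar chavez" to the
news index that week, which hints at why the article reports mixed
results for automatic narrowing.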
A few search engines have tried to take that step, with mixed
results. For example, when Mr. Sullivan typed "chavez" into the
search box at Ask Jeeves earlier this month, the site pointed to a
recent news story - a link provided by Ask Jeeves' editors who were
assembling information about potential members of a Bush cabinet.
When he ran the same search a few weeks later, the news reports were
nowhere to be found. (Paul Stroube, the company's vice president
for Web production, said that the news link disappeared because
Ms. Chavez was taken off Ask Jeeves' list of President Bush's
nominees.)
Unless the big search engines get better at delivering timely
information, searchers might be better off with Moreover.com and
other news-oriented search services. With those, Mr. Sullivan has
found success. Two weeks ago, in a Moreover search using the
word "chavez," more than 30 relevant stories appeared, at least half
of which had been posted that day.
Copyright 2001 The New York Times Company