Web searching

Steve steve at advocate.net
Thu Jul 8 18:45:36 PDT 1999


x-no-archive: yes

=====================

Most of Web Beyond Scope of Search Sites 

Ashley Dunn
Los Angeles Times 7/8/99


If searching the World Wide Web for that one nugget of information
already seems like a bad trip into a quagmire of data, Internet
researchers have bad news for you--the situation is only getting
worse. 

Even the most comprehensive search engine today is aware of no more
than 16% of the estimated 800 million pages on the Web, according to
a study to be published today in the scientific journal Nature.
Moreover, the gap between what is posted on the Web and what is
retrievable by the search engines is widening fast. 

"The amount of information being indexed [by commonly used search
engines] is increasing, but it's not increasing as fast as the amount
of information that's being put on the Web," said Steve Lawrence, a
researcher at NEC Research Institute in Princeton, N.J., and one of
the study's authors. 

The findings, which are generally undisputed by the search engine
companies themselves, raise the specter that the Internet may lead to
a backward step in the distribution of knowledge amid a technological
revolution: The breakneck pace at which information is added to the
Web may mean that more information is lost to easy public view than
made available. 

The study also underscores a little-understood feature of the
Internet. While many users believe that Web pages are automatically
available to the search programs employed by such sites as Yahoo,
Excite and AltaVista, the truth is that finding, identifying and
categorizing new Web pages requires a great expenditure of time,
money and technology. 

Lawrence and his co-author, fellow NEC researcher C. Lee Giles, found
that most of the major search engines index less than 10% of the Web.
Even combined, all the major search engines have indexed only 42% of
the Web, they found. 

The rest of the Web--trillions of bytes of data ranging from
scientific papers to family photo albums--exists in a kind of black
hole of information, impenetrable to surfers unless they have the
exact address of a given site. Even the pages that are indexed take
an average of six months to be discovered by the search engines,
Lawrence and Giles found. 

The pace of indexing marks a striking decline from that found in a
similar study conducted by the same researchers just a year and a
half ago. 

At that time, they estimated the number of Web pages at about 320
million. The most thorough search engine in that study, HotBot,
covered about a third of all Web pages. Combined, the six leading
search engines they surveyed covered about 60% of the Web. 

But the best-performing search engine in the latest study, Northern
Light, covered only 16% of the Web, and the 11 search sites surveyed
reached only 42% combined. 

While Web surfers often complain about retrieving too much
information from search engines, failing to capture the full scope of
the Web would mean surrendering one of the most powerful aspects of
the digital revolution--the ability to seek out and share diverse sources
of information across the globe, said Oren Etzioni, chief technology
officer of the multi-service Web site Go2Net and a professor of
computer science at the University of Washington. 

Etzioni said the mushrooming size of the Web's audience makes the
gulf between what is on the Web and what is retrievable increasingly
important. "There is a real price to be paid if you are not
comprehensive," he said. "There may be something that is important to
only 1% of the people. Well, you're talking about maybe 100,000
people." 

Lawrence and Giles estimated the number of Web pages by using special
software that searches systematically through 2,500 random Web
servers--the computers that hold Web pages. They calculated the
average number of pages on each server and extrapolated to the 2.8
million servers on the Internet. 
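
In rough terms, the estimate works by sampling and extrapolation:
average the number of pages found on the randomly chosen servers,
then scale that up to the total number of servers. The Python sketch
below illustrates that arithmetic with made-up page counts, not the
study's actual sample data.

    import random

    def estimate_total_pages(pages_per_sampled_server, total_servers):
        # Average pages per sampled server, scaled to all servers.
        avg_pages = sum(pages_per_sampled_server) / len(pages_per_sampled_server)
        return avg_pages * total_servers

    # Hypothetical sample: page counts for 2,500 randomly chosen servers.
    sample = [random.randint(1, 600) for _ in range(2500)]
    print(estimate_total_pages(sample, total_servers=2_800_000))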

By using 1,050 search queries posed by employees of the NEC Research
Institute, a research lab owned by the Japanese electronics company
NEC, they were able to estimate the coverage of all the search
engines, ranging from 16% for Northern Light--a relatively obscure
service that ranks 16th in popularity among similar sites--to 2.5%
for Lycos, the fourth-most-popular search engine. 
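
One way to picture the coverage comparison: pool every page any
engine returned for the test queries, then measure what fraction of
that pool each engine found. The sketch below uses hypothetical
engine names and result sets, not the study's data.

    def coverage_by_overlap(results_per_engine):
        # results_per_engine maps engine name -> set of result URLs.
        pooled = set().union(*results_per_engine.values())
        return {name: len(urls) / len(pooled)
                for name, urls in results_per_engine.items()}

    example = {
        "engine_a": {"u1", "u2", "u3", "u4"},
        "engine_b": {"u3", "u4", "u5"},
        "engine_c": {"u1", "u6"},
    }
    print(coverage_by_overlap(example))  # fraction of the pool each engine found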

For search engine companies, the findings of the report were no
surprise. Kris Carpenter, director of search products and services
for Excite, the third-most-popular search engine, said her company
purposely ignores a large part of the Web not so much because of weak
technology but because of a lack of consumer interest. 

"Most consumers are overwhelmed with just the information that is out
there," she said. "It's hard to fathom the hundreds of millions of
pages. How do you get your head around that?" Carpenter said millions
of pages, such as individual messages on Web bulletin boards, make
little sense to index. Kevin Brown, director of marketing for
Inktomi, whose search engine is used by the popular search sites
HotBot, Snap and Yahoo, said that search companies have long been
aware that they are indexing less and less of the Web. But he argued
that users are seeking quality information, not merely quantity.
"There is a point of diminishing returns," he said. "If you want to
find the best Thai food and there are 14,000 results, the question
isn't how many returns you got, but what are the top 10." In fact,
Brown said, the technology already exists to find all 800 million Web
pages, although indexing that much would be costly. 

Inktomi, like most search engines, uses a method called "crawling" in
which a program goes out onto the Internet and follows all the links
on a known Web page to find new pages. The words on that new page are
then indexed so that the page can be found when a user launches a
search. 
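
The crawling loop the article describes can be sketched in a few
lines of Python: fetch a known page, index its words, queue any links
it contains, and repeat. The fetch_page and extract_links callables
are hypothetical stand-ins for real HTTP fetching and HTML parsing,
not any particular engine's code.

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
        # Breadth-first crawl that builds a simple word -> set-of-URLs index.
        index = {}
        seen = set(seed_urls)
        queue = deque(seed_urls)
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            text = fetch_page(url)          # assumed to return page text or None
            if text is None:
                continue
            fetched += 1
            for word in text.lower().split():
                index.setdefault(word, set()).add(url)
            for link in extract_links(text, url):
                if link not in seen:        # queue each new link only once
                    seen.add(link)
                    queue.append(link)
        return index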

The crawling process helps the search engine compile an index made up
of the most popular sites. This method ensures that high-traffic
pages, such as those of the White House or CNN, never go
undiscovered. 

Crawling can unearth an enormous number of new pages. Inktomi, for
example, can record about 20 million pages a day, meaning that it
could find all 800 million pages of the Web in less than two months. 

But storing, searching and delivering that amount of information
would require a daunting volume of computer storage and high-speed
connections to the Internet. Brown added that anyone who wants to be
found can be found, since most of the search engines allow people to
submit their Web pages for manual inclusion in a search index.
Commercial Web sites can also pay for prominent placement on some
indexes. 

Excite's Carpenter said the future of search engines lies not in
bigger indexes but more specialized ones in which everything on a
given subject, such as baseball, could be indexed and displayed. "You
may be covering a huge percentage of the Web, but you're presenting
it in smaller slices," she said. "Lumping everything into one big,
be-everything index would be incredibly overwhelming." 

Lawrence also believes that indexing technologies will eventually
enable the search engines to start gaining on the proliferating data.
NEC, for example, has been developing a so-called "meta-search
engine" named Inquirus that combines the search ability of all major
engines, then lists their results. "I'm pretty optimistic that over a
period of years the trend will reverse," he said. But he added, "The
next 10 to 20 years could be really rough." 
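
The basic meta-search idea, though not Inquirus's actual
implementation, is simple enough to sketch: send one query to several
engines and merge the de-duplicated results. The per-engine search
functions below are hypothetical callables.

    def meta_search(query, engines):
        # engines maps engine name -> callable(query) -> list of result URLs.
        merged, seen = [], set()
        for name, search in engines.items():
            for url in search(query):
                if url not in seen:
                    seen.add(url)
                    merged.append((url, name))  # note which engine found it first
        return merged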

Copyright 1999 Los Angeles Times. All Rights Reserved. 




