An intensive study undertaken by AltaVista, Compaq and IBM reveals that not all pages on the World Wide Web are as well connected as we think.
The Web is shaped like a large bow tie with many underconnected sites out on its hard-to-reach fringes, say the researchers, who hope to use their indexed results to design better search engines and help electronic-commerce sites get noticed.
To determine the Web's structure, the companies used the AltaVista search engine and Compaq's AlphaServer hardware to perform two massive "crawls" of more than 200 million pages by following the 1.5 billion hyperlinks connecting them. Search engines normally perform crawls to create the indexes that help speed up searches, says Jim Schissler, an AltaVista spokesman.
IBM researchers analysed the results and discovered that about a third of all Web pages are in a "strongly-connected core," the knot of the figurative bow tie. You can easily travel between those pages via hyperlinks. Meanwhile, one side of the tie, containing about a quarter of all Web pages, consists of "origination" pages that let you eventually get to the core, but can't be reached from it. Likewise, "termination" pages on the other side of the tie can be reached from the core, but have trouble returning to it. Finally, one-fifth of the pages can't be reached from the core at all, but only from origination or termination pages.
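The four regions can be illustrated on a tiny made-up link graph. The sketch below is not the researchers' method, only a minimal illustration: it assumes you already know one page in the core, and the page names, the `toy_web` data and the `bow_tie` helper are all hypothetical. Pages that the core page can both reach and be reached from form the core; pages that only lead into it are origination pages; pages only reachable from it are termination pages; the rest fall outside.

```python
from collections import deque

def reachable(graph, start):
    """Return every page reachable from start by following links (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def reverse(graph):
    """Return the same graph with every link pointing the other way."""
    rev = {}
    for source, targets in graph.items():
        rev.setdefault(source, [])
        for target in targets:
            rev.setdefault(target, []).append(source)
    return rev

def bow_tie(graph, core_page):
    """Split pages into bow-tie regions, given one page assumed to lie in the core."""
    forward = reachable(graph, core_page)             # pages the core can reach
    backward = reachable(reverse(graph), core_page)   # pages that can reach the core
    core = forward & backward
    all_pages = set(graph) | {t for targets in graph.values() for t in targets}
    return {
        "core": core,
        "origination": backward - core,
        "termination": forward - core,
        "other": all_pages - forward - backward,
    }

# Hypothetical six-page web: a, b and c link in a cycle and form the core.
toy_web = {
    "a": ["b"], "b": ["c"], "c": ["a", "out"],
    "in": ["a", "tendril"], "out": [], "tendril": [],
}
parts = bow_tie(toy_web, "a")
for region, pages in parts.items():
    print(region, sorted(pages))
```

In this toy graph, "tendril" is the kind of page the last group describes: it is linked only from an origination page and cannot be reached from the core.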
"What this research told us is that, in order to get the most complete crawl for an index, you have to have more starting points," Schissler says.
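Schissler's point about starting points can be seen on a small scale. The sketch below is an illustration only, not AltaVista's crawler: the `crawl` function, the page names and the `toy_web` link data are hypothetical. A crawl that starts from a single core page never finds pages that merely link into the core, while adding a second starting point on the origination side picks them up.

```python
from collections import deque

def crawl(links, seeds):
    """Follow links breadth-first from the seed pages; return every page found."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical six-page web: a, b and c form the core; "in" links into it.
toy_web = {
    "a": ["b"], "b": ["c"], "c": ["a", "out"],
    "in": ["a", "tendril"], "out": [], "tendril": [],
}

one_seed = crawl(toy_web, ["a"])          # start from a core page only
two_seeds = crawl(toy_web, ["a", "in"])   # add a starting point outside the core

print(len(one_seed), "pages found from one starting point")
print(len(two_seeds), "pages found from two starting points")
```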
"It used to be you could put a bunch of tags in your site and you'd get more hits," says Amanda Watlington, IProspect.com's director of research. "The bow tie theory suggests that Web and e-commerce sites should make sure they have plenty of links to and from sites within the core to position their own sites in higher-traffic areas of the Web."
Neither Schissler nor Watlington thinks the average computer user needs to take much account of the study's findings or do anything differently. Schissler says the larger indexes that search engine vendors could create using the new theory might have little impact on the quality of results, which is based more on filtering and other selection mechanisms.
Watlington, however, advises taking special care to bookmark pages because they might be outside the core and hard to find later.