Indexing The World Wide Web: The Journey So Far

 Internet & Technology

of 24
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Book Chapter From Next Generation Search Engines: Advanced Models for Information Retrieval, 2011. Available at:
  Indexing The World Wide Web:The Journey So Far Abhishek Das Google Inc., USA  Ankit Jain Google Inc., USA ABSTRACT In this chapter, we describe the key indexing components of today’s web search engines. As the WorldWide Web has grown, the systems and methods for indexing have changed significantly. We present thedata structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discusstrade-offs to best utilize machine resources, and cover distributed processing concept in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk,and data partitioning. We will finish with some thoughts on information organization for the newlyemerging data-forms. INTRODUCTION The World Wide Web is considered to be the greatest breakthrough in telecommunications after thetelephone. Quoting the new media reader from MIT press [Wardrip-Fruin   , 2003]:  “The World-Wide Web (W3) was developed to be a pool of human knowledge, and humanculture, which would allow collaborators in remote sites to share their ideas and all aspects of acommon project.”  The last two decades have witnessed many significant attempts to make this knowledge “discoverable”.These attempts broadly fall into two categories:(1) classification of webpages in hierarchical categories (directory structure), championed by thelikes of Yahoo! and Open Directory Project;(2) full-text index search engines such as Excite, AltaVista, and Google.The former is an intuitive method of arranging web pages, where subject-matter experts collect andannotate pages for each category, much like books are classified in a library. With the rapid growth of theweb, however, the popularity of this method gradually declined. First, the strictly manual editorial processcould not cope with the increase in the number of web pages. Second, the user’s idea of what sub-tree(s)to seek for a particular topic was expected to be in line with the editors’, who were responsible for theclassification. We are most familiar with the latter approach today, which presents the user with akeyword search interface and uses a pre-computed web index to algorithmically retrieve and rank web pages that satisfy the query. In fact, this is probably the most widely used method for navigating throughcyberspace. The earliest search engines had to handle orders of magnitude more documents than previousinformation retrieval systems. In fact, around 1995, when the number of static web pages was believed todouble every few months, AltaVista reported having crawled and indexed approximately 25 millionwebpages. Indices of today’s search engines are several orders of magnitude larger; Google reportedaround 25 billion web pages in 2005 [Patterson, 2005], while Cuil indexed 120 billion pages in 2008[Arrington, 2008]. Harnessing together the power of hundreds, if not thousands, of machines has provenkey in addressing this challenge of grand scale.    2  Figure 1: History of Major Web Search Engine Innovations (1994-2010) Using search engines has become routine nowadays, but they too have followed an evolutionary path.Jerry Yang and David Filo created Yahoo in 1994, starting it out as a listing of their favorite web sitesalong with a description of each page [Yahoo, 2010]. Later in 1994, WebCrawler  was introduced whichwas the first full-text search engine on the Internet; the entire text of each page was indexed for the firsttime. Introduced in 1993 by six Stanford University students,  Excite became functional in December 1995. It used statistical analysis of word relationships to aid in the search process and is part of   AskJeeves  today.  Lycos , created at CMU by Dr. Michael Mauldin, introduced relevance retrieval, prefix matching,and word proximity in 1994. Though it was the largest of any search engine at the time, indexing over 60million documents in 1996, it ceased crawling the web for its own index in April 1999. Today it providesaccess to human-powered results from  LookSmart  for popular queries and crawler-based results from Yahoo for others.  Infoseek  went online in 1995 and is now owned by the Walt Disney Internet Group.  AltaVista, also started in 1995, was the first search engine to allow natural language questions andadvanced searching techniques. It also provided multimedia search for photos, music, and videos.  Inktomi  was started in 1996 at UC Berkeley, and in June of 1999 introduced a directory search engine powered by concept induction technology. This technology tries to model human conceptual classification of content,and projects this intelligence across millions of documents. Yahoo purchased  Inktomi was in 2003.  AskJeeves launched in 1997 and became famous for being the natural language search engine, thatallowed the user to search by framing queries in question form and responding with what seemed to bethe right answer. In reality, behind the scenes, the company had many human editors who monitoredsearch logs and located what seemed to be the best sites to match the most popular queries. In 1999, theyacquired  Direct Hit  , which had developed the world’s first click popularity search technology, and in2001, they acquired Teoma whose index was built upon clustering concepts of subject-specific popularity.Google, developed by Sergey Brin and Larry Page at Stanford University, launched in 1998 and usedinbound links to rank sites. The  MSN Search and Open Directory Project  were also started in 1998, of which the former reincarnated as  Bing  in 2009. The Open Directory, according to its website, “is thelargest, most comprehensive human-edited directory of the Web”. Formerly known as  NewHoo , it wasacquired by AOL Time Warner-owned Netscape in November 1998.All current search engines rank web pages to identify potential answers to a query. Borrowing frominformation retrieval, a statistical similarity measure has always been used in practice to assess thecloseness of each document (web page) to the user text (query); the underlying principle being that thehigher the similarity score, the greater the estimated likelihood that it is relevant to the user. Thissimilarity formulation is based on models of documents and queries, the most effective of which is thevector space model [Salton, 1975]. The cosine measure [Salton, 1962] has consistently been found to bethe most successful similarity measure in using this model. It considers document properties as vectors,and takes as distance function the cosine of the angle between each vector pair. From an entropy-based    3  perspective, the score assigned to a document can be interpreted as the sum of information conveyed byquery terms in the document. Intuitively, one would like to accumulate evidence by giving more weight todocuments that match a query term several times as opposed to ones that contain it only once.   Eachterm’s contribution is weighted such that terms appearing to be discriminatory are favored while reducingthe impact of more common terms. Most similarity measures are a composition of a few statistical values:frequency of a term t in a document d (term frequency or TF), frequency of a term t in a query, number of documents containing a term t (document frequency or DF), number of terms in a document, number of documents in the collection, and number of terms in the collection. Introduction of document-length pivoting [Singhal, 1996] addressed the issue of long documents either containing too many terms, or many instances of the same term.The explosive growth of the web can primarily be attributed to the decentralization of content publication,with essentially no control of authorship. A huge drawback of this is that web pages are often a mix of facts, rumors, suppositions and even contradictions. In addition, web-page content that is trustworthy toone user may not be so to another. With search engines becoming the primary means to discover webcontent, however, users could no longer self-select sources they find trustworthy. Thus, a significantchallenge for search engines is to assign a user-independent measure of trust to each website or webpage.Over time, search engines encountered another drawback [Manning, 2008] of web decentralization: thedesire to manipulate webpage content for the purpose of appearing high up in search results. This is akinto companies using names that start with a long string of As to be listed early in the Yellow Pages.Content manipulation not only includes tricks like repeating multiple keywords in the same color as the background, but also sophisticated techniques such as cloaking and using doorway pages, which servedifferent content depending on whether the http request came from a crawler or a browser.To combat such spammers, search engines started exploiting the connectivity graph, established byhyperlinks on web pages. Google [Brin, 1998] was the first web search engine known to apply link analysis on a large scale, although all web search engines currently make use of it. They assigned each page a score, called PageRank, which can be interpreted as the fraction of time that a random web surfer will spend on that webpage when following the out-links from each page on the web. Another interpretation is that when a page links to another page, it is effectively casting a vote of confidence.PageRank calculates a page’s importance from the votes cast for it. HITS is another technique employinglink analysis which scores pages as both hubs and authorities, where a good hub is one that links to manygood authorities, and a good authority is one that is linked from many good hubs. It was developed by JonKleinberg and formed the basis of Teoma [Kleinberg, 1999].Search engines aim not only to give quality results but also to produce these results as fast as possible.With several terabytes of data spread over billions of documents in thousands of computers, their systemsare enormous in scale. In comparison, the text of all the books held in a small university might occupyonly around 100 GB. In order to create such highly available systems, which continually index thegrowing web and serve queries with sub-second response times, it must optimize all resources: disk,memory, CPU time, as well as disk transfers.This chapter describes how the index of a web-scale search engine organizes all the information containedin its documents. In the following sections, we will cover the basics of indexing data structures, introducetechniques that improve relevance of results, discuss trade-offs to best utilize the machine resources, andcover distributed processing concepts that allow scaling and updating of the index. In particular, we willdelve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will conclude with some thoughts on information organization for the newly emergingdata-forms.    4 ORGANIZING THE WEB In order to avoid linearly scanning each webpage at query time, an index of all possible query terms is prepared in advance. Lets consider the collection of English books in a library. The simplest approachwould be to keep track of all words from the English dictionary that appear in each book. On repeatingthis across all books, we end up with a term-incidence matrix , in which each entry tells us if a specificword occurs in a book or not. Figure 2 shows a sample term-incidence matrix. The collection of documents over which a search engine performs retrieval is referred to as a corpus . So, for a corpus of 1M documents with 100K distinct words, ~10GB (1M x 100K) will be required to hold the index inmatrix form. The corpus itself will require around 4 bytes to encode each distinct word and hence a totalof 4 GB (1M x 1000 x 4) storage if each document is 1000 words long on average. Clearly, lot of space iswasted in recording the absence of terms in a document, and hence a much better representation is torecord only the occurrences.  Figure 2: Term-Incidence Matrix for a sample of English documents   The most efficient index structure is an inverted index : a collection of lists, one per  term , recording thedocuments containing that term. Each item in the list for a term t  , also referred to as a  posting  , records theordinal document identifier  d  , and its corresponding term frequency (TF): <d, tf> . Note that if 4 bytes areused to encode each posting, a term appearing in ~100K documents will result in a posting list of size100KB to 1MB; though most terms will have much shorter posting lists. We illustrate this in Figure 3 for the same example as in Figure 2.  Figure 3: Illustration of Posting Lists for Example from Figure 2 In our example above, all terms in the English dictionary were known before hand. This, however, doesnot hold true on the web where authors create content in a multitude of languages, along with large
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks