Link: http://infolab.stanford.edu/pub/papers/google.pdf
Other references explaining how search works:
Summary:
- Excellent example of a white paper that clearly indicates design decisions tied to reasoning and potential options.
- Assumptions and estimates should always be given a lower or upper bound.
- PageRank
- Does not count links/backlinks from all pages equally, e.g. a backlink from Yahoo carries more weight than one from an unknown website.
- In addition to PageRank, queries are first broken into individual words, which are matched against documents that contain them. The proximity of those words within each document is also used to sort the results.
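The weighted-backlink idea can be sketched as a simplified PageRank iteration (damping factor 0.85, as in the paper; the 3-page graph below is a made-up example, and dangling pages without outlinks are not handled):

```python
def pagerank(links, damping=0.85, iters=50):
    """Simplified PageRank: each page splits its score among its outlinks,
    so a backlink from a high-scoring page contributes more."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical graph: A and B both link to C; C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

C, with two backlinks (one from the relatively well-ranked A), ends up with the highest score; B, with none, the lowest.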
- Indexing
- Parse the document, extract words, and convert each word to a wordID using an in-memory hash table (the lexicon).
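The lexicon step can be sketched as a hash table that hands out sequential wordIDs (a toy in-memory version; the paper's lexicon also has an on-disk component this ignores):

```python
class Lexicon:
    """In-memory hash table mapping each word to a stable wordID.
    A new word gets the next sequential ID; repeats return the same ID."""

    def __init__(self):
        self._ids = {}

    def word_id(self, word):
        # setdefault inserts len(self._ids) only if the word is new.
        return self._ids.setdefault(word, len(self._ids))

lex = Lexicon()
doc_word_ids = [lex.word_id(w) for w in "the quick fox the".split()]
# Both occurrences of "the" map to the same wordID.
```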
Predictions
- Cost to index and store text/HTML is expected to decline over time, which yields favorable scaling properties for centralized systems like Google.
Decisions
- Two sets of inverted barrels (Inverted index)
- What: Inverted barrels generate the list of candidate documents, which should be a fast, consistent op with manageable complexity. The inverted index is built on top of barrels: k/v stores mapping each wordID to a doclist, whose entries pair a docID with its hit list.
- Considerations:
- The keys (docIDs) in a barrel are sorted by docID. → Enables quick merging of doclists for multi-word queries.
- Sorting by # of occurrences of the word in each document → makes answering single-word queries trivial.
- But then merging is significantly more difficult, and the index must be rebuilt whenever the ranking function changes.
- Decision
- Keep two inverted indexes (two sets of inverted barrels): one limited to title/anchor hits and another for all hit lists. A query then only touches the full index in the rare case that the short index does not yield enough candidates.
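The docID-sorted doclists make a multi-word query a linear merge, and the two-tier barrel decision is a simple fallback on top of that. A sketch with toy dicts standing in for the on-disk barrels (the `min_hits` threshold is a made-up parameter):

```python
def intersect_doclists(a, b):
    """Intersect two docID-sorted doclists with a single linear merge pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def candidates(word_a, word_b, short_index, full_index, min_hits=2):
    """Try the short (title/anchor) barrels first; fall back to the
    full barrels only when too few candidates are found."""
    docs = intersect_doclists(short_index.get(word_a, []),
                              short_index.get(word_b, []))
    if len(docs) < min_hits:
        docs = intersect_doclists(full_index.get(word_a, []),
                                  full_index.get(word_b, []))
    return docs

# Toy data: doclists are sorted lists of docIDs.
short_index = {"fast": [1, 3], "search": [3, 9]}
full_index = {"fast": [1, 3, 7], "search": [3, 7, 9]}
docs = candidates("fast", "search", short_index, full_index)
```

Because both lists are already sorted by docID, the merge is O(len(a) + len(b)) with no hashing or re-sorting, which is the point of the sort-by-docID choice above.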
- Compression technique [Repository]
- Tradeoff between speed and compression ratio.
- Selected speed over space savings.
- zlib is faster with a 3:1 compression ratio vs. bzip, which is slower but achieves 4:1.
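The tradeoff is visible even within zlib itself via its compression level (the page content below is a made-up repetitive example):

```python
import zlib

# Level 1 favors speed, level 9 favors ratio; the paper's choice of zlib
# over bzip is the same tradeoff one notch up.
page = b"<html>" + b"example repeated content " * 200 + b"</html>"
fast = zlib.compress(page, 1)
tight = zlib.compress(page, 9)
assert zlib.decompress(fast) == page  # lossless either way
```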
- SQL DB for docStore (ISAM → Indexed Sequential Access Method, i.e. it is sorted and indexed, so insertion and search are both O(log N))
- driven by desire to have single disk seek during a search that contains all required metadata.
- Single source of truth from which all other data structures can be recalculated.
- helps with data consistency and makes development easier
- The repository with all crawled web page documents. [compressed]
- recalculation would also require crawler error files.
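An ISAM-flavoured sketch of the docStore idea: records kept sorted by docID so the lookup point is found by binary search, with one record carrying all per-document metadata (field names here are made up, and real ISAM pages/overflow areas are ignored):

```python
import bisect

class DocStore:
    """Records kept sorted by docID; bisect gives O(log N) lookup and
    O(log N) location of the insertion point. One record holds all the
    metadata needed for a search, mirroring the single-disk-seek goal."""

    def __init__(self):
        self._ids = []   # sorted docIDs (the "index")
        self._rows = []  # metadata rows, parallel to _ids

    def put(self, doc_id, meta):
        i = bisect.bisect_left(self._ids, doc_id)
        self._ids.insert(i, doc_id)
        self._rows.insert(i, meta)

    def get(self, doc_id):
        i = bisect.bisect_left(self._ids, doc_id)
        if i < len(self._ids) and self._ids[i] == doc_id:
            return self._rows[i]
        return None

store = DocStore()
store.put(10, {"title": "c"})
store.put(2, {"title": "a"})
store.put(7, {"title": "b"})
```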
- Alternatives for encoding position in Hit Lists
- simple encoding (ints) vs compact encoding (bits) vs Huffman encoding
- Selected compact encoding: better compression than simple encoding, and simpler than Huffman encoding.
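A hand-rolled compact encoding in the spirit of the paper's 2-byte plain hit (1 bit capitalization, 3 bits font size, 12 bits position, with positions above 4095 capped):

```python
def encode_hit(capitalized, font_size, position):
    """Pack a hit into 16 bits: [cap:1][font:3][position:12]."""
    position = min(position, 4095)  # positions past 4095 are capped
    return (int(capitalized) << 15) | ((font_size & 0x7) << 12) | position

def decode_hit(hit):
    """Unpack a 16-bit hit back into its three fields."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

assert decode_hit(encode_hit(True, 3, 42)) == (True, 3, 42)
```

Compared to storing three plain ints (simple encoding) this is a fixed 2 bytes per hit, and unlike Huffman coding it needs no code tables or bit-by-bit decoding.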
- docIDs are stable and have no collisions for the URLs that generated them.