How Search Engines Work
Not all search engines work in exactly the same way, but they are similar enough that one can stand in for the rest. As an example I am going to use the working components of Google, which are shown in this schematic diagram taken from the original Google design paper by Sergey Brin and Lawrence Page.
Note that some of the components shown are not discussed here because they are unique to Google (the PageRank module, for instance), and that this is a snapshot of how Google was organized in its earlier days. There have doubtless been many changes and improvements in the past seven years.
1. Spider (or crawler) : A spider is a robotic program that downloads webpages and nothing more. It works much like your browser does when you connect to a web site and download a page, except that the spider has no visual components and only records the actual HTML it sees.
(You can see the same thing by viewing any webpage, then right-clicking and selecting "view source" in your browser.) The spider has to know where to go to find the webpages that the search engine wants, so in Google it gets a list of pages from the URL server and then downloads the contents of each webpage to the Store Server, which in turn compresses it and puts it into the Repository.
2. Repository : a storage warehouse (or database) where all the data the spiders find is sorted and stored in compressed format, including the full HTML content of each page.
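To make the spider, Store Server and Repository steps a little more concrete, here is a minimal sketch in Python. It is only an illustration and not how Google implemented it: the names fetch and Repository are hypothetical, the repository is just an in-memory dictionary, and zlib stands in for whatever compression a real engine would use.

```python
import urllib.request
import zlib


def fetch(url):
    """Spider step (hypothetical helper): download a page and return its raw HTML."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


class Repository:
    """Toy repository: maps a document ID to (url, compressed HTML)."""

    def __init__(self):
        self._docs = {}
        self._next_id = 0

    def store(self, url, html):
        # The Store Server's job: compress the page, then file it away.
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = (url, zlib.compress(html.encode("utf-8")))
        return doc_id

    def load(self, doc_id):
        # Return the URL and the original, uncompressed HTML.
        url, blob = self._docs[doc_id]
        return url, zlib.decompress(blob).decode("utf-8")


# Use a literal page here so the example runs without a network connection.
repo = Repository()
page = "<html><title>Example</title><body>" + "hello " * 50 + "</body></html>"
doc_id = repo.store("http://example.com/", page)
```

In a real run the page text would come from fetch(url) rather than a literal string; the stored copy is what the indexer later uncompresses.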
3. URL server : a server which determines which sites to spider next and gives that information to the spiders. This is an important function because it has two objectives: ensuring that all the pages it knows about are respidered as often as necessary (but not more often than necessary), and at the same time directing the spider to new pages found from links reported in spidered pages.
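One simple way to picture the URL server's two objectives is a queue ordered by the time each page is next due. The sketch below is an assumption of mine rather than Google's actual scheduler: the UrlServer class, its method names and the fixed recrawl_interval are all hypothetical, and time is just a number passed in by the caller.

```python
import heapq


class UrlServer:
    """Toy URL server: hands out URLs that are due to be spidered."""

    def __init__(self, recrawl_interval=7.0):
        self.recrawl_interval = recrawl_interval
        self._queue = []     # heap of (due_time, url)
        self._known = set()  # every URL ever scheduled

    def add_new(self, url, now):
        # A link reported from a spidered page: schedule it once, due immediately.
        if url not in self._known:
            self._known.add(url)
            heapq.heappush(self._queue, (now, url))

    def next_url(self, now):
        # Hand out the most overdue URL, or None if nothing is due yet.
        if not self._queue or self._queue[0][0] > now:
            return None
        _, url = heapq.heappop(self._queue)
        # Re-schedule the page so it is respidered, but not before the interval.
        heapq.heappush(self._queue, (now + self.recrawl_interval, url))
        return url


server = UrlServer(recrawl_interval=7.0)
server.add_new("http://a.example/", now=0.0)
server.add_new("http://b.example/", now=0.0)
server.add_new("http://a.example/", now=1.0)  # already known, so ignored
first = server.next_url(now=1.0)
second = server.next_url(now=1.0)
third = server.next_url(now=1.0)  # both pages were just handed out
```

A real URL server would presumably vary the interval per page, so that frequently changing pages are visited more often than static ones.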
4. Indexer : this program first uncompresses the documents in the repository, then parses each document and converts it into a series of word occurrences called hits. Each hit records the word, its position in the document, the approximate font size and capitalization. The hits are then stored in various places such as the Word Barrels and the Anchors file.
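As a rough illustration of what "converting a document into hits" means, the following sketch strips the tags from a page and records each word with its position and capitalization. It is a simplification under my own assumptions: the Hit, HitParser and parse_hits names are hypothetical, and font size is left out because it cannot be read reliably from bare HTML.

```python
import re
from html.parser import HTMLParser
from typing import NamedTuple


class Hit(NamedTuple):
    word: str          # the word, lowercased
    position: int      # index of the word within the document
    capitalized: bool  # whether the word started with an upper-case letter


class HitParser(HTMLParser):
    """Collects one Hit per word found in the text between tags."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_data(self, data):
        for word in re.findall(r"[A-Za-z0-9]+", data):
            self.hits.append(Hit(word.lower(), len(self.hits), word[0].isupper()))


def parse_hits(html):
    parser = HitParser()
    parser.feed(html)
    parser.close()
    return parser.hits


hits = parse_hits(
    "<html><title>Search Engines</title><body>How search works</body></html>"
)
```

Here the word "search" produces two hits, one capitalized at position 0 and one lower-case at position 3, which is the kind of detail a ranking step can later make use of.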
5. Word barrels : a collection of databases, each of which contains data relating to one particular word and the pages that it occurs on. The word barrels are divided into several collections:
- The forward index - which contains all the words it knows about, converted into Word IDs
- The inverted index - which contains the same information as the forward index, but sorted by the sorter so that the results are easier to find. In addition, Google also creates another inverted index called the short index, which is similar to the inverted index except that it only contains hits from page titles and anchor text.
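The relationship between the forward and inverted indexes can be sketched in a few lines. This is a toy version under my own assumptions: the docs, lexicon, forward and inverted names are hypothetical, the documents are already split into words, and plain dictionaries stand in for the barrels.

```python
from collections import defaultdict

# Three tiny "documents", already reduced to their words.
docs = {
    0: ["search", "engines", "crawl", "pages"],
    1: ["spiders", "crawl", "the", "web"],
    2: ["search", "the", "web"],
}

lexicon = {}  # word -> word ID


def word_id(word):
    # Assign the next free ID the first time a word is seen.
    return lexicon.setdefault(word, len(lexicon))


# Forward index: document ID -> list of (word ID, position in document).
forward = {
    doc_id: [(word_id(w), pos) for pos, w in enumerate(words)]
    for doc_id, words in docs.items()
}

# The sorter's job: regroup the same hits by word ID instead of by document.
inverted = defaultdict(list)
for doc_id in sorted(forward):
    for wid, pos in forward[doc_id]:
        inverted[wid].append((doc_id, pos))
```

With the data regrouped this way, answering "which pages contain the word search?" is a single lookup of inverted[lexicon["search"]] rather than a scan of every document.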
6. The ranking component (labeled in the diagram above as the searcher) : the component that receives the searcher's queries, determines which pages may be relevant to each query, ranks those pages according to its algorithm and presents the results to the searcher as a list of webpages ordered by decreasing relevancy.
This is the portion of the search engine that is most written and wondered about, but since each search engine uses a somewhat different ranking method, it is discussed in more detail in the pages we have devoted to each search engine.
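Just to show the shape of the searcher's job (find candidate pages, score them, sort them), here is a deliberately naive sketch. The scoring rule, which simply counts hits for the query words, is a stand-in of my own and is not the ranking algorithm of Google or any other engine; the search function, the inverted data and the page names are all hypothetical.

```python
# Toy inverted index: word -> {page: number of hits on that page}.
inverted = {
    "search": {"a.html": 3, "b.html": 1},
    "engine": {"a.html": 1, "b.html": 2, "c.html": 4},
    "spider": {"c.html": 2},
}


def search(query, index):
    words = query.lower().split()
    if not words:
        return []
    postings = [index.get(w, {}) for w in words]
    # Candidate pages must contain every word in the query.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    # Stand-in score: total number of hits for the query words on the page.
    scored = [(sum(p[page] for p in postings), page) for page in candidates]
    # Highest score first; ties are broken alphabetically by page name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [page for _, page in scored]


results = search("search engine", inverted)
```

A real ranking component would weigh far more than raw counts, such as where on the page a hit occurs and how it is formatted, which is exactly why the hits record position, font size and capitalization in the first place.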