Web Crawlers and Indexing

How Google’s Web Crawler Works:
Google’s web crawler, Googlebot, finds and retrieves web pages and hands them off to Google’s indexer. Googlebot works much like a web browser: it sends a request for a page to a web server, downloads the page, and then delivers it to the indexer. The crawler discovers pages in two ways: through URLs submitted directly via a URL form, and by following the links it finds while “crawling” the web. Because the work is spread across many computers requesting and fetching pages in parallel, Googlebot can request thousands of different pages simultaneously.
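To make the fetch-and-follow-links loop concrete, here is a minimal single-machine crawler sketch in Python. The breadth-first frontier, the LinkExtractor helper, and the max_pages limit are illustrative assumptions, not a description of Googlebot’s actual distributed implementation.

```python
# Minimal crawler sketch: fetch a page, collect its links, repeat breadth-first.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a downloaded page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Fetch pages breadth-first starting from a seed URL."""
    frontier = deque([seed_url])   # URLs waiting to be fetched
    seen = {seed_url}              # avoid requesting the same page twice
    pages = {}                     # url -> raw HTML, handed off to an indexer

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html

        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

A real crawler adds politeness rules (robots.txt, rate limits) and runs this loop across many machines at once, which is what lets Googlebot fetch so many pages in parallel.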
How Google Builds its Index:
After Googlebot retrieves the full text of the web pages it crawls, the pages are delivered to and stored in Google’s indexer. The index is organized alphabetically by search term, and each index entry holds the list of documents in which that term appears. This data structure gives near-instant access to the documents containing the term (or terms) that users type into the search box. To improve search performance, Google does not index common stop words such as the, is, of, how, and why, nor certain punctuation marks.
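A toy version of such an inverted index can be sketched as follows. The build_index and search functions and the small stop-word list are simplified assumptions meant only to illustrate the term-to-documents mapping, not Google’s actual index format.

```python
# Toy inverted index: each term maps to the documents that contain it,
# with common stop words left out of the index entirely.
import re
from collections import defaultdict

STOP_WORDS = {"the", "is", "of", "how", "and", "why", "a", "to", "in"}


def build_index(documents):
    """documents: dict mapping doc_id -> page text. Returns term -> sorted doc ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return doc ids containing every non-stop-word term in the query."""
    terms = [t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in STOP_WORDS]
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = {
    1: "Web crawlers fetch pages and follow links.",
    2: "The indexer stores each term with the documents that contain it.",
}
index = build_index(docs)
print(search(index, "crawlers links"))   # -> [1]
```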
Google’s Algorithms:
Googlebot uses an algorithmic process to decide what to crawl: computer programs determine which sites to visit, how often, and how many pages to fetch from each site. PageRank is one of the methods Google uses to determine a web page’s relevance. PageRank is a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at any particular page, expressed as a numeric value between 0 and 1. For example, a PageRank of 0.7 means there is a 70% chance that such a random surfer will end up at the document with that PageRank. The PageRank of a page is defined “recursively and depends on the number and PageRank metric of all pages that link to it” (“PageRank.” Wikipedia, The Free Encyclopedia).
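The random-surfer idea can be illustrated with a short power-iteration sketch. The pagerank function below, the 0.85 damping factor, the iteration count, and the three-page example graph are assumptions chosen for illustration rather than Google’s production algorithm.

```python
# Power-iteration sketch of PageRank: rank flows along links each round,
# and the scores form a probability distribution over all pages.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to.
    Returns page -> score, where scores over all pages sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] = new_rank.get(target, 0.0) + share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
```

Because each page’s score depends on the scores of the pages linking to it, the computation is repeated until the values settle, which is the recursive definition the quotation above refers to.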