Search Engine Spiders
Search engines rely heavily on automated programs called spiders, crawlers, robots or simply bots to build their huge indexes of web pages. The spider is the search engine's window on the web: it continually seeks out content on the Internet and sends what it finds back to the search engine, which then compiles, analyzes and ranks that data. That is the spider's job, pure and simple.
Spiders are simply programs run from a server connected to the web, most often with many instances of the same program running in parallel. They are not very advanced programs; they are optimized for speed and reliability, and they simply go where they are told to go and report back to the search engine what they find there. There seems to be a notion that spiders are intelligent beings responsible for many things, but in fact spiders are quite limited. They can't click the buttons on your site, they can't navigate many kinds of drop-down menus, they can't run a search on your website to find content, they can't for the most part read JavaScript, they can't read forms, and they are often stopped dead in their tracks by dynamic URLs.
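To make those limits concrete, here is a minimal sketch of how a very simple spider might collect links from a page. It is an illustration only, not how any particular search engine works: the `LinkCollector` class, the `extract_links` helper and the sample `PAGE` markup are all hypothetical names invented for this example. It uses Python's standard `html.parser` module and looks only at plain `<a href>` links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from plain <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only ordinary anchors count; buttons, forms and scripts are ignored.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the list of link targets a simple spider would see."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Sample page (made up for this sketch): one plain link, plus navigation
# that depends on a button click, a search form, and JavaScript.
PAGE = """
<a href="/about.html">About</a>
<button onclick="location.href='/pricing.html'">Pricing</button>
<form action="/search"><input name="q"></form>
<script>document.write('<a href="/hidden.html">Hidden</a>');</script>
"""

print(extract_links(PAGE))
```

Under these assumptions only `/about.html` is collected. The pricing page sits behind a button click, the search results sit behind a form, and the last link exists only after a script runs, so a spider that reads the raw HTML never learns about any of them.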
Spider myths
- Spiders search out and find information in a random fashion
- Spiders parse content and store it in the search engines
- Spiders are somehow responsible for rankings or lack of rankings
- Spiders will not read HTML that does not validate
Spider facts
- Spiders' primary targets are HTML pages, though many bots can now read other forms of text.
- Spiders usually get a series of URLs to visit from the search engines and simply record what they find there and report it back to the search engines.
- Spiders are built to operate on the real web and to overcome most obstacles that the engineers who built them can envision.
Spider tracks
When a spider reports for work, its routine may go something like this:
First he will visit the URL server and get a list of places to visit.
Next he will visit those URLs, first checking the robots.txt file (if there is one) to see if he is allowed to spider that page or site.
Then, if he is allowed to spider, he will send out what is called an If-Modified-Since GET, which in essence asks the web server whether such and such a file has been modified since a certain date (normally the date a spider last visited that URL). If the file has not changed since that date, the server will send back a simple 304 (Not Modified) response, which the spider will note before moving on to the next target. If the file has been modified, the server will serve the page to the spider, who will then send it back to the search engine and go on to his next target.
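The routine above can be sketched in a few lines of Python using only the standard library. This is a rough illustration under stated assumptions, not the code any real spider runs: the bot name `ExampleSpider` and the function names `allowed_by_robots`, `conditional_request` and `fetch_if_changed` are made up for this example, and the step of getting a URL list from the URL server is left out.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib import robotparser
from urllib.error import HTTPError
from urllib.request import Request, urlopen

USER_AGENT = "ExampleSpider"  # hypothetical bot name, assumed for this sketch


def allowed_by_robots(robots_txt, url):
    """Return True if the given robots.txt text lets USER_AGENT fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def conditional_request(url, last_visit):
    """Build a GET request carrying an If-Modified-Since header."""
    # HTTP dates are expressed in GMT, so convert the timestamp first.
    stamp = format_datetime(last_visit.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"User-Agent": USER_AGENT,
                                 "If-Modified-Since": stamp})


def fetch_if_changed(url, last_visit):
    """Return the page body, or None when the server answers 304."""
    try:
        with urlopen(conditional_request(url, last_visit), timeout=10) as resp:
            return resp.read()  # page changed: hand it back for indexing
    except HTTPError as err:
        if err.code == 304:  # Not Modified since last_visit
            return None
        raise
```

A spider built this way would call `allowed_by_robots` with the site's robots.txt contents before anything else, and call `fetch_if_changed` only for URLs that pass. A return value of `None` means the copy already held by the search engine is still current, which saves bandwidth for both the spider and the web server.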
The good thing about web spiders is that they don't clutter the place up with webs and they don't bite.
