robots.txt File Important to Search Engine Crawlers!

by Mike Valentine on August 15, 2005

The robots.txt file is a web standard that well-behaved web crawlers/robots consult to learn what files and directories to stay OUT of on your site. Compliance is voluntary: not all crawlers/bots follow the exclusion standard, and some will continue crawling your site anyway. I like to call them “Bad Bots.” We block them by IP exclusion, which is another story entirely. 😉

The file should be at the root of the domain, not in some secondary directory, because the root is where the crawlers expect it to be.

Below is the proper format for a robots.txt file:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: msnbot
Crawl-delay: 10

User-agent: psbot
Disallow: /

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

(End of robots.txt file)

The above is what we currently use at the Publish101.com Web Content Distributor site, which launched in March of 2005. We did an extensive case study and published a series of articles on crawler behavior and on the indexing delays known as the Google Sandbox. That Google Sandbox Case Study is highly instructive on many levels for webmasters everywhere and, although I’m biased 😉, I highly recommend that you consider reading it in full. All of the articles are linked from the case study, so you can read the entire series.

One thing we didn’t expect to glean from our research into indexing delays was how important robots.txt files are to quick and efficient crawling by the spiders from the major search engines. We were also surprised by the number of heavy crawls from bots that do no earthly good to the site owner, yet crawl most sites extensively, straining servers to the breaking point with requests for pages coming as fast as 7 pages per second.

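The “User-agent: msnbot” entry is for the crawler from MSN, Googlebot (not listed in our file) is obviously from Google, Slurp is from Yahoo and Teoma is from AskJeeves. The others listed (aipbot, BecomeBot and psbot) are “Bad” bots that crawl very fast and to nobody’s benefit but their own, so we ask them to stay out entirely. The * asterisk is a wild card that means “All” crawlers/spiders/bots should stay out of the files or directories listed in that group. Strictly speaking, the * group is the default for any crawler that has no group of its own, so a bot named in its own User-agent group follows only that group.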

The bots that are given “Disallow: /” should stay out entirely, and those with “Crawl-delay: 10” are the ones that crawled our site too quickly, causing it to bog down and overuse server resources. Google crawls more slowly than the others and doesn’t require that instruction. The Crawl-delay instruction is only needed on very large sites with hundreds or thousands of pages.
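To check how a rule-following crawler might read the file above, here is a minimal sketch using Python's standard urllib.robotparser module. The sample paths (/cgi-bin/search.cgi and /articles/page.html) are made up for illustration, and real crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: msnbot
Crawl-delay: 10

User-agent: psbot
Disallow: /

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical paths, used only to exercise the rules.
for agent in ("Googlebot", "msnbot", "psbot"):
    for path in ("/cgi-bin/search.cgi", "/articles/page.html"):
        allowed = parser.can_fetch(agent, path)
        print(f"{agent:10} {path:22} allowed={allowed}")

# Crawl-delay is reported per user agent (None when no delay is set).
for agent in ("msnbot", "Googlebot"):
    print(f"{agent:10} crawl-delay={parser.crawl_delay(agent)}")
```

In this parser, Googlebot falls under the * group, psbot is shut out everywhere, and msnbot is matched only against its own group, so the Disallow lines under * are not applied to it. If you want a named bot kept out of those directories too, repeat the Disallow lines inside its group.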

The bots we gave a Crawl-delay were requesting as many as 7 pages per second, so we asked them to slow down. The number you see is in seconds, and you can change it to suit your server capacity and their crawling rate. You can discover how fast they are crawling by looking at your raw server logs, which show pages requested by precise times to within a hundredth of a second. The logs are available from your web host (or ask your web or IT person).
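As a rough illustration of measuring crawl rate from raw logs, the sketch below counts requests per second for one bot. It assumes the common Apache "combined" log format with one-second timestamps, which may not match your host's logs, and the sample lines are invented for the example. The function name peak_requests_per_second is hypothetical.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, e.g.
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def peak_requests_per_second(lines, agent_substring):
    """Return the most requests seen in any single second from user
    agents whose name contains agent_substring (case-insensitive)."""
    per_second = Counter()
    needle = agent_substring.lower()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("agent").lower():
            per_second[match.group("time")] += 1
    return max(per_second.values(), default=0)

# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [15/Aug/2005:10:00:01 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "msnbot/1.0"',
    '192.0.2.10 - - [15/Aug/2005:10:00:01 -0700] "GET /b.html HTTP/1.1" 200 734 "-" "msnbot/1.0"',
    '192.0.2.10 - - [15/Aug/2005:10:00:01 -0700] "GET /c.html HTTP/1.1" 200 289 "-" "msnbot/1.0"',
    '192.0.2.10 - - [15/Aug/2005:10:00:02 -0700] "GET /d.html HTTP/1.1" 200 611 "-" "msnbot/1.0"',
    '192.0.2.20 - - [15/Aug/2005:10:00:01 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]

print("msnbot peak per second:", peak_requests_per_second(SAMPLE_LOG, "msnbot"))
```

If the peak is higher than your server handles comfortably, that bot is a candidate for a Crawl-delay line.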

This file is saved as a plain text document, ALWAYS with the name “robots.txt”, in the root of your domain. If you have server access, your server logs can often be found in that same root directory, and you can usually download them as compressed files, one per calendar day, right off your server. You’ll need a utility that can expand .gz files to open and read those plain text files.
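If you would rather not expand the .gz files by hand, Python's standard gzip module can read them directly. This is a small sketch; the helper name read_log_lines and the file name in the usage note are placeholders, not anything your host provides.

```python
import gzip

def read_log_lines(path):
    """Yield the lines of a log file, reading .gz files without
    unpacking them first. Undecodable bytes are replaced, not fatal."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For example, `for line in read_log_lines("access.log.gz"): print(line)` would print one day's log, assuming a file by that name sits in the current directory.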

To see the contents of any robots.txt file, just type /robots.txt after any domain name (for example, amazon.com/robots.txt). If the site has that file up, you will see it displayed in your browser. Click on the link below to see that file for Amazon.com.

Amazon.com robots.txt

You can see the contents of any website’s robots.txt file that way.
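The same lookup can be scripted. The sketch below builds the robots.txt address from a domain name or from any page URL on the site; both function names are hypothetical, and fetch_robots needs network access. Some sites may refuse requests that come from scripts rather than browsers.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

def robots_url(site):
    """Build the robots.txt URL for a bare domain name or for any
    URL on that site. The file always lives at the root."""
    if "://" not in site:
        site = "http://" + site  # assume plain http for bare domains
    parts = urlsplit(site)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

def fetch_robots(site, timeout=10):
    """Download the robots.txt text for a site (needs network access)."""
    with urlopen(robots_url(site), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

print(robots_url("amazon.com"))
```

Calling `fetch_robots("amazon.com")` would then return the file's text, provided the site answers the request.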

We discovered in our launch of the new site that Google and Yahoo will crawl the site whether or not you use a robots.txt file, but MSN seems to require it before they will begin crawling at all. All of the search engine robots seem to request the file on a semi-regular basis to verify it hasn’t changed.

Sometimes when you DO change it, they will stop crawling for a week or so and repeatedly ask for that robots.txt file during that time without crawling any additional pages. (Perhaps they had a list of pages to visit that included the directory or files you have instructed them to stay out of and must adjust their crawling schedule to eliminate those files from their list.)

Most webmasters instruct the bots to stay out of “image” directories and the “cgi-bin” directory, as well as any directories containing private or proprietary files intended only for users of an intranet or of password-protected sections of your site.
