White House Deflects Search Engines

Slashdot is reporting that the White House web site lists a large number of directories containing the word “Iraq” in its robots.txt file. (For the less technically inclined, a Disallow entry in the robots.txt file is a way of asking a polite web bot, like the one Google runs, to ignore certain parts of a web site.)

Text searches for “robots,” “bots,” and “crawlers” returned no results, and a search for “search” returned nothing relevant. I couldn’t find any official explanation for the very odd robots.txt file, a small part of which I’ve quoted here:

Disallow:	/911/response/iraq
Disallow:	/911/response/text
Disallow:	/911/sept112002/iraq
Disallow:	/911/sept112002/text
Disallow:	/911/text
Disallow:	/afac/index.htm/text
Disallow:	/afac/iraq
Disallow:	/afac/text

As you can see, most of the Disallows come in pairs: an /iraq line and a /text line for the same directory. The /text lines make sense, if you consider that the site wants most people to find its fancy snazzy graphical pages first. The purpose of the /iraq lines is less obvious, since most of those directories don’t exist or don’t have an index.html page.
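For anyone curious how a polite bot would actually read those lines, here is a minimal sketch using Python's standard urllib.robotparser module. The "User-agent: *" line and the bot name "ExampleBot" are my assumptions; the excerpt I quoted doesn't include a User-agent line, and a real crawler would fetch the whole file rather than eight lines of it.

```python
from urllib.robotparser import RobotFileParser

# The eight Disallow lines quoted above. The "User-agent: *" header is an
# assumption added so the parser has a group to attach the rules to.
EXCERPT = """\
User-agent: *
Disallow: /911/response/iraq
Disallow: /911/response/text
Disallow: /911/sept112002/iraq
Disallow: /911/sept112002/text
Disallow: /911/text
Disallow: /afac/index.htm/text
Disallow: /afac/iraq
Disallow: /afac/text
"""

parser = RobotFileParser()
parser.parse(EXCERPT.splitlines())

# "ExampleBot" is a made-up crawler name used only for illustration.
for path in ["/911/response/iraq", "/911/response/text",
             "/911/response/", "/afac/index.htm"]:
    allowed = parser.can_fetch("ExampleBot", path)
    print(path, "allowed" if allowed else "disallowed")
```

The first two paths come back disallowed and the last two allowed, because each rule is a simple prefix match on the path. Nothing here stops a bot that chooses to ignore the file; robots.txt is a request, not a lock.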

A number of theories have been put forth regarding this. Most agree that the cause is a badly coded script designed to generate the robots.txt file, but there’s disagreement as to why there was any attempt to impede the retrieval of information on Iraq in the first place. You can see a wide range of opinions at the /. post.

Personally, I’d like to hear Dean and Josh’s take on this.

UPDATE 10/27 10:22 PM – A comment by mlc buried deep in the thread links to a plausible explanation of what’s going on. Basically, the huge robots.txt file is designed to prevent spiders from crawling different templates, all containing the same content.
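To make that explanation concrete, here is a purely illustrative sketch of how a generator script could end up producing a file like this. I have not seen the real script, so everything in it is hypothetical: the SECTIONS and TEMPLATES lists, the build_robots function, and the idea that "iraq" and "text" are both template names. The point is only that blindly pairing every section with every template would emit /iraq lines even for directories that don't exist.

```python
from itertools import product

# Hypothetical inputs: a few section paths taken from the excerpt above,
# and the two suffixes that keep recurring in the Disallow lines.
SECTIONS = ["/911/response", "/911/sept112002", "/afac"]
TEMPLATES = ["iraq", "text"]


def build_robots(sections, templates):
    """Emit one Disallow line per (section, template) pair.

    No check is made that the resulting path exists on the server,
    which is how entries for nonexistent directories could appear.
    """
    lines = ["User-agent: *"]
    for section, template in product(sections, templates):
        lines.append(f"Disallow: {section}/{template}")
    return "\n".join(lines) + "\n"


print(build_robots(SECTIONS, TEMPLATES), end="")
```

Run against those three sections, this prints six Disallow lines that match six of the eight I quoted earlier, which is at least consistent with the sloppy-script theory.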

While this doesn’t convince me that the Bush administration is innocent of cover-ups, I think the /. community blew this one way out of proportion.