White House Deflects Search Engines

Slashdot is reporting that the White House web site lists a large number of directories containing the word “Iraq” in its robots.txt file. (For those less technically inclined, listing a directory in the robots.txt file is a way of asking a polite web bot, like the one Google runs, to ignore that part of a web site.)

Text searches for “robots,” “bots,” and “crawlers” returned no results; a search for “search” didn’t return any relevant results. I couldn’t find any other official explanation for the very odd robots.txt file, a small part of which I’ve quoted here:

Disallow:	/911/response/iraq
Disallow:	/911/response/text
Disallow:	/911/sept112002/iraq
Disallow:	/911/sept112002/text
Disallow:	/911/text
Disallow:	/afac/index.htm/text
Disallow:	/afac/iraq
Disallow:	/afac/text

As you can see, there appears to be a combination of /iraq and /text lines for most of the Disallows. The /text lines make sense, if you consider that the site wants most people to find its fancy snazzy graphical pages first. The purpose of the /iraq lines is less obvious, since most of those directories don’t exist or don’t have an index.html page.
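To illustrate how a polite crawler would read lines like the ones quoted above, here is a minimal sketch using Python's standard `urllib.robotparser`. The `User-agent: *` line and the bot name `ExampleBot` are assumptions added for the example (the excerpt above does not show which agents the rules apply to), and only a few of the quoted Disallow lines are included.

```python
from urllib.robotparser import RobotFileParser

# A few of the Disallow lines quoted above. The "User-agent: *" line is an
# assumption: the excerpt does not show which bots the rules target.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /911/response/iraq",
    "Disallow: /911/response/text",
    "Disallow: /afac/iraq",
    "Disallow: /afac/text",
]


def is_allowed(path, agent="ExampleBot"):
    """Return True if a well-behaved bot may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/afac/iraq/index.html", "/afac/text/", "/afac/index.htm"):
        verdict = "allowed" if is_allowed(path) else "disallowed"
        print(path, "->", verdict)
```

Note that the rules are only a request: nothing stops a bot, or a curious human, from reading the file and visiting the listed paths anyway.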

A number of theories have been put forth regarding this. Most agree that the cause is a badly coded script that generates the robots.txt file, but there’s disagreement as to why there was any attempt to impede the retrieval of information on Iraq in the first place. You can see a wide range of opinions at the /. post.

Personally, I’d like to hear Dean and Josh’s take on this.

UPDATE 10/27 10:22 PM – A comment by mlc buried deep in the thread links to a plausible explanation of what’s going on. Basically, the huge robots.txt file is designed to prevent spiders from crawling different templates, all containing the same content.
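As a rough sketch of how such a file could come about, the snippet below appends every template suffix to every content directory, whether or not that combination actually exists on the site. This is purely hypothetical: the function name, the suffix list, and the approach are assumptions for illustration, not the White House's actual script.

```python
# Hypothetical template names, taken from the /iraq and /text pattern
# visible in the robots.txt excerpt.
TEMPLATE_SUFFIXES = ("iraq", "text")


def build_disallow_lines(directories, suffixes=TEMPLATE_SUFFIXES):
    """Emit one Disallow line per (directory, template suffix) pair.

    No check is made that the resulting path exists, which is how a
    line such as /kids/barney/iraq could end up in the file.
    """
    lines = []
    for directory in sorted(directories):
        base = directory.rstrip("/")
        for suffix in suffixes:
            lines.append(f"Disallow: {base}/{suffix}")
    return lines


if __name__ == "__main__":
    for line in build_disallow_lines(["/afac", "/911/response"]):
        print(line)
```

A generator like this produces exactly the paired /iraq and /text entries seen in the excerpt, including pairs for directories that never had an Iraq-themed version.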

While this doesn’t convince me that the Bush administration is innocent of cover-ups, I think the /. community blew this one way out of proportion.

1 thought on “White House Deflects Search Engines”

  1. Allow me to enter into evidence exhibit #1 on line 463 of said file:

    Disallow: /infocus/medicare/iraq

    Now would someone please explain to me the linkage between Medicare and Iraq? Or how about this beauty from line 527:

    Disallow: /kids/barney/iraq

    Tempting as it may be to link up Barney with Saddam or the CIA, the fact is, the two have nothing in common.

    In fact, if you look, you’ll see what looks to this 20-year programming veteran like a very sloppy search-and-replace job by a typical government employee. I mean, look at the pattern: every subdirectory is suffixed with either {whatever}/text or {whatever}/iraq.

    Here’s the reality. The whitehouse has three websites … two of which should be defined by subdomains, but because we’re dealing with government employees, you have http://www.whitehouse.gov/text instead of http://text.whitehouse.gov and http://www.whitehouse.gov/iraq instead of http://iraq.whitehouse.gov .

    For a good example of how it should be done, check out how I differentiate the youth, music and main ministry at redlandbaptist.org using subdomains.

    Anyway, the point is, you don’t use robots.txt to hide stuff. In fact, by putting disallows in robots.txt, you ensure people are going to know about it. Instead, you put stuff in robots.txt to save your server’s bandwidth, which gets chewed up pretty quickly by stupid spiders generating 404s looking for http://www.whitehouse.gov/history/photoessays/westwing/iraq .

    In other words, it sounds like they moved or deleted stuff and were telling the spiders to stop looking there … which is exactly what services like Google say you should do …

    …but if I really, truly wanted to hide something … I would use mod_rewrite in a variety of insidious ways that get the same effect yet are invisible to snooping eyes …

    … similarly, if these are pages that are moved, then had the webmaster at the whitehouse visited healyourchurchwebsite, he would have seen I have an article on how to redirect old articles to new urls.
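The idea of redirecting old URLs to new ones, rather than letting spiders hit 404s, can be sketched in a few lines. This example uses Python's standard `http.server` in place of the mod_rewrite approach mentioned above, and the paths in `MOVED` are invented for illustration; none of it is taken from the article the comment refers to.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical old-to-new mapping; these paths are made up for the example.
MOVED = {
    "/old/article.html": "/new/article.html",
}


def redirect_target(path, moved=MOVED):
    """Return the new location for a moved path, or None if it is unknown."""
    # Ignore any query string when looking up the path.
    return moved.get(path.split("?", 1)[0])


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer moved paths with a permanent redirect and the rest with 404."""

    def do_GET(self):
        target = redirect_target(self.path)
        if target is None:
            self.send_error(404)
            return
        self.send_response(301)  # 301 = moved permanently
        self.send_header("Location", target)
        self.end_headers()
```

A 301 response tells a well-behaved spider where the content now lives, so it can update its index instead of retrying the dead URL.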
