Wednesday, January 21, 2009

New Obama White House web site: 'robots.txt' kerfuffle debunked

Bush-bashers scouring the new Obama-ized (Obamafied?) White House web site thought they had a great scoop. Whereas the secretive Bush White House had made liberal use of the "robots.txt" protocol, which signals to search engines that they should not crawl specified web pages, the sainted and "transparent" Obamanauts had -- in their first hours of power! -- rid WhiteHouse.gov of that pesky enemy of Internet freedom. As leading liberal blogger Ezra Klein put it:
[Jason] Kottke points out that the Bush White House site had almost 2400 lines of code barring search engines from indexing, and thus searching, the site. The new Whitehouse web site has no such lines of code. This stuff is small, yes, but it matters. It also bespeaks an administration that, at this point, doesn't think it needs to hide its words and actions from the people it governs.
Only problem is, the Kottke/Klein thesis is a bunch of hooey, as Declan McCullagh points out in CNET:

There's just one problem with these comments. They're wrong. As of Tuesday morning, the Bush administration's robots.txt file did only two things: first, it pointed search engines to the high-graphics versions of the page, as opposed to the text-only versions, and second, it tried to keep type-in-your-search-query pages from being indexed.

Those are legitimate reasons to list those pages in robots.txt, which is why CNET's own file is relatively long and complicated too. (Sites that have been around for eight years or longer tend to get that way.) We ask search engines not to index an "/Ads" directory, e-mail-this-story pages, and dozens of others. The Democrat-controlled House and Senate have--gasp!--substantial robots.txt files too.

It's true that in 2007, the Bush White House did block some files they should not have, which they fixed once I brought it to their attention. They also fixed a more serious problem with the Director of National Intelligence's Web site, and an earlier problem in 2003. (A better solution would be for search engines to ignore overly broad robots.txt files on .gov and .mil sites, including Thomas.loc.gov.)

If anything, Obama's robots.txt file is too short. It doesn't currently block search pages, meaning they'll show up on search engines--something that most site operators don't want and which runs afoul of Google's Webmaster guidelines. Those guidelines say: "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
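For readers who have never looked inside one of these files, here is a minimal sketch of what McCullagh is describing, using Python's standard urllib.robotparser module. The rules, the paths ("/search/", "/text-only/") and the example.gov host are all hypothetical, chosen only for illustration; they are not the contents of any actual White House robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT any real site's file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /text-only/
"""

# Hypothetical host, used only to build full URLs for the checks below.
BASE_URL = "https://www.example.gov"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, path: str) -> bool:
    """Ask whether a generic crawler ("*") may fetch the given path."""
    return parser.can_fetch("*", BASE_URL + path)


# A file with rules: search-results and text-only pages are off limits.
with_rules = build_parser(EXAMPLE_ROBOTS_TXT)

# A file with no rules at all: everything may be crawled.
no_rules = build_parser("")

for path in ("/search/?q=budget", "/text-only/news/", "/news/"):
    print(path,
          "| with rules:", "allowed" if is_allowed(with_rules, path) else "blocked",
          "| no rules:", "allowed" if is_allowed(no_rules, path) else "blocked")
```

Note that nothing here hides a page from a human visitor. A Disallow line is only a request to well-behaved crawlers, and the page itself stays publicly reachable at its URL.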

I can't say I'm terribly surprised that this "story" turned out to be no story at all. After all, the premise of the Bush-bashers ("Bush White House puts info on web site but then tries to make it secret") makes no sense. If they wanted to keep info secret, wouldn't they, um, just keep it off the web site in the first place? Bush Derangement Syndrome isn't going to be cured anytime soon.


 
http://copyrightsandcampaigns.blogspot.com/