Jason Kottke noted a significant change in the “robots.txt” file for whitehouse.gov. “Robots.txt” is a file that tells web crawlers (Google’s indexing program, for example) what to index and what not to index.
The old whitehouse.gov site gave web crawlers 2,400 lines of instructions about what not to include. The new whitehouse.gov site gives two lines of instruction:
User-agent: *
Disallow: /includes/
This tells web crawlers not to index the “includes” directory on the site; everything else is fair game.
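To see what those two lines mean in practice, here is a minimal sketch using Python’s standard-library robots.txt parser; the specific page URLs are illustrative examples, not pages confirmed to exist on the site.

from urllib.robotparser import RobotFileParser

# Feed the new whitehouse.gov rules to the parser.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /includes/",
])

# Anything under /includes/ is off-limits; everything else may be crawled.
print(rp.can_fetch("*", "http://www.whitehouse.gov/blog/"))              # True
print(rp.can_fetch("*", "http://www.whitehouse.gov/includes/nav.html"))  # False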
Shelly says
Third-party content is also licensed under a Creative Commons Attribution license (unless otherwise specified), which I think is symbolic of the difference in the two administrations’ tones. (Obviously, original content is not protected by copyright, as it’s a work of the US government.)
Justin says
For those not aware, an include file is simply a reusable piece of code, like a navigation bar or logo, that gets used over and over on other pages in the site. Includes are mostly a matter of expedience for web developers and other behind-the-scenes wizardry (a short sketch of the idea follows this comment).
The best part is that this effectively means the entire site can be fully searched. Nothing is hidden from Google and other search engines.
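To make that concrete, here is a small, hypothetical sketch in Python (whitehouse.gov’s actual server-side setup isn’t specified here) of what an include accomplishes: a fragment written once and reused on every page.

# Hypothetical example: one shared navigation fragment, reused by every page.
NAV_INCLUDE = '<div id="nav"><a href="/">Home</a> <a href="/blog/">Blog</a></div>'

def render_page(title, body):
    # Each page pulls in the same fragment instead of duplicating the markup.
    return "<html><head><title>%s</title></head><body>%s%s</body></html>" % (title, NAV_INCLUDE, body)

print(render_page("Blog", "<p>Latest posts...</p>"))
print(render_page("About", "<p>About this site...</p>"))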
Doug says
Thanks for that information on the “include.” I ran a brief search trying to get a layman’s explanation but was unsuccessful. I suspected it was there for good reason but wouldn’t be terribly interesting to the Googling public.