Discovered ... by Accident (and, yes, it's discoverable)

In the web world, we are all familiar with search engines; they are indispensable for locating information. But those same engines can also be a pitfall, exposing content a firm never intended to make discoverable.

Search engines (e.g., Google) rely on "crawlers" that browse and index the web pages and files posted to the Internet, past and present. So what happens if a firm posts information on the web but later wants to block it from public view? Enter robots.txt.

Simply put, robots.txt is a text file placed at the root of a website's web server to tell "well-behaved" crawlers (or robots; Google's crawler is aptly named Googlebot) which pages they may and may not index.
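
For illustration, here is what a minimal robots.txt might look like (the paths below are hypothetical):

    User-agent: *
    Disallow: /internal/
    Disallow: /drafts/confidential-memo.pdf

The User-agent line says which crawlers the rules apply to (here, all of them), and each Disallow line asks those crawlers not to visit the named path. Note that this is a request, not a lock: nothing technically prevents access to the listed files.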

However, there are two ways web pages and files can be exposed despite robots.txt. First, since robots.txt is merely a voluntary protocol, a "not so well-behaved" crawler can simply ignore it and index everything posted on a site. Second, if website A posts a file and website B links to it, robots.txt provides little defense: crawlers will find the link on website B and can surface the file's address in search results, even if website A blocks the file in its own robots.txt or removes the link from its own pages.
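
To make the distinction concrete, here is a rough sketch in Python of the check a well-behaved crawler performs before fetching a page, using the standard library's robotparser module (example.com and the file path are placeholders):

    from urllib import robotparser

    # A well-behaved crawler first reads the site's robots.txt ...
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()

    # ... and only fetches a page if the rules allow it.
    url = "https://www.example.com/drafts/confidential-memo.pdf"
    if rp.can_fetch("MyCrawler", url):
        print("Allowed to fetch:", url)
    else:
        print("robots.txt asks us not to fetch:", url)

A "not so well-behaved" crawler simply skips this check and fetches the page anyway, which is why robots.txt should never be treated as a security measure.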

To avoid having questionable pages and files discovered by accident, it is best not to post them in the first place. In addition, in-house counsel would be well advised to work with their IT department to review what is on the firm's web servers.
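
As a starting point for such a review, even a simple inventory script can reveal forgotten files sitting under the web root. A minimal sketch, assuming a typical web root of /var/www/html (adjust for your environment):

    import os

    # Walk the web root and list every file a crawler could potentially reach.
    WEB_ROOT = "/var/www/html"  # hypothetical; ask IT for the actual path
    for dirpath, dirnames, filenames in os.walk(WEB_ROOT):
        for name in filenames:
            print(os.path.join(dirpath, name))

Anything on that list that should not be public is a candidate for removal, regardless of what robots.txt says.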

