Google Spiders Ignore Meta Tags & Robots.txt During Crawl

There are two ways to prevent Googlebot from crawling or indexing your web pages.

You can either add a "disallow" entry to your website's robots.txt file, or add the following <META> tag to any web page that you don't want search engine spiders to crawl or index.

<META NAME="ROBOTS" CONTENT="NOINDEX, NOARCHIVE">
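
A well-behaved crawler is expected to read that tag before deciding whether to index or archive a page. Here's a minimal sketch of that check in Python, using the standard html.parser module; it isn't Google's actual crawler code, just an illustration of how the directives are meant to be interpreted.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the directives found in any <meta name="robots"> tag.
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives |= {d.strip().lower() for d in content.split(",")}

def may_index(html):
    # A compliant spider skips indexing when the page says "noindex".
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOARCHIVE"></head></html>'
print(may_index(page))  # False: this page should stay out of the index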

Sounds simple, but we recently came across at least two cases where Googlebot ignores these META tag or robots.txt instructions. Let's look at them briefly:

Case A: del.icio.us

Google has indexed (and cached) roughly 1.4 million pages from the del.icio.us website. Now pay close attention to the META tag on each of those pages. You'll see the following line inside the HTML code of del.icio.us pages [example]:
<meta name="robots" content="noarchive,nofollow,noindex"/>
The tag clearly means that search engines are supposed to neither cache nor index del.icio.us pages. Google appears to be ignoring the META tag here.

Case B: Google Finance

The robots.txt file residing on www.google.com has the following instruction:
User-agent: *
Disallow: /finance
In plain English, this rule tells Googlebot (along with every other crawler) that it is not supposed to crawl, or index the content of, any web page under the google.com/finance path.
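
For reference, here is a short sketch of how a rule-abiding crawler would interpret that Disallow line, using Python's standard urllib.robotparser module. The robots.txt content is embedded inline for illustration rather than fetched from google.com.

from urllib.robotparser import RobotFileParser

# The same two lines that appear in www.google.com/robots.txt.
rules = """
User-agent: *
Disallow: /finance
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Any user agent, Googlebot included, should skip URLs under /finance.
print(parser.can_fetch("Googlebot", "http://www.google.com/finance"))         # False
print(parser.can_fetch("Googlebot", "http://www.google.com/finance?q=GOOG"))  # False
print(parser.can_fetch("Googlebot", "http://www.google.com/"))                # True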

It's therefore quite surprising that at least 44,000 pages from www.google.com/finance have been indexed and cached on Google's servers. These pages also appear in organic search results.

Related: Google Finance: Guess the Date Contest

Update: Jim Kloss reports a similar problem with Googlebot ignoring their robots.txt file, even though other search bots obey the request: "We tell googlebot not to load these URL constructs but it ignores robots.txt. Nor were we able to get it to play nice via the webmaster control panel provided by Google...Our written email requests [to Google] to look into the situation were met with autoresponders."