There are two methods to prevent Googlebot from crawling or indexing your webpages.
You can either add a "disallow" entry in the robots.txt file of your website or simply add the following <META> tag inside webpages that you don't want search engine spiders to crawl or index.
<META NAME="ROBOTS" CONTENT="NOINDEX, NOARCHIVE">
Sounds simple, but we recently came across at least two different cases where Googlebot is ignoring the META tag or robots.txt instructions. Let's look at them briefly here:
Case A: del.icio.us
Google has indexed (and cached) ~1.4 million pages from the del.icio.us website. Now pay close attention to the META tag on each of the del.icio.us webpages. You'll see the following text inside the HTML code of del.icio.us webpages [example]:
<meta name="robots" content="noarchive,nofollow,noindex"/>
The tag clearly means that search engines are neither supposed to cache del.icio.us pages nor index them. Google is probably ignoring the META tags here.
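To check what a page is actually telling search engines, you can pull the robots directives out of its HTML. Here is a minimal sketch using only Python's standard library. The names RobotsMetaParser and robots_directives are made up for illustration, and the sketch only reads the tag; it says nothing about how any particular crawler interprets it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper class, written for this example only.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    tag = '<meta name="robots" content="noarchive,nofollow,noindex"/>'
    print(sorted(robots_directives(tag)))
```

If "noindex" shows up in the result, the page is asking not to be indexed, and "noarchive" asks that no cached copy be kept.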
Case B: Google Finance
The robots.txt file residing on www.google.com has the following instruction:
User-agent: *
Disallow: /finance
In simple English, these instructions mean that Googlebot is not supposed to index or crawl any webpage that's residing under the google.com/finance path.
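You can test how these two lines apply to a given URL with Python's standard urllib.robotparser module. The sketch below feeds it only the two-line excerpt quoted above, not the full robots.txt from www.google.com, and the /finance/some-page and /some-other-page paths are hypothetical URLs chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the two-line excerpt quoted in this post, not the complete file.
RULES = """\
User-agent: *
Disallow: /finance
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(user_agent, url):
    """Return True if the excerpted rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    for url in (
        "http://www.google.com/finance",
        "http://www.google.com/finance/some-page",
        "http://www.google.com/some-other-page",
    ):
        print(url, is_allowed("Googlebot", url))
```

Because the rule is written for "User-agent: *", it applies to Googlebot as well as every other crawler, and the Disallow value is matched as a path prefix, so everything under /finance is covered.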
It's again very surprising to see that at least 44K pages from www.google.com/finance have been indexed and cached on Google servers. These pages also appear in organic search results.
Related: Google Finance: Guess the Date Contest
Update: Jim Kloss shares a similar problem with Googlebot ignoring their robots.txt file, though other search bots do obey the request. "We tell googlebot not to load these URL constructs but it ignores robots.txt. Nor were we able to get it to play nice via the webmaster control panel provided by Google...Our written email requests [to Google] to look into the situation were met with autoresponders."