There are two ways to prevent Googlebot from crawling or indexing your webpages. You can either add a "Disallow" entry to your website's robots.txt file or add the following <META> tag to the pages that you don't want search engine spiders to crawl or index.
<META NAME="ROBOTS" CONTENT="NOINDEX, NOARCHIVE">
Sounds simple, but we recently came across at least two cases where Googlebot appears to ignore the META tag or the robots.txt instructions. Let's look at them briefly:
Case A: del.icio.us
Google has indexed (and cached) ~1.4 million pages from the del.icio.us website. Now pay close attention to the META tag on each del.icio.us webpage. You'll see the following text inside the HTML code of del.icio.us webpages [example]
<meta name="robots" content="noarchive,nofollow,noindex"/>
The tag clearly means that search engines are neither supposed to cache del.icio.us pages nor index them. Google is probably ignoring the META tags here.
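To check which robots directives a page actually serves, you can pull them out of the HTML yourself. Here is a minimal sketch using only Python's standard library. The class name RobotsMetaParser and the helper robots_directives are made up for this illustration, and the sample markup is the del.icio.us tag quoted above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <META NAME=...>
        # and <meta name=...> both arrive here as "meta" / "name".
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


sample = '<meta name="robots" content="noarchive,nofollow,noindex"/>'
print(sorted(robots_directives(sample)))
```

Running the same helper on the live page and on the cached copy is a quick way to see whether both carry the tag.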
Case B: Google Finance
The robots.txt file residing on www.google.com has the following instruction:
User-agent: *
Disallow: /finance
In simple English, these instructions mean that Googlebot is not supposed to index or crawl any webpage that's residing under the google.com/finance path.
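If you want to test a rule like this without waiting for a crawler, Python's standard urllib.robotparser module can evaluate it. The sketch below feeds in only the two lines quoted here, not the full google.com robots.txt. The helper name is_allowed and the sample paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the two lines quoted in the article, not the complete file.
rules = """\
User-agent: *
Disallow: /finance
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def is_allowed(path, agent="Googlebot"):
    """Ask the parser whether `agent` may fetch `path` on www.google.com."""
    return parser.can_fetch(agent, "http://www.google.com" + path)


for path in ("/finance", "/finance?q=GOOG", "/search"):
    print(path, is_allowed(path))
```

This only tells you what the rules ask a well-behaved crawler to do. It says nothing about what a given bot actually fetched, which is what the server logs are for.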
It's again very surprising to see that at least 44K pages from www.google.com/finance have been indexed and cached on Google servers. These pages also appear in organic search results.
Update: Jim Kloss reports a similar problem: Googlebot ignores their robots.txt file, though other search bots do obey the request. "We tell googlebot not to load these URL constructs but it ignores robots.txt. Nor were we able to get it to play nice via the webmaster control panel provided by Google... Our written email requests [to Google] to look into the situation were met with autoresponders."
Find this article at: http://labnol.blogspot.com/2007/01/google-spiders-sometimes-ignore-meta.html
web: http://www.labnol.org/ email: amit@labnol.org
Reader Comments
What makes you think that we see the same metas and robots.txt file that Google-Finance and del.icio.us show to the Google-Bot?
In the cached delicious pages there is no "robots" meta-tag.
Written on 12/1/07 9:50 PM
Malte, here's an example using my own del.icio.us account.
Check the HTML source code of this webpage - you'll see the meta robots noindex tag.
Now here's a copy in the Google Cache that was last accessed by Googlebot on Dec 15, 2006.
Written on 12/1/07 9:57 PM
Yup. First noticed this at http://www.webmasterworld.com/forum30/34757.htm
Why don't you ask Matt about this?
Written on 12/1/07 11:56 PM
Indeed, it's surprising... although the information that you provided is helpful and many related websites confirm that.
Written on 13/1/07 3:59 AM
I'm glad someone with a higher profile has written about this.
We noticed googlebot being a low-life, robots.txt-ignoring dolt months ago. Running a large dynamic wiki, there are thousands of URL combinations we generate that indicate "this page doesn't exist yet, but if you click this URL, we'll dump database info that will help you manually build the page." These pages take a fair amount of CPU to produce. (Tech note: action=edit is what we want to exclude. http://www.wholewheatradio.org/wiki/index.php/User:Jimkloss/sandbox/stats, if you catch it while googlebot is allowed, clearly shows it loading those pages, while http://www.wholewheatradio.org/robots.txt clearly shows it should not.)
We tell googlebot not to load these URL constructs but it ignores robots.txt. Nor were we able to get it to play nice via the webmaster control panel provided by Google.
Our only option was to cron an iptables lockout for googlebot during our peak CPU hours, when googlebot was merrily loading a CPU-bound page once every 1-5 seconds regardless of our clearly telling it not to.
From raw Apache logs, other searchbots do obey the request never to attempt loading a page with
Our written email requests to look into the situation were met with autoresponders.
Thanks for bringing some attention to it. I don't expect anything to change but at least others who may be scratching their heads going
Written on 13/1/07 9:51 AM
So then deny user agent Googlebot, there's more than one way to skin a bot...
Written on 14/1/07 7:12 AM
Please have a look at the Google Cache (view source): it does not have any meta noindex tag.
If I get time tomorrow I will dig deeper. Generally, Google is better in these cases than other search engines.
Written on 14/1/07 11:10 PM
Yes, you are right, Amit.
I am running approx. 20 sites
and got the same result.
They are ignoring meta tags since tags have been misused.
Nowadays they only focus on content.
But as for the behaviour with robots.txt,
whom to ask?
Regards
Alok Tiwari
Sorry for http://none.com
I hope you may get this.
Anyway, keep posting. I always read.
Written on 15/1/07 1:36 AM
For Google Finance:
When you go to finance.google.com, you are sent to finance.google.com/finance. When you look at robots.txt on finance.google.com, you see:
User-agent: *
Allow: /finance
Disallow: /finance/
Disallow: /
Written on 16/1/07 2:10 PM
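The four lines quoted in the comment above interact in a way that is easy to misread, so here is a small sketch of how they resolve under longest-match precedence: the most specific matching rule wins, and Allow wins a tie. That precedence is an assumption taken from Google's current robots.txt documentation and RFC 9309. The crawler's behaviour in 2007 may have differed, and wildcards are not handled. The function name longest_match_allowed and the path /finance/quote are made up for this illustration.

```python
def longest_match_allowed(path, rules):
    """Evaluate (directive, pattern) pairs with longest-match precedence.

    Assumption: plain prefix patterns only, no "*" or "$" wildcards.
    """
    best_len = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue
        is_allow = directive.lower() == "allow"
        # The longer pattern wins; on equal length, Allow wins.
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len = len(pattern)
            allowed = is_allow
    return allowed


# The rules quoted from finance.google.com/robots.txt in the comment above.
FINANCE_RULES = [
    ("allow", "/finance"),
    ("disallow", "/finance/"),
    ("disallow", "/"),
]

for path in ("/finance", "/finance/quote", "/"):
    print(path, longest_match_allowed(path, FINANCE_RULES))
```

Under that reading, only the bare /finance landing URL is open to crawlers and everything beneath it is disallowed.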
In the case where meta tags are supposedly ignored, I could not find meta tags in Google's cached copy.
Do you have an example where the metatags were in Google's cached copy as well?
Thanks.
Written on 16/1/07 2:13 PM