In a recent discussion at work, some of my colleagues mentioned that they believed Google was ignoring robots.txt and noindex rules and showing blocked content anyway, in violation of its own documentation and the robots.txt standard. Let’s do some science to find out whether this is the case!
Google uses two terms which at a glance have seemingly overlapping definitions:
- Crawling: the process by which pages are accessed and processed for links and content.
- Indexing: including the page in search results.
Compounding this issue is Google’s documentation, which states that the URL can still be indexed even though Google’s been disallowed from accessing it:
A robotted page can still be indexed if linked to from other sites.
While Google won’t crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).1
Although the robots.txt file is supposed to prevent the content from being crawled, according to this policy the URL could still be “indexed” if linked from another page. But what does that mean, exactly?
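To make the distinction concrete, here is a hypothetical robots.txt fragment (the paths are placeholders, not the experiment’s real URLs). It blocks crawling, i.e. fetching the pages, but says nothing about whether the URLs themselves may appear in an index:

```
# Hypothetical robots.txt: blocks crawling (fetching) of these pages,
# but does not by itself prevent the bare URLs from being indexed.
User-agent: *
Disallow: /sample-story-3/
Disallow: /sample-story-6/
```

To block indexing instead, a page has to be crawlable and serve `<meta name="robots" content="noindex">` or an `X-Robots-Tag: noindex` response header. Notably, combining noindex with a robots.txt disallow is self-defeating: a compliant crawler can never fetch the page, so it never sees the noindex directive.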
Google’s Matt Cutts attempted to clarify the policy:
In the early days, lots of very popular websites didn’t want to be crawled at all. For example, eBay and the New York Times did not allow any search engine, or at least not Google to crawl any pages from it. The Library of Congress had various sections that said you are not allowed to crawl with a search engine. And so, when someone came to Google and they typed in eBay, and we haven’t crawled eBay, and we couldn’t return eBay, we looked kind of suboptimal. So, the compromise that we decided to come up with was, we wouldn’t crawl you from robots.txt, but we could return that URL reference that we saw.2
Taking this into account, along with previous experience, I’ve interpreted Google’s policy to mean that Google will not access (crawl) robotted pages, but will include them in the index if they are aware that they exist. Since Google isn’t technically allowed to read these pages, they index them with only the URL and a message stating that the content is blocked by robots.txt.
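The crawler’s side of this interpretation can be sketched with Python’s standard `urllib.robotparser`. A polite bot checks the rules before fetching a page, but the check governs fetching only; nothing in robots.txt stops the bot from remembering that a URL exists. The paths here are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (normally it would be fetched
# via set_url() and read()).
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /sample-story-3/",
])

# A polite crawler checks before fetching the page content...
print(rules.can_fetch("Googlebot", "/sample-story-3/"))  # False: may not crawl
print(rules.can_fetch("Googlebot", "/sample-story-1/"))  # True: may crawl

# ...but a crawler that saw a link to /sample-story-3/ elsewhere still
# knows the URL and, per Google's stated policy, may index the bare URL
# without ever reading its content.
```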
In addition, when working with a CMS or existing site, there’s always a possibility that a piece of content wasn’t properly excluded, or that the page was initially published and crawled before noindex or a robots.txt disallow was set on it.
For this experiment I’m going to set up the following pages, all published at the exact same time:
- Sample story #1, excluded from sitemap, dofollow and doindex3 and allowed by robots.txt as a control
- Sample story #2, excluded from sitemap, with noindex, nofollow meta tags set
- Sample story #3, excluded from sitemap, blocked by robots.txt
- Sample story #4, included in sitemap, dofollow and doindex and allowed by robots.txt
- Sample story #5, included in sitemap, with noindex, nofollow meta tags set
- Sample story #6, included in sitemap, blocked by robots.txt
- All six posts are linked from this page, though only #4–#6 are included in the sitemap
- Access logs are enabled for all six pages, as well as for robots.txt, so we can determine conclusively whether Google accesses a blocked page, and when it checks robots.txt relative to each page access, to see whether it follows its stated policy.
- To avoid duplicate content, bias, and dummy content, all experimental pages were machine-generated at Talk to Transformer using the “sci-fi” prompt, keeping them at least vaguely on topic.
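Checking the access logs for crawler hits takes only a few lines of Python. This is a sketch assuming a combined-format log; the log lines, IP, and paths below are made up for illustration, and real verification would also confirm the IP belongs to Google, since the user-agent string alone is trivially spoofed.

```python
import re

# Hypothetical combined-log-format lines, one per request.
log_lines = [
    '66.249.66.1 - - [10/Nov/2019:12:00:01 +0000] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Nov/2019:12:00:05 +0000] "GET /sample-story-3/ HTTP/1.1" '
    '200 4096 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Pull out the timestamp and request path of every Googlebot hit, so we
# can see whether (and when) a robots.txt-blocked page was fetched.
pattern = re.compile(r'\[([^\]]+)\] "GET ([^ ]+)')
for line in log_lines:
    if "Googlebot" in line:
        match = pattern.search(line)
        if match:
            timestamp, path = match.groups()
            print(timestamp, path)
```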
All of the URLs are public, and can be independently verified.
Conclusions will be posted after the experiment has run long enough for Google to crawl the new pages, and updated if at some point it’s discovered that Google is crawling the blocked pages.