Stop Using Robots.txt Noindex By September
Let’s all get prepared ahead of time, ready for the 1st of September, when Google will officially stop supporting noindex in the robots.txt directory.
Over the past 25 years, the unofficial standard of using robots.txt files, to make crawling an easier process, has been widely used on sites across the internet. Despite never being officially introduced as a web standard, Googlebot tends to follow robots.txt to decipher whether to crawl and index a site’s pages or images, to avoid following links and whether or not to show cached versions.
It’s important to note, robots.txt files can only be viewed as a guide and don’t completely block spiders from following requests. However, Google has announced that they plan to completely stop supporting the use of the noindex in the robots.txt file. So, it’s time to adapt a new way of instructing robots to not index any pages in which you want to avoid being crawled and indexed.
Why is Google stopping support for noindex in robots.txt?
As previously mentioned, the robots.txt noindex isn’t considered an official directive. Despite being unofficially supported by Google for the past quarter of a decade, noindex in robots.txt is often used incorrectly and has failed to work in 8% of cases. Google deciding to standardise the protocol is another step to further optimising the algorithm. Their aim with this standardisation is to prepare for potential open source releases in the future, which won’t support robots.txt directories. Google has been advising for years that users should avoid using robots.txt files so this change, although a major one, doesn’t come as a big surprise to us.
What Other Ways Can I Control The Crawling Process?
In order to get prepared for the day that Googlebot will stop following noindex instructions, as requested in the robots.txt directory, we must adapt to different processes in order to try and control crawling as much as we possibly can. Google has provided a few alternative suggestions on their official blog. However, the two we recommend you use for noindexing are:
• Robots meta tags with ‘noindex’
• Disallow in robots.txt
Robots meta tags with ‘noindex’
The first option we’re going to explore is using noindex in robots meta tags. As a brief summary, a robots meta tag is a bit of code that should be located in the header of a web page. This is the preferred option as it holds similar value, if not more, to that of robots.txt noindex and is highly effective for stopping URLs from being indexed. Using noindex in robots meta tags will still allow Googlebot to crawl your site but it will prevent URLs from being stored in Google’s index.
Disallow in robots.txt
The other method to noindexing is to use disallow in robots.txt. This form of robots.txt informs the robot to avoid visiting and crawling the site, which in turn means that it won’t be indexed.
Important things to bear in mind
There are some important things to keep in mind when using robots.txt to request for pages not to be indexed:
• Robots have the ability to ignore your instructions in robots.txt. Malware robots, spammers and email address harvesters are more likely to ignore robots.txt, so it’s important to think about what you’re requesting to be noindexed and if it’s something which shouldn’t be viewed by all robots.
• Robots.txt files are not private, which means anyone can see what parts of your site you don’t want robots to crawl. So, just remember this because you should NOT be using disallow in robots.txt as a way to hide certain information.
And over to you
We’ve given you an overview of our two recommendations for alternative noindexing methods. It’s now up to you to implement a new method ahead of the 1st of September so that you’re prepared for Google to stop supporting noindex robots.txt. If you have any questions, make sure to get in touch with us.
Sign up for our newsletter at the bottom of this page and follow us on Facebook and Twitter for the latest updates.