Robots.txt Basics
One of the most overlooked items related to your web site is a small, unassuming text file called the robots.txt file. This simple file has the important job of telling web crawlers (including search engine spiders) which files they can access on your site.
Also known as “A Standard for Robot Exclusion,” the robots.txt file gives the site owner the ability to request that spiders not access certain areas of the site. The problem arises when webmasters accidentally block more than they intend.
At least once a year I get a call from a frantic site owner telling me that their site has been penalized and is now out of Google, when in fact they have often blocked Google themselves via their robots.txt file.
An advantage of being a long-time search marketer is that experience teaches you where to look when sites go awry. Interestingly, people are always looking for a complex reason for an issue when, more often than not, it is a simpler, more basic problem.
It’s a situation not unlike the printing press company hiring the guy who knew which screw to turn. Eliminate the simple things that could be causing the problem before you jump to the complex. With this in mind, one of the first things I always check when I am told a site is having a penalty or crawling issues is the robots.txt file.
Accidental Blockage by Way of Robots.txt
This is often a self-inflicted wound that causes many webmasters to want to pound their heads into their desks when they discover the error. Sadly, it happens to companies small and big, including publicly traded businesses with a dedicated staff of IT experts.
There are numerous ways to accidentally alter your robots.txt file. Most often it occurs after a site update when the IT department, designer, or webmaster rolls up files from a staging server to a live server. In these instances, the robots.txt file from the staging server is accidentally included in the upload. (A staging server is a separate server where new or revised web pages are tested prior to uploading to the live server. This server is generally excluded from search engine indexing on purpose to avoid duplicate content issues.)
If your robots.txt excludes your site from being indexed, this won’t force removal of pages from the index, but it will block polite spiders from following links to those pages and prevent the spiders from parsing the content of those pages. (Pages that are blocked may still reside in the index if they are linked to from other places.) You may think you did something wrong that got your site penalized or banned, but it’s actually your robots.txt file telling the engines to go away.
How to Check Your Robots.txt
How do you tell what’s in your robots.txt file? The easiest way to view your robots.txt is to go to a browser and type your domain name followed by a slash then “robots.txt.” It will look something like this in the address bar:
http://www.yourdomainname.com/robots.txt
If you get a 404 error page, don’t panic. The robots.txt file is actually optional; it is recommended by most engines but not required.
You can also log into your Google Webmaster Tools account and Google will tell you which URLs are being restricted from indexing.
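If you prefer to check programmatically, Python’s standard library includes a robots.txt parser. The sketch below is only an illustration; “www.yourdomainname.com” and the sample paths are placeholders, so substitute your own domain and the URLs you want to verify:

import urllib.robotparser

# Point the parser at the live robots.txt file and fetch it
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.yourdomainname.com/robots.txt")
rp.read()

# can_fetch() reports whether a given user-agent may crawl a given URL
print(rp.can_fetch("*", "http://www.yourdomainname.com/"))
print(rp.can_fetch("Googlebot", "http://www.yourdomainname.com/ads/"))

If the first call prints False, your robots.txt is telling every polite crawler to stay away from your home page.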
You have a problem if your robots.txt file says:
User-agent: *
Disallow: /
A robots.txt file that contains the text above is excluding ALL robots – including search engine robots – from indexing the ENTIRE site. Unless you are working on a staging server, you don’t normally want to see this on a site live on the web.
How to Keep Areas of Your Site From Being Indexed
There may be certain sections you don’t want indexed by the engines (such as an advertising section or your log files). Fortunately, you can selectively disallow them. A robots.txt that disallows the ads and logs directories would be written like this:
User-agent: *
Disallow: /ads
Disallow: /logs
The disallow statements shown above only keep the robots from indexing the directories listed. Note that the protocol is pretty simplistic: it does a text comparison of the path of the URL against each Disallow: string, and if the front of the URL matches the text on a Disallow: line (a “head” match), then the URL is not fetched or parsed by the spider.
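To make that head match concrete, here is a minimal sketch in Python of the comparison a polite spider performs; the rules and paths are made up for the example:

# Made-up rules and paths, for illustration only
rules = ["/ads", "/logs"]

def is_blocked(path, disallow_rules):
    # A URL is skipped if its path begins with the text of any Disallow: line
    return any(rule and path.startswith(rule) for rule in disallow_rules)

print(is_blocked("/ads/banner.gif", rules))         # True: begins with /ads
print(is_blocked("/ads-archive/2009.html", rules))  # True: also begins with /ads
print(is_blocked("/about.html", rules))             # False: no rule matches

Notice that the second URL is blocked even though it lives in a different directory; that simple head match is exactly why an overly short Disallow: string can block more than you intend.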
Many errors are introduced because webmasters think the robots.txt format is smarter than it really is. For example, the basic version of the Protocol does NOT allow:
- Wildcards in the Disallow: line
- “Allow:” lines
Google has expanded on the original format to allow both of these options, but the expansions are not universally accepted, so it is recommended that they ONLY be used in a “User-agent:” group for a crawler run by Google (e.g., Googlebot, Googlebot-Image, Mediapartners-Google, Adsbot-Google).
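For example, a rule set using these extensions, written so that only Google’s crawler is asked to honor it, might look like this (the paths and file type are purely illustrative):

User-agent: Googlebot
Disallow: /*.pdf
Allow: /ads/public/
Disallow: /ads/

Because the group is addressed only to Googlebot, other spiders that match a different “User-agent:” line (or the catch-all *) never have to interpret the wildcard or the Allow: line.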
Does the robots.txt Restrict People From Your Content?
No, it only requests that spiders refrain from crawling the content and parsing it for their indexes. Some webmasters mistakenly think that disallowing a directory in the robots.txt file protects the area from prying eyes. The robots.txt file only tells robots what to do, not people (and the standard is voluntary, so only “polite” robots follow it). If certain files are confidential and you don’t want them seen by other people or competitors, they should be password protected.
Note that the robots exclusion standard is a “please don’t parse this page’s content” standard. If you want content removed from the index, you need to include a robots noindex meta tag on each page you want removed.
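That tag belongs in the head section of each page, and a typical version looks like this:

<meta name="robots" content="noindex">

Keep in mind that a spider has to be able to crawl the page to see the tag, so don’t also block the page in robots.txt if you want the noindex to be honored.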
Check robots.txt First
The good news is that if you accidentally blocked your own site, the problem is easy to fix now that you know to look at your robots.txt file first. Little things matter online. To learn more about the robots.txt file, see http://www.robotstxt.org.