Robots.txt Basics
One of the most overlooked items related to your web site is a small, unassuming text file called robots.txt. This simple text file has the important job of telling web crawlers (including search engine spiders) which files they may access on your site.
Also known as “A Standard for Robot Exclusion”, the robots.txt file gives the site owner the ability to request that spiders not access certain areas of the site. The problem arises when webmasters accidentally block more than they intend.
At least once a year I get a call from a frantic site owner telling me that their site was penalized and is now out of Google when, more often than not, they have blocked the site from Google themselves via their robots.txt file.
An advantage of being a long-time search marketer is that experience teaches you where to look when sites go awry. Interestingly, people always look for a complex reason for an issue when, more often than not, it is a simpler, more basic problem.
It’s a situation not unlike the printing press company hiring the guy who knew which screw to turn. Eliminate the simple things that could be causing the problem before you jump to the complex. With this in mind, one of the first things I always check when I am told a site is having a penalty or crawling issues is the robots.txt file.
Accidental Blockage by Way of Robots.txt
This is often a self-inflicted wound that makes many webmasters want to pound their heads into their desks when they discover the error. Sadly, it happens to companies small and large, including publicly traded businesses with dedicated staffs of IT experts.
There are numerous ways to accidentally alter your robots.txt file. Most often it occurs after a site update when the IT department, designer, or webmaster rolls up files from a staging server to a live server. In these instances, the robots.txt file from the staging server is accidentally included in the upload. (A staging server is a separate server where new or revised web pages are tested prior to uploading to the live server. This server is generally excluded from search engine indexing on purpose to avoid duplicate content issues.)
If your robots.txt excludes your site from being indexed, it won’t force the removal of pages already in the index, but it will block polite spiders from following links to those pages and prevent them from parsing the content of those pages. (Pages that are blocked may still reside in the index if they are linked to from other places.) You may think you did something wrong that got your site penalized or banned, but it’s actually your robots.txt file telling the engines to go away.
How to Check Your Robots.txt
How do you tell what’s in your robots.txt file? The easiest way to view your robots.txt is to go to a browser and type your domain name followed by a slash then “robots.txt.” It will look something like this in the address bar:
http://www.yourdomainname.com/robots.txt
If you get a 404 error page, don’t panic. The robots.txt file is actually optional. It is recommended by most engines but not required.
You can also log into your Google Webmaster Tools account, and Google will tell you which URLs are being blocked by your robots.txt file.
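For comparison, a robots.txt file that places no restrictions at all (the file exists, but nothing is blocked) simply leaves the Disallow: line empty:
User-agent: *
Disallow: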
You have a problem if your robots.txt file says:
User-agent: *
Disallow: /
A robots.txt file that contains the text above is excluding ALL robots – including search engine robots – from indexing the ENTIRE site. Unless you are working on a staging server, you don’t normally want to see this on a site live on the web.
How to Keep Areas of your Site From Being Indexed
There may be certain sections you don’t want indexed by the engines (such as an advertising section or your log files). Fortunately, you can selectively disallow them. A robots.txt that disallows the ads and logs directories would be written like this:
User-agent: *
Disallow: /ads
Disallow: /logs
The Disallow lines shown above only keep robots out of the directories listed. Note that the protocol is pretty simplistic: the robot does a text comparison of the path portion of the URL against each Disallow: string, and if the front of the URL path matches the text on a Disallow: line (a “head” match), the URL is not fetched or parsed by the spider.
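For example (the paths here are hypothetical), the Disallow: /ads line above causes a polite spider to skip any URL whose path begins with /ads:
/ads/banner.html (skipped – the path begins with /ads)
/ads.html (skipped – it also begins with /ads)
/adserver/config.html (skipped – a “head” match does not stop at a directory boundary)
/news/ads.html (crawled – the path does not begin with /ads)
If you only mean to block the directory itself, ending the path with a slash (Disallow: /ads/) avoids catching /ads.html and /adserver/ as well.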
Many errors are introduced because webmasters think the robots.txt format is smarter than it really is. For example, the basic version of the Protocol does NOT allow:
- Wildcards in the Disallow: line
- “Allow:” lines
Google has expanded on the original format to allow both of these options, but the expansions are not universally accepted, so it is recommended that they ONLY be used in a “User-agent:” section for a robot run by Google (e.g. Googlebot, Googlebot-Image, Mediapartners-Google, AdsBot-Google).
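As an illustration only (the paths are hypothetical), a Googlebot-only section using both extensions might look like this; other robots ignore a Googlebot-specific section, so they never act on these lines:
User-agent: Googlebot
Allow: /ads/public/
Disallow: /ads/
Disallow: /*.pdf$
In Google’s implementation the more specific Allow: line takes precedence, so /ads/public/ remains crawlable even though the rest of /ads/ is blocked, and the wildcard line blocks any URL whose path ends in .pdf.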
Does the robots.txt Restrict People From Your Content?
No, it only requests that spiders refrain from walking through the site and parsing the content for their indexes. Some webmasters falsely think that disallowing a directory in the robots.txt file protects that area from prying eyes. The robots.txt file only tells robots what to do, not people (and the standard is voluntary, so only “polite” robots follow it). If certain files are confidential and you don’t want them seen by other people or competitors, they should be password protected.
Note that the robots exclusion standard is a “please don’t parse this page’s content” standard. If you want content removed from the index, you need to include a robots noindex Meta tag on each page you want removed.
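For reference, the noindex Meta tag goes in the <head> of each page you want dropped and looks like this:
<meta name="robots" content="noindex">
Keep in mind that a spider has to be able to crawl a page in order to see the tag, so a page carrying the noindex tag should not also be blocked in robots.txt.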
Check robots.txt First
The good news is that if you have accidentally blocked your own site, the problem is easy to fix now that you know to look at your robots.txt file first. Little things matter online. To learn more about the robots.txt file, see http://www.robotstxt.org.
One behavior that is often overlooked is illustrated by the following simple example.
User-agent: *
Disallow: /dogs

User-agent: Googlebot
Disallow: /cats
You might expect that all agents (including Googlebot) would stop accessing the /dogs folder and that Googlebot would additionally stop accessing the /cats folder.
In reality, Googlebot will continue to access the /dogs folder because, when there is a Googlebot-specific section in robots.txt, Google reads ONLY that section.
To be absolutely clear, if you have both a User-agent: * and a User-agent: Googlebot section in robots.txt you must put everything that you want Googlebot to obey in the User-agent: Googlebot section even if this duplicates directives already listed in the User-agent: * section of robots.txt.
Googlebot reads only the most specific section, by User-agent, of the robots.txt file.
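So, to keep Googlebot out of both folders in the example above, the /dogs directive has to be repeated in the Googlebot section:
User-agent: *
Disallow: /dogs

User-agent: Googlebot
Disallow: /dogs
Disallow: /cats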
Finally, add a blank line after the last Disallow: directive in each section of the robots.txt file.
There can be several records in the file, and a given robot will attempt to pick the ONE record whose User-agent: line best matches its own name. The “User-agent: *” record is a catch-all default which will ONLY be used if a more specific User-agent match cannot be found.
Care should be taken to avoid MULTIPLE records for the same User-agent, since the Standard (such as it is) does not specify how a robot should handle this case. If you have multiple Disallows for a given User-agent, you should put all the Disallows (one per line) within the same record, then put a blank line to indicate the end of that record.
User-agent: first robot name
Disallow: /directory1
Disallow: /directory2
Disallow: /directory/subdirectory3

User-agent: 2nd robot name
Disallow: /directory1
Disallow: /directory2

# 3rd is allowed everywhere
User-agent: 3rd robot name
Disallow:

# catch-all (for robots not listed elsewhere)
User-agent: *
Disallow: /