Creating a robots.txt file
By Sumantra Roy
Some people believe that they should create
different pages for different search engines, each page optimized for one
keyword and for one search engine. Now, while I don't recommend that
people create different pages for different search engines, if you do
decide to create such pages, there is one issue that you need to be aware
of.
These pages, although optimized for different search
engines, often turn out to be quite similar to each other. The search
engines can now detect when a site has created such similar-looking pages
and may penalize or even ban such sites. To prevent your site from being
penalized for spamming, you need to stop each search engine's spider from
indexing the pages that are not meant for it, i.e. you need to prevent AltaVista
from indexing pages meant for Excite
and vice-versa. The best way to do that is to use a robots.txt file.
You should create the robots.txt file using a plain-text
editor like Windows Notepad. Don't use a word processor to create such
a file, as it may save formatting codes along with the text.
Here is the basic syntax of the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to tell AltaVista's spider, Scooter,
not to spider the file named myfile1.html residing in the root directory
of the server, you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Excite's spider, called ArchitextSpider, not
to spider the files myfile2.html and myfile3.html, you would write
User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course, put multiple User-Agent
statements in the same robots.txt file. Hence, to tell AltaVista not to
spider the file named myfile1.html, and to tell Excite not to spider the
files myfile2.html and myfile3.html, you would write
User-Agent: Scooter
Disallow: /myfile1.html
User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
If you want to prevent all robots from spidering the
file named myfile4.html, you can use the * wildcard character in the
User-Agent line, i.e. you would write
User-Agent: *
Disallow: /myfile4.html
However, you cannot use the wildcard character in
the Disallow line.
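You can check how a spider would interpret such rules locally with Python's standard urllib.robotparser module. The sketch below parses the wildcard example above directly, without fetching anything from a server; the user-agent and file names are the ones from the examples:

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules directly, without fetching from a server.
rp = RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /myfile4.html",
])

# The wildcard record applies to every robot: no spider may fetch the
# disallowed file, while files not listed remain allowed.
print(rp.can_fetch("Scooter", "/myfile4.html"))   # False
print(rp.can_fetch("Scooter", "/myfile1.html"))   # True
```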
Once you have created the robots.txt file, you
should upload it to the root directory of your domain. Uploading it to any
sub-directory won't work - the robots.txt file needs to be in the root
directory.
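If you maintain many such rules, you can also generate the file with a short script before uploading it. This is only a sketch, reusing the example spider name and file from above; the output is the same plain-text format a text editor would produce:

```python
# Sketch: write a plain-text robots.txt file programmatically.
# The spider name and path below are the examples from above.
rules = [
    ("Scooter", ["/myfile1.html"]),
]

lines = []
for agent, paths in rules:
    lines.append(f"User-Agent: {agent}")
    lines.extend(f"Disallow: {path}" for path in paths)
    lines.append("")  # blank line between records

with open("robots.txt", "w") as f:
    f.write("\n".join(lines))
```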
I won't discuss the syntax and structure of the
robots.txt file any further - you can get the complete specifications from
http://www.robotstxt.org/wc/norobots.html
Now we come to how the robots.txt file can be used
to prevent your site from being penalized for spamming in case you are
creating different pages for different search engines. What you need to do
is to prevent each search engine from spidering pages which are not meant
for it.
For simplicity, let's assume that you are targeting
only two keywords: "tourism in Australia" and "travel to
Australia". Also, let's assume that you are targeting only four of
the major search engines: AltaVista, Excite, HotBot
and Northern Light.
Now, suppose you have used the following
convention for naming the files: each page is named by joining the
individual words of the keyword for which the page is optimized with
hyphens, followed by the first two letters of the name of the search
engine for which the page is optimized.
Hence, the files for AltaVista are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for Excite are
tourism-in-australia-ex.html
travel-to-australia-ex.html
The files for HotBot are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for Northern Light are
tourism-in-australia-no.html
travel-to-australia-no.html
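The naming convention above is mechanical enough to script. The sketch below derives all eight filenames from the two keywords and four engine names (the space in "Northern Light" is dropped before taking the first two letters):

```python
# Sketch: derive the page filenames from the naming convention above.
keywords = ["tourism in Australia", "travel to Australia"]
engines = ["AltaVista", "Excite", "HotBot", "Northern Light"]

filenames = []
for engine in engines:
    # First two letters of the engine name, spaces removed, lowercased.
    suffix = engine.replace(" ", "")[:2].lower()
    for kw in keywords:
        # Join the keyword's words with hyphens and append the suffix.
        filenames.append("-".join(kw.lower().split()) + f"-{suffix}.html")

print("\n".join(filenames))
```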
As I noted earlier, AltaVista's spider is called
Scooter and Excite's spider is called ArchitextSpider.
A list of spiders for the major search engines can
be found at http://www.searchenginewatch.com/webmasters/spiderchart.html
From this list, we find that the spider for Northern
Light is called Gulliver. HotBot uses Inktomi
and Inktomi's spider is called Slurp. Using this knowledge, here's what
the robots.txt file should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html
User-Agent: ArchitextSpider
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html
User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html
User-Agent: Gulliver
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above lines in the robots.txt file,
you instruct each search engine not to spider the files meant for the
other search engines.
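Since every record follows the same pattern, you can build the whole rule set programmatically and verify its logic with urllib.robotparser. This is a sketch: it regenerates the exact rules listed above from a table of spider names and filename suffixes, then checks that each spider is allowed its own pages and blocked from the others':

```python
from urllib.robotparser import RobotFileParser

# Spider name -> filename suffix of the pages meant for that engine.
engines = {
    "Scooter": "al",          # AltaVista
    "ArchitextSpider": "ex",  # Excite
    "Slurp": "ho",            # HotBot (Inktomi)
    "Gulliver": "no",         # Northern Light
}
pages = ["tourism-in-australia", "travel-to-australia"]

# For each spider, disallow every page meant for a different engine.
lines = []
for spider, own in engines.items():
    lines.append(f"User-Agent: {spider}")
    for other in engines.values():
        if other != own:
            for page in pages:
                lines.append(f"Disallow: /{page}-{other}.html")
    lines.append("")  # blank line between records

rp = RobotFileParser()
rp.parse(lines)

# Each spider may fetch its own pages but not the other engines' pages.
print(rp.can_fetch("Scooter", "/tourism-in-australia-al.html"))  # True
print(rp.can_fetch("Scooter", "/tourism-in-australia-ex.html"))  # False
```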
When you have finished creating the robots.txt file,
double-check it to ensure that you have not made any errors. A
small error can have disastrous consequences: a search engine may spider
files which are not meant for it, in which case it can penalize your site
for spamming, or it may not spider any files at all, in which case you
won't get top rankings in that search engine.
A useful tool for checking the syntax of your
robots.txt file can be found at http://www.tardis.ed.ac.uk/~sxw/robots/check/.
While it will help you correct syntactical errors in the robots.txt file,
it won't catch logical errors, for which you will still
need to go through the robots.txt file thoroughly, as mentioned above.
Article by Sumantra Roy. Sumantra is one of the most
respected search engine positioning specialists on the Internet. To have
Sumantra's company place your site at the top of the search engines, go to
http://www.1stSearchRanking.com/t.cgi?1135
For more advice on taking your web site to the top of the search
engines, subscribe to his FREE newsletter at http://www.1stSearchRanking.com/t.cgi?1135&newsletter.htm