Creating and Using Robots.txt
The robots.txt file contains a set of instructions for the search engine spiders that visit the pages of your website and index the site content. The file must be placed in the root directory of your site. Once on the server, it instructs the search engine spiders not to index certain sections of your website.
The URL of the robots.txt file usually looks like this:
http://www.yoursitename.com/robots.txt
The robots.txt file itself is simple and can easily be created in Notepad. After creating the file, save it in your website's root directory (the directory containing your index or home page). The robots.txt file is the first thing that search engine spiders look for when they visit your site, so it is always beneficial to have one, even if you do not want to restrict the spiders from indexing any section of your website.
When do you need to restrict spiders from indexing pages?
There are various situations in which you need to feed the spiders instructions through the robots.txt file. Some of the cases where you will need the robots.txt file to stop spiders from visiting certain pages are listed below:
- You are developing a site, or certain pages of a site, and do not want the unfinished work to appear in search results.
- You are using doorway pages (pages with the same content, each optimized for a different search engine). Search engine robots are sensitive to duplicate content and can penalize your site for spamming.
- You want to block certain spiders from indexing your site, especially bots that crawl pages for the purpose of harvesting email addresses.
- You have directories, such as cgi-bin, or pages such as error and thank-you pages, that you need to keep the spiders from indexing.
- Your site has information that is useful only to the person for whom you created it, and you want to keep it out of search results so that everyday visitors do not find it.
Avoiding 404 Errors
The robots.txt file is the first thing that search engine spiders look for when visiting your site. If a spider fails to find a robots.txt file in your root directory, a 404 error for the missing file may be recorded in your server log. It is therefore advisable to create a blank robots.txt file, even if you do not need to restrict the spiders from indexing any particular page, and upload it to the root directory. Whether or not you want to restrict spiders from indexing certain pages, the very fact that search engine spiders look for the robots.txt file is reason enough to create and upload it.
Creating the robots.txt file
It is very easy to create a basic robots.txt file. You can use Notepad or any other text editor. The robots.txt file consists of entries. You can create as many entries as you want, depending not only on the sections of your site that you want to keep the spiders from viewing, but also on the various types of spiders or bots that you want to keep from visiting your site at all. The basic building block of the robots.txt file is an entry, and each entry consists of two lines:
User-Agent: [name of the spider]
Disallow: [directory or file to block]
These two lines are repeated in the robots.txt file for each spider that you want to restrict and each directory that you want to keep from being indexed.
Significance of (*) & (/) in robots.txt
The asterisk sign (*) is also known as the wildcard. The (*) sign stands for any spider or robot. Thus, if you write the first line of the entry as:
User-Agent: *
the entry applies to all spiders, without excluding any of them.
The forward slash sign (/) is very important, as it determines whether the spiders are allowed to index your site. If you write the second line of the entry as:
Disallow:
you are not disallowing any section of your site from being indexed. However, if you place just the (/) sign after it, you are restricting the spiders and bots from indexing your whole site. The second line of the entry will then look like this:
Disallow: /
Every entry in the robots.txt file must contain at least one Disallow field, even if it is left empty. Thus, if you want all the robots to retrieve all the URLs of the site, your entry should look like this:
User-Agent: *
Disallow:
Similarly, if you want no robot to index any part of your website, your entry should appear as follows:
User-Agent: *
Disallow: /
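As a quick check, these two patterns can be fed to Python's standard urllib.robotparser module, which follows the same rules as a compliant crawler (the example.com URL below is just a placeholder):

```python
# A minimal sketch using Python's standard urllib.robotparser, which
# interprets robots.txt rules the way a compliant crawler does.
# The example.com URL is a placeholder.
from urllib import robotparser

# Entry 1: all robots may retrieve all URLs (empty Disallow).
allow_all = robotparser.RobotFileParser()
allow_all.parse(["User-Agent: *", "Disallow:"])

# Entry 2: no robot may index any part of the site (Disallow: /).
block_all = robotparser.RobotFileParser()
block_all.parse(["User-Agent: *", "Disallow: /"])

print(allow_all.can_fetch("Googlebot", "http://www.example.com/page.htm"))  # True
print(block_all.can_fetch("Googlebot", "http://www.example.com/page.htm"))  # False
```

Both results match the rules described above: an empty Disallow permits everything, while Disallow: / blocks everything.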
Following this pattern, you can create similar entries for individual spiders and directories. Suppose you do not want Google to index a particular file on your site; all you need to do is enter the details in the robots.txt file. Let's say the file is named "new.htm" and is located in the "new" directory. To prevent Googlebot (the Google spider) from indexing the file, you need to add the following lines to the robots.txt file:
User-Agent: Googlebot
Disallow: /new/new.htm
But do you always have to create a fresh entry for each and every directory you want to keep the spiders from scanning? That certainly sounds time-consuming! Well, if you want none of the search engine robots to index certain directories, such as cgi-bin, and certain folders, such as _new, _old, _secure, _template, and so on, you do not have to create separate entries for the individual directories. You can easily add them all in a single entry. The file will then appear as follows:
User-Agent: *
Disallow: /cgi-bin/
Disallow: /_new/
Disallow: /_old/
Disallow: /_secure/
Disallow: /_template/
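An entry of this form, with one User-Agent line followed by several Disallow lines, can again be sketched and checked with Python's standard urllib.robotparser (the directory names and example.com URLs below are placeholders):

```python
# Sketch: one entry with several Disallow lines, checked with Python's
# standard urllib.robotparser. Directory names and URLs are placeholders.
from urllib import robotparser

rules = [
    "User-Agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /_new/",
    "Disallow: /_old/",
    "Disallow: /_secure/",
    "Disallow: /_template/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Anything under a listed directory is blocked; everything else is allowed.
print(parser.can_fetch("*", "http://www.example.com/cgi-bin/form.cgi"))  # False
print(parser.can_fetch("*", "http://www.example.com/index.htm"))         # True
```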
That is not all. Another benefit of using a robots.txt file in the root directory is that you can create several pages on your website that are optimized for different search engines. To understand this well, let us take the example of two pages called "page1.htm" and "page2.htm" that are optimized for Lycos and Google respectively. (T-Rex is the name of the Lycos spider, and Googlebot is the name of the Google spider.) In this case you would want to hide "page1.htm" from Google and "page2.htm" from Lycos. Thus, the two entries in the robots.txt file should look like this:
User-Agent: Googlebot
Disallow: /page1.htm

User-Agent: T-Rex
Disallow: /page2.htm
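The per-spider behaviour described here can also be sketched with Python's standard urllib.robotparser, writing the two entries inline (the example.com URLs are placeholders):

```python
# Sketch: two per-spider entries, checked with Python's standard
# urllib.robotparser. URLs below are placeholders.
from urllib import robotparser

rules = [
    "User-Agent: Googlebot",
    "Disallow: /page1.htm",
    "",                         # a blank line separates the two entries
    "User-Agent: T-Rex",
    "Disallow: /page2.htm",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Googlebot is kept away from page1.htm but may fetch page2.htm;
# T-Rex gets the opposite treatment.
print(parser.can_fetch("Googlebot", "http://www.example.com/page1.htm"))  # False
print(parser.can_fetch("Googlebot", "http://www.example.com/page2.htm"))  # True
print(parser.can_fetch("T-Rex", "http://www.example.com/page2.htm"))      # False
```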
How secure is robots.txt?
Even though writing a robots.txt file is a simple way to keep unwanted spiders from indexing your site, it does not provide any security. While the robots.txt file may keep specific pages of your site from appearing in searches, it does not make those pages unavailable altogether. It is true that most bots and spiders respect the robots.txt file and index your site accordingly. But it is equally true that numerous spiders do not follow the guidelines specified in the robots.txt file; some are specifically designed to visit the very pages that you have restricted.
A quick overview of the robots.txt file
The robots.txt file, located in the root directory of your site, gives the spiders direction on indexing pages of the website. It is therefore naturally the first thing that bots or spiders look for when visiting your site. Even if you do not want to restrict the spiders from indexing sections of your website, it is essential to upload a robots.txt file to the root directory of your site. This will not only welcome the spiders to your site, but will also prevent 404 errors from occurring.
Doorway pages can easily be blocked from specific spiders so that your site is not marked as a spammer. Optimizing pages for specific search engines is possible by blocking the other search engines' spiders from indexing each optimized page. You can feed instructions into the robots.txt file to keep out spiders that enter your site to harvest email addresses. Moreover, the robots.txt file gives you the freedom to allow and disallow specific directories and files from being retrieved by search engine bots.