Creating and Using Robots.txt

The robots.txt file contains a set of instructions for the search engine spiders that visit pages of your website and index the site content. This file must be placed in the root directory of your site. Once placed on the server, it instructs the search engine spiders not to index certain sections of your website.

The URL of the robots.txt file usually appears like this:

http://www.yoursitename.com/robots.txt

The robots.txt file itself is simple and can easily be created in Notepad or any other text editor. After creating the file, save it in your website's root directory (the directory containing your index or home page). Because the robots.txt file is the first thing the search engine spiders look for when they visit your site, it is always beneficial to have one, even if you do not want to restrict the spiders from indexing any section of your website.


When do you need to restrict spiders from indexing pages?

There are various situations in which you will want to use the robots.txt file to keep spiders away from certain pages. Some of them are listed below:

  • You are developing a site, or certain pages of it, and do not want the unfinished work to appear in search results.
  • You are using doorway pages (pages with the same content, each optimized for a different search engine). The search engine robots are sensitive to duplicate content and can penalize your site for spamming.
  • You want to block certain spiders from indexing your site, especially those bots that crawl pages for the purpose of collecting email addresses (see the sketch after this list).
  • You have directories, such as cgi-bin, or ones containing error and Thank You pages, that the spiders should not index.
  • Your site contains information that is useful only to the person for whom you created it. Keeping such pages out of the search indexes prevents everyday searchers from stumbling upon them.
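
As a sketch of the last two cases, the entries below shut out an email-harvesting bot entirely and keep all robots away from an unfinished section. The bot name EmailCollector and the directory name /underconstruction/ are illustrative assumptions, not standard values:

# Block a (hypothetical) address-harvesting bot from the whole site
User-Agent: EmailCollector
Disallow: /

# Keep all robots out of an (assumed) unfinished section
User-Agent: *
Disallow: /underconstruction/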

Avoiding 404 Errors

As mentioned above, the robots.txt file is the first thing that the search engine spiders look for when visiting your site. If the spiders fail to find a robots.txt file in your site's root directory, a 404 Error for the missing file may show up in your server log. It is therefore advisable to create a blank robots.txt file even if you do not need to restrict the spiders from indexing any particular page. Once created, upload it to the root directory. Whether or not you want to restrict the spiders is beside the point; the very fact that the search engine spiders look for the robots.txt file is reason enough to create and upload it.

Creating the robots.txt file

It is very easy to create a basic robots.txt file in Notepad or any other text editor. The file consists of entries, and you can create as many entries as you need, depending both on the sections of your site that you want to keep the spiders from viewing and on the particular spiders or bots that you want to keep away from your site altogether. Each entry consists of two lines:

User-Agent: [name of the spider]
Disallow: [name of the directory or file]

These two lines are repeated in the robots.txt file for each spider that you want to restrict and each directory that you want to keep from being indexed.

Significance of (*) & (/) in robots.txt

The asterisk sign (*) is known as the wildcard and stands for any spider or robot. Thus, if you write the first line of the entry as:

User-Agent: *

This means that the entry applies to all the spiders at once; whether they may index your site depends on the Disallow line that follows.

The forward slash sign (/) is very important, as its presence or absence determines whether the spiders are allowed to index your site. If you write the second line of the entry as:

Disallow:

This means that you are not disallowing any section of your site from being indexed. However, if you just place the (/) sign, it means that you are restricting the spiders and bots from indexing your whole site. The second line of the entry will then look like this:

Disallow: /

Every entry must contain a Disallow field, even if it is left empty without the (/) sign. Thus, if you want all the robots to retrieve all the URLs of the site, your entry should look like this:

User-Agent: *
Disallow:

Similarly, if you do not want a single robot to index any part of your website, your entry should appear as follows:

User-Agent: *
Disallow: /

Following this pattern, you can create similar entries for individual spiders and directories. Suppose you do not want Google to index a particular file on your site; all you need to do is enter the details in the robots.txt file. Let’s say the file is named “new.htm” and lives in the “new” directory. To prevent Googlebot (the Google spider) from indexing the file, you need to add the following lines to the robots.txt file:

User-Agent: Googlebot
Disallow: /new/new.htm

But do you always have to create a fresh entry for each and every directory you want to keep the spiders from scanning? That certainly sounds time consuming. Fortunately, if you want none of the search engine robots to index certain directories, such as cgi-bin, or folders like _new, _old, _secure, _template, and so on, you do not have to create separate entries for each one. You can list them all in a single entry, and the file will then appear as follows:

User-Agent: *
Disallow: /cgi-bin/
Disallow: /_new/
Disallow: /_old/
Disallow: /_secure/
Disallow: /_template/

That is not all. Another benefit of the robots.txt file is that it lets you keep several pages on your website, each optimized for a different search engine. To understand this, take the example of two pages called “page1.htm” and “page2.htm”, optimized for Lycos and Google respectively. (T-Rex is the name of the Lycos spider and Googlebot is the name of the Google spider.) In this case, you would want to hide “page1.htm” from Google and “page2.htm” from Lycos. The two entries in the robots.txt file should look like this:

User-Agent: Googlebot
Disallow: /page1.htm

User-Agent: T-Rex
Disallow: /page2.htm

How secure is robots.txt?

Even though writing a robots.txt file is a simple way to keep unwanted spiders from indexing your site, it is not a security mechanism. The file may keep specific pages of your site out of the search results, but it does not make those pages unavailable. It is true that most bots and spiders respect the robots.txt file and index your site accordingly. But it is equally true that numerous spiders ignore the guidelines specified in the robots.txt file; some are specifically designed to visit the very pages that you have restricted.
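
In fact, because the file is publicly readable at yoursitename.com/robots.txt, a Disallow line can even advertise where sensitive content lives. In the sketch below, the directory name /private/ is an illustrative assumption; well-behaved spiders will skip it, but anyone who reads the file learns the path and can still request those pages directly:

User-Agent: *
# Compliant robots will skip this directory, but the line
# itself reveals to any reader that the directory exists.
Disallow: /private/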

A quick overview of the robots.txt file

The robots.txt file, located in the root directory of your site, directs the spiders in indexing the pages of the website. It is naturally the first thing that bots or spiders look for when visiting your site. Even if you do not want to restrict the spiders from indexing any section of your website, it is worth uploading a robots.txt file to the root directory; doing so accommodates the spiders and prevents 404 Errors from appearing in your server log.

Doorway pages can easily be hidden from specific spiders so that your site is not marked as a spammer. Optimizing pages for specific search engines is possible by blocking the other search engines' spiders from indexing each optimized page. You can also add instructions to the robots.txt file to turn away spiders that enter your site for the purpose of collecting email addresses. Moreover, the robots.txt file gives you the freedom to allow or disallow specific directories and files from being retrieved by search engine bots.
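
Putting these pieces together, a complete robots.txt file might combine the rules discussed above. The directory and page names are the illustrative ones used earlier in this article. Note that a robot obeys the record that names it specifically rather than the (*) record, so any rules that Googlebot or T-Rex should also follow would need to be repeated in their own entries:

# Keep all other robots out of utility directories
User-Agent: *
Disallow: /cgi-bin/
Disallow: /_new/
Disallow: /_old/

# Hide the Lycos-optimized page from Google...
User-Agent: Googlebot
Disallow: /page1.htm

# ...and the Google-optimized page from Lycos
User-Agent: T-Rex
Disallow: /page2.htm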


