Robots.txt

How to Use the Robots.txt File

 

Search engines and robots

(Original text by Rich Anderson) In order for search engines to evaluate and index HTML pages, they must first acquire them. They do this by sending small programs called 'spiders', also known as 'robots', to visit web sites and gather HTML pages for indexing. Once a page has been gathered, evaluated and indexed, it is available for listing in the search engine's results, but only in the results of the engine whose spider gathered it. The exception is when a search engine uses another engine's database for its results. To make full use of the robots.txt file, it is necessary to know which engine uses which spider.

It is often desirable that a search engine index only the pages that were designed for it, rather than every page that its spider can find. This is the purpose of the robots.txt file: it tells spiders what not to index. For instance, you can instruct AltaVista's spider not to gather the pages from a directory where pages made for Excite are kept. Most spiders adhere to the instructions in the robots.txt file, though it should be stated that they don't have to.

The robots.txt file

The robots.txt file need not exist but, if it does exist, it must be called 'robots.txt' and must be written in plain ASCII. Any plain-text editor will do to create and edit it; 'Notepad' is ideal. It must also be in the website's root directory, as spiders never look for it anywhere else. The robots.txt standard allows you to do just one thing - disallow a spider or spiders from specific pages and/or directories. An entry in the file consists of two elements - the 'User-agent' element and the 'Disallow' element. The 'User-agent' element consists of one line and the 'Disallow' element consists of one or more lines. E.g.

User-agent: Googlebot
Disallow: /private.htm
Disallow: /cgi-bin
Disallow: /members

Notice the colons! Syntax does matter. In the example, the entry applies only to 'Googlebot', Google's spider. It is disallowed from indexing the 'private.htm' page in the root directory and anything in the 'cgi-bin' and 'members' directories. To disallow the spider from indexing a specific page in a sub-directory, the following would be used:

User-agent: Googlebot
Disallow: /subdir/specificpage.htm
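
As a rough illustration of the two points above (where a spider looks for the file, and how it reads an entry), here is a minimal Python sketch using the standard library's urllib.robotparser module. The host www.example.com and the helper names robots_url and build_parser are assumptions made for this illustration only, and the sample file simply combines the two Googlebot examples into one entry.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# The Googlebot entry from the examples above, with the sub-directory rule added.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private.htm
Disallow: /cgi-bin
Disallow: /members
Disallow: /subdir/specificpage.htm
"""


def robots_url(page_url):
    """Return the root-level robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def build_parser(text):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# www.example.com is a placeholder host, not a site named in the text.
print(robots_url("http://www.example.com/subdir/specificpage.htm"))

for path in ("/private.htm", "/members/list.htm",
             "/subdir/specificpage.htm", "/subdir/other.htm", "/index.htm"):
    print(path, parser.can_fetch("Googlebot", path))
```

A spider with no entry of its own, such as 'scooter' here, is not restricted by this file at all.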


More robots.txt examples

User-agent: scooter
Disallow:

'Scooter' is AltaVista's spider. The blank after "Disallow:" indicates that nothing is disallowed. The spider is free to follow links at will, gathering any and all pages that it finds in the site.

User-agent: ArchitextSpider
Disallow: /

This example disallows 'ArchitextSpider' (Excite's spider) from gathering any relative URL beginning with '/'. Because all relative URLs on a server begin with '/', the entire site is closed off to the spider.
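
The difference between a blank "Disallow:" and "Disallow: /" can be checked with a short Python sketch. It uses the standard library's urllib.robotparser; placing both entries in one file, separated by a blank line, and the sample paths are assumptions made for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Two entries in one file, separated by a blank line.
ROBOTS_TXT = """\
User-agent: scooter
Disallow:

User-agent: ArchitextSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for spider in ("scooter", "ArchitextSpider"):
    for path in ("/", "/index.htm", "/members/list.htm"):
        # scooter is allowed everywhere; ArchitextSpider is allowed nowhere.
        print(spider, path, parser.can_fetch(spider, path))
```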

User-agent: *
Disallow: /log
Disallow: /members
Disallow: /cgi-bin
Disallow: /searchengine.htm

The star (*) in this example is used to indicate that the disallow lines apply to all spiders.
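
A small Python sketch, again using the standard library's urllib.robotparser, suggests how the star entry behaves. The spider name 'SomeOtherBot' and the sample paths are made up for the illustration. Note that a Disallow value is matched as a path prefix, so "/log" also covers a page such as "/login.htm".

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /log
Disallow: /members
Disallow: /cgi-bin
Disallow: /searchengine.htm
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# 'SomeOtherBot' is a hypothetical spider name; every spider gets the same answer.
for spider in ("scooter", "Googlebot", "SomeOtherBot"):
    for path in ("/log/today.txt", "/login.htm", "/searchengine.htm", "/index.htm"):
        print(spider, path, parser.can_fetch(spider, path))
```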

User-agent: scooter
Disallow: /inktomi
Disallow: /google
Disallow: /lycos
Disallow: /northernstar

This example excludes Scooter (AltaVista) from the other engines' directories. Assuming there is also an 'altavista' directory, Scooter is left free to gather the pages in it. Keeping pages that are optimized for a particular engine in their own directory is common practice, although the directories are not usually given such obvious names.
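
When every engine has its own directory, the entries follow a regular pattern: each spider is disallowed from every directory except its own. The Python sketch below generates such a file from a mapping of spider names to directories. The function name build_robots_txt and the directory names '/altavista', '/excite' and '/google' are assumptions chosen for the illustration; only the spider names come from the examples above.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(engine_dirs):
    """Build one entry per spider, disallowing every other engine's directory."""
    records = []
    for spider in engine_dirs:
        lines = [f"User-agent: {spider}"]
        for other, directory in engine_dirs.items():
            if other != spider:
                lines.append(f"Disallow: {directory}")
        records.append("\n".join(lines))
    # Entries are separated by a blank line.
    return "\n\n".join(records) + "\n"


# Hypothetical directory layout: one directory per engine.
ENGINE_DIRS = {
    "scooter": "/altavista",
    "ArchitextSpider": "/excite",
    "Googlebot": "/google",
}

robots_txt = build_robots_txt(ENGINE_DIRS)
print(robots_txt)

# Check the generated file with the standard library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("scooter", "/altavista/index.htm"))
print(parser.can_fetch("scooter", "/google/index.htm"))
```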

Miscellaneous robots.txt notes

  • Wildcards in paths are NOT part of the original robots.txt standard. E.g. Disallow: /members/* doesn't work. Disallow: /members does. You shouldn't put more than one path on a Disallow line.
  • A blank robots.txt file, or the absence of one, will NOT stop a spider from gathering a site's pages.
  • Rogue robots may ignore your robots.txt file completely. Never rely on the robots.txt file as a means of keeping private content truly private.
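
The first two notes can be modelled with a few lines of Python. This is a simplified sketch of literal prefix matching, not the code of any real spider; the function name is_disallowed and the sample paths are assumptions made for the illustration.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow value.

    Each value is compared literally, so a '*' in a rule is just a character.
    """
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


# '/members/*' is taken literally, so it does not cover real pages.
print(is_disallowed("/members/list.htm", ["/members/*"]))

# The plain prefix '/members' does cover them.
print(is_disallowed("/members/list.htm", ["/members"]))

# A blank file (no rules at all) disallows nothing.
print(is_disallowed("/members/list.htm", []))
```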


Why use the robots.txt file?

To point spiders in the right direction and to ensure that only the files that you want indexed are indexed. Some of the reasons for excluding files from some or all spiders are privacy (members-only pages, log files, form results, etc.), technical issues (parts of framesets, shtml pages, etc.) and pages optimized for a particular search engine (e.g. doorway and hallway pages). Frames are a common case: pages that belong in a frame structure are often quite useless outside of it, and the robots.txt file is one way of dealing with this problem.


 

©Copyright 2000 - 2005, Association of Internet Marketing Professionals, Inc., All Rights Reserved.