
Robots.txt File
Format, Syntax and Examples


The Format and Syntax of the robots.txt File

  1. The file consists of one or more records separated by one or more blank lines, each line being terminated by CR, CR/NL, or NL.
  2. Each record contains lines having the form of "<field>:<optionalspace><value><optionalspace>".
  3. Blank lines are not permitted within a single record in the "robots.txt" file.
  4. The field name is not case sensitive.
  5. Comments can be included in the file using the '#' character, which indicates that the preceding space (if any) and the remainder of the line are ignored.
  6. Lines containing only a comment are totally ignored.
  7. The record starts with one or more User-agent lines, followed by one or more Disallow lines.
  8. Unrecognized headers are ignored.
  9. A single record might look like the following:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /mystuff/
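
The rules above can be sketched in Python. This is only a minimal illustration, not a reference parser: the function name parse_records, the dictionary layout and the sample text are assumptions made for this example. It splits the file into records at blank lines, strips comments, treats field names case-insensitively and skips unrecognized headers.

```python
def parse_records(text):
    """Group robots.txt lines into records (hypothetical helper for illustration)."""
    records = []
    current = None
    for raw in text.splitlines():  # splitlines() accepts CR, CR/NL and NL endings
        if raw.strip() == "":
            current = None  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop a comment and the space before it
        if not line or ":" not in line:
            continue  # comment-only and malformed lines are ignored
        field, value = line.split(":", 1)
        field = field.strip().lower()  # the field name is not case sensitive
        value = value.strip()
        if field == "user-agent":
            # A User-agent line after Disallow lines starts a new record.
            if current is None or current["disallow"]:
                current = {"agents": [], "disallow": []}
                records.append(current)
            current["agents"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
        # any other field is an unrecognized header and is ignored
    return records


sample = """# robots.txt for http://www.mywebsite.com/
User-agent: *
Disallow: /cgi-bin/   # scripts
disallow: /tmp/

User-agent: Googlebot-Image
Disallow: /
"""
records = parse_records(sample)
print(records)
```

With the sample text, the first record holds the agent "*" with two disallowed paths, and the second holds "Googlebot-Image" with the single path "/".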

The Comment:
A line that starts with '#' is a comment line.
A comment can also start at the end of a declaration statement.

# This would be a comment line.
# robots.txt for http://www.mywebsite.com/
Disallow: /cyberworld/map/   # Comments can also start here.
 

Special Characters:

  • The asterisk '*' is a wild card that means "any other User-agent".
  • You cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
    For example: 'Disallow: /tmp/*' is not valid.
    Rather, state it as 'Disallow: /tmp/'.
  • The forward slash "/" implies the root directory of the URL and includes all sub-directories and files below the root directory.

User-agent

  • There must be at least one "User-agent" field per record. If a record has more than one, it describes the same access rules for each robot named.
  • The value for this field is the name of the robot that the record is describing access for.
  • If the value is an asterisk '*', then the record describes the access rules for any robot that has not been specifically named in any of the other records.
  • There can only be one "User-agent" in the "/robots.txt" file that has the value of an asterisk.

Disallow:

  • The "Disallow" field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.
    For example,
    Disallow: /help        disallows both /help.html and /help/index.html, whereas
    Disallow: /help/       would disallow /help/index.html but allow /help.html.
  • An empty value for "Disallow" indicates that all URLs can be retrieved.
  • At least one "Disallow" field must be present in each record.
  • An empty "/robots.txt" file has no explicit instructions, so it will be treated as if it did not exist.
    Thus all robots will consider themselves welcome.
  • Do not put more than one path on a Disallow line, rather use multiple Disallow lines.
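
The prefix matching described above can be sketched in a few lines of Python. The helper name is_disallowed is hypothetical, and the code is a simplification that only covers the "starts with" rule and the empty-value rule.

```python
def is_disallowed(path, disallow_values):
    """Return True if path starts with any non-empty Disallow value."""
    for value in disallow_values:
        # An empty value disallows nothing, so it is skipped.
        if value and path.startswith(value):
            return True
    return False


print(is_disallowed("/help.html", ["/help"]))         # True
print(is_disallowed("/help/index.html", ["/help"]))   # True
print(is_disallowed("/help.html", ["/help/"]))        # False
print(is_disallowed("/help/index.html", ["/help/"]))  # True
print(is_disallowed("/anything", [""]))               # False
```

Each path is listed separately in disallow_values, mirroring the rule that every Disallow line carries exactly one path.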

 

Examples

1)  A Basic "robots.txt":
This example declares that all robots (indicated by "*") are not to crawl any pages or directories on your website (the slash "/" stands for your entire website).
It's probably rare that you would want to use such a format.

User-agent: *
Disallow: /

REMEMBER: This declaration only applies to well-behaved robots. A robot can be programmed to ignore the statements in your robots.txt file, or to ignore the file altogether. This is especially true of malware robots that scan the web for security weaknesses, and of the email address harvesters that spammers use to collect email addresses.

2)  Deny Access to a single SE Spider:
You can specify that a particular search engine spider should not access your site. Here, we tell the Google Image bot not to crawl your website images (thus keeping them from being searchable).

User-agent: Googlebot-Image
Disallow: /

3)  Keep all search engine robots from crawling specific directories and pages:
Again, the asterisk means all search engines, and then they are told not to crawl two different areas (all URLs that begin with "/tmp/" or "/privatearea/map/"), a single page "misc.htm" in the tutorials directory and a single page "special.htm" in the root directory.

User-agent: *
Disallow: /tmp/
Disallow: /privatearea/map/
Disallow: /tutorials/misc.htm
Disallow: /special.htm

4)  Only allow certain bots to crawl specific areas of your website:
Here we first declare that crawlers should not crawl any part of the site and then go on to specify that the Google bot can crawl the entire site except for any URLs that begin with "/cgi-bin/" or "/tools/".  Thus you can see that the more specific declarations take precedence over the more general declarations.

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tools/
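
How a robot might choose between the general record and its own record can be sketched as follows. This is an assumption-laden illustration: the function name select_record, the record dictionaries and the robot name "SomeOtherBot" are made up for the example, and the name comparison is a plain case-insensitive equality check, which real crawlers may not use.

```python
def select_record(records, robot_name):
    """Return the record naming robot_name, else the '*' record, else None."""
    default = None
    name = robot_name.lower()
    for record in records:
        for agent in record["agents"]:
            if agent == "*":
                if default is None:
                    default = record  # remember the catch-all record
            elif agent.lower() == name:
                return record  # a specifically named record wins
    return default


records = [
    {"agents": ["*"], "disallow": ["/"]},
    {"agents": ["Googlebot"], "disallow": ["/cgi-bin/", "/tools/"]},
]

print(select_record(records, "Googlebot")["disallow"])     # ['/cgi-bin/', '/tools/']
print(select_record(records, "SomeOtherBot")["disallow"])  # ['/']
```

The named record is returned as soon as it is found, so the "*" record is only used by robots that no other record mentions.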

5)  Only allow certain bots to crawl all of your website:
Notice how first you declare that all bots "*" are not to crawl any part of your website "/", and then you specify that a particular bot (Alexa) can crawl all of your website.  Not having anything after the second "Disallow:" means nothing is disallowed, so that means that everything is allowed for the previously specified bot.

User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:

6)  Use Allow to Only allow Google to crawl your website:
Currently, "Allow" is not part of the original robots.txt protocol, but Google states in its Webmaster's FAQs that you can use "Allow" for their bot. So if you want only the Google bot to crawl your site, you could use the following statements.

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
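
One way to check how a file like this is interpreted is Python's standard urllib.robotparser module, which understands "Allow" lines. The sketch below feeds it the two records separated by a blank line; the robot name "SomeOtherBot" and the path "/index.html" are placeholders chosen for the example, and other parsers may interpret the file differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines

googlebot_ok = parser.can_fetch("Googlebot", "/index.html")
other_ok = parser.can_fetch("SomeOtherBot", "/index.html")
print(googlebot_ok, other_ok)  # True False
```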


Copyright © 2007-2008   Donald Dean Websites - All Rights Reserved