Robots.txt File
Format, Syntax and Examples
The Format and Syntax of the robots.txt File
- The file consists of one or more records separated by one or
more blank lines, each line being terminated by CR, CR/NL, or NL.
- Each record contains lines having the form of "<field>:<optionalspace><value><optionalspace>".
- Blank lines are not permitted within a single record in the
"robots.txt" file.
- The field name is not case sensitive.
- Comments can be included in the file using the '#' character,
which indicates that the preceding space (if any) and the remainder
of the line are ignored.
- Lines containing only a comment are ignored entirely.
- The record starts with one or more User-agent
lines, followed by one or more Disallow
lines.
- Unrecognized field names are ignored.
- A single record might look like the following:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /mystuff/
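
The format rules above can be sketched in a few lines of Python. This is only a minimal illustration, not a complete parser: the function name parse_records and the sample text are assumptions made for this example. It strips comments, lower-cases field names, and starts a new record at each blank line.

```python
def parse_records(text):
    """Split robots.txt text into records, each a list of (field, value) pairs."""
    records, current = [], []
    for raw in text.splitlines():            # copes with CR, CR/NL and NL endings
        line = raw.split("#", 1)[0].strip()  # drop any comment and the space before it
        if not line:
            # A comment-only line is skipped; a truly blank line closes the record.
            if not raw.strip() and current:
                records.append(current)
                current = []
            continue
        field, sep, value = line.partition(":")
        if not sep:
            continue                         # not of the form "<field>:<value>"
        # Field names are not case sensitive, so normalise them to lower case.
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records


sample = (
    "# robots.txt for http://www.mywebsite.com/\n"
    "User-Agent: *\n"
    "Disallow: /cgi-bin/ # a trailing comment\n"
    "# a comment-only line does not end the record\n"
    "Disallow: /tmp/\n"
    "\n"
    "user-agent: Googlebot\n"
    "Disallow:\n"
)
for record in parse_records(sample):
    print(record)
```

The sample yields two records, because only the blank line (not the comment-only line) separates them.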
The Comment:
A line starting with '#' is a comment line.
A comment can also follow a declaration statement on the same line.
# This would be a comment line.
# robots.txt for http://www.mywebsite.com/
Disallow: /cyberworld/map/ # Comments can also start here.
Special Characters:
- The asterisk '*' is a wildcard that means
"any other User-agent".
- You cannot use wildcard patterns or regular expressions in either
User-agent or Disallow lines.
For example, 'Disallow: /tmp/*' is not valid.
Rather, state it as 'Disallow: /tmp/'.
- The forward slash "/" implies the root directory of the URL
and includes all sub-directories and files below the root directory.
User-agent:
- There must be at least one "User-agent" field per record.
- The value for this field is the name of the robot that the record
is describing access for.
- If the value is an asterisk '*', then the record describes the
access rules for any robot that has not been specifically named
in any of the other records.
- There can be only one record in the "/robots.txt" file
whose "User-agent" value is an asterisk.
Disallow:
- The "Disallow" field specifies a partial URL that is not to
be visited. This can be a full path, or a partial path; any URL
that starts with this value will not be retrieved.
For example,
Disallow: /help
disallows both /help.html and /help/index.html, whereas
Disallow: /help/ would disallow
/help/index.html but allow /help.html.
- An empty value for "Disallow" indicates that all URLs can be
retrieved.
- At least one "Disallow" field must be present in each record.
- An empty "/robots.txt" file has no explicit instructions, so it
will be treated as if it did not exist.
Thus all robots will consider themselves welcome.
- Do not put more than one path on a Disallow line; rather, use
multiple Disallow lines.
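
The prefix rule described above ("any URL that starts with this value will not be retrieved") can be illustrated with a short Python sketch. The helper name is_disallowed is hypothetical and exists only for this example; an empty Disallow value is treated as blocking nothing.

```python
def is_disallowed(path, disallow_values):
    """Return True if path starts with any non-empty Disallow value."""
    return any(value and path.startswith(value) for value in disallow_values)


# "Disallow: /help" blocks both URLs.
print(is_disallowed("/help.html", ["/help"]))         # True
print(is_disallowed("/help/index.html", ["/help"]))   # True

# "Disallow: /help/" blocks only URLs inside the directory.
print(is_disallowed("/help.html", ["/help/"]))        # False
print(is_disallowed("/help/index.html", ["/help/"]))  # True

# An empty "Disallow:" value blocks nothing.
print(is_disallowed("/anything.html", [""]))          # False
```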
Examples
1) A Basic "robots.txt":
This example declares that all robots (indicated by "*") are not
to crawl any pages or directories on your website (the slash "/" implies
your entire website).
It's probably rare that you would want to use such a format.
User-agent: *
Disallow: /
REMEMBER: This declaration only
applies to "Program Friendly" robots. A robot
could be programmed to ignore statements in your robots.txt file or
even ignore the file altogether. This is especially
true of malware robots that scan the web for security weaknesses and
of email address harvester programs that spammers use
to collect email addresses.
2) Deny Access to a single SE Spider:
You can specify that a particular search engine spider should not
access your site. Here, we tell the Google Image bot not to crawl
your website images (thus keeping them from being searchable).
User-agent: Googlebot-Image
Disallow: /
3) Keep all search engine robots from crawling specific
directories and pages:
Again, the asterisk means all search engines, and then they are
told not to crawl two different areas (all URLs that begin with "/tmp/"
or "/privatearea/map/"), a single page "misc.htm" in the tutorials directory
and a single page "special.htm" in the root directory.
User-agent: *
Disallow: /tmp/
Disallow: /privatearea/map/
Disallow: /tutorials/misc.htm
Disallow: /special.htm
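
One way to check rules like these before publishing them is Python's standard urllib.robotparser module. The sketch below is illustrative only: it feeds the record above to the parser as text, and the agent name "ExampleBot" and the tested paths are made up for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/
Disallow: /privatearea/map/
Disallow: /tutorials/misc.htm
Disallow: /special.htm
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no network access


def allowed(path, agent="ExampleBot"):
    """Return True if the (hypothetical) agent may fetch the given path."""
    return parser.can_fetch(agent, path)


print(allowed("/tmp/file.txt"))         # False: starts with /tmp/
print(allowed("/tutorials/misc.htm"))   # False: listed page
print(allowed("/tutorials/index.htm"))  # True: only misc.htm is listed
print(allowed("/index.htm"))            # True: no rule matches
```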
4) Only allow certain bots to crawl specific areas of your
website:
Here we first declare that crawlers should not crawl any part of
the site and then go on to specify that the Google bot can crawl the
entire site except for any URLs that begin with "/cgi-bin/" or "/tools/".
Thus you can see that the more specific declarations take precedence
over the more general declarations.
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tools/
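
The precedence of a named record over the "*" record can be observed with Python's standard urllib.robotparser module. This is a sketch under assumptions: "ExampleBot" is a made-up agent name standing in for any robot that is not named in the file, and other parsers may resolve records differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tools/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The record naming Googlebot is used for Googlebot.
print(parser.can_fetch("Googlebot", "/index.htm"))       # True
print(parser.can_fetch("Googlebot", "/cgi-bin/search"))  # False
print(parser.can_fetch("Googlebot", "/tools/a.htm"))     # False

# Any robot not named in the file falls back to the "*" record.
print(parser.can_fetch("ExampleBot", "/index.htm"))      # False
```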
5) Only allow certain bots to crawl all of your website:
Notice how you first declare that all bots "*" are not to crawl
any part of your website "/", and then you specify that a particular
bot (Alexa) can crawl all of your website. Leaving the value
after the second "Disallow:" empty means nothing is
disallowed, so everything is allowed for the
previously specified bot.
User-agent: *
Disallow: /
User-agent: ia_archiver
Disallow:
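
The effect of the empty "Disallow:" value can be confirmed with Python's standard urllib.robotparser module. As before, this is only a sketch: "ExampleBot" and the path are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The empty Disallow value lets ia_archiver fetch everything.
print(parser.can_fetch("ia_archiver", "/any/page.htm"))  # True

# Every other robot is still covered by "Disallow: /".
print(parser.can_fetch("ExampleBot", "/any/page.htm"))   # False
```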
6) Use Allow to only allow Google to crawl your website:
The original robots.txt protocol does not include "Allow",
but Google states in its Webmaster's FAQs that you can use "Allow" for
its bot. So if you want to allow only the Google bot to crawl
your site, you could use the following statements.
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
