Monday, February 18, 2008

Robots.txt File

Robots Overview

A robot is a program that automatically travels around the Web and retrieves documents for the search engines.  It also follows the hyperlinks on the pages that it finds, and thus finds more pages to retrieve.
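As a rough illustration of how a robot finds more pages by following links, here is a minimal sketch using Python's standard html.parser module. The LinkCollector class, the page URL, and the page content are all hypothetical; a real robot would download the page over HTTP and handle many more cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content; a real robot would fetch this from the site.
page_url = "http://www.mywebsite.com/index.html"
page_html = '<a href="/about.html">About</a> <a href="tutorials/misc.htm">Misc</a>'

collector = LinkCollector(page_url)
collector.feed(page_html)
links = collector.links  # the next pages the robot could request
```

Each collected URL becomes a candidate for the robot's next request, which is how a crawl spreads from one page to the rest of a site.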

These robots are also called Web Spiders, Web Crawlers, or simply Spiders, Crawlers, or Bots. A robot doesn't actually travel to a website.  Rather, its "visits" are simply requests for web pages, sent to a website so that the robot can get copies of the documents.

Search engines, like Google, have spider programs that search the web constantly.   As they search, they retrieve significant information from each page that they "crawl" and then copy this information into a huge database.  Then when a web surfer goes to Google and types in a query, the search engine can quickly retrieve this information from the database.

The more thoroughly a Web Spider crawls your site, the more information it will pick up about it, and the more of your pages will be indexed by the search engines.  This increases the chances that your pages will appear in search results.

 

What Is The "robots.txt" File?

A robots.txt file gives website owners some control over which web pages search engine robots are allowed to access and index.

"Robots.txt" is a regular text file that, because of its name, has special meaning to the majority of the search engine robots on the web. I say the majority, because there are some less than honorable SE robots that scour the web using their own set of rules, disregarding internet rules and website rules.  These robots try to find or steal information that they should not have access to.

By defining a few rules in the robots.txt file, you can instruct robots not to crawl and index certain files and directories in your site, or not to crawl your site at all. For example, you may not want Google to crawl the /images directory of your site because there is no reason to do so.  Instead, you want it to spend its precious time crawling the important pages of your website, and you don't want it using up unnecessary bandwidth from your site.

The robots.txt file is available to the public because it must be in your site's root directory and have no access restrictions set on it. Thus, anyone can see what sections of your site you don't want robots to access.

 

Using A robots.txt File To Control SE Access To Your Site

A robots.txt file specifies restrictions to Search Engine Robots that crawl websites. When a bot comes to your site,  before it accesses the pages of the site, it first checks to see if a robots.txt file exists in the root directory.  If it can find it, it will analyze its contents to see if it is allowed to retrieve any of the pages.  You can customize the file to apply only to specific robots, and to disallow access to specific directories or files.
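The check described above can be sketched with Python's standard urllib.robotparser module. The rules, the robot name "ExampleBot", and the URLs below are made up for illustration; a real robot would first download the file from the site's root directory.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real robot would fetch them from
# http://www.mywebsite.com/robots.txt before requesting any other page.
robots_lines = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a robot named "ExampleBot" (an assumed name) may fetch each URL.
tmp_ok = parser.can_fetch("ExampleBot", "http://www.mywebsite.com/tmp/notes.html")
index_ok = parser.can_fetch("ExampleBot", "http://www.mywebsite.com/index.html")

print("tmp page allowed:", tmp_ok)
print("index page allowed:", index_ok)
```

A well-behaved robot would skip any URL for which can_fetch returns False.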

A site is defined as an HTTP server running on a particular host and port number.

The main purpose of a robots.txt file is to keep search engine robots away from specific pages of your website.  You would want one if your site has content that you don't want search engines to index, or if you want the robots to index your important pages sooner by having them skip pages that do not need to be crawled and indexed.

If you want search engines to index everything in your site, you don't need a robots.txt file, nor do you even need to have an empty one.

 

Creating A "robots.txt" File

  1. Create a regular text file called "robots.txt".  URLs are case-sensitive, so the file name must be all lower-case and spelled exactly that way.
  2. Upload this file to the top-level (root) directory of your site, not to a sub-directory.  For example, place it in http://www.mywebsite.com/ but NOT in http://www.mywebsite.com/things/.

If the robots.txt file is not created as stated above, then search engines will not be able to find it and read the instructions in the file.

There can only be a single "/robots.txt" on a site, so you should not put "robots.txt" files in user directories.
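Because the file always lives at the root of a site (one host and port), its address can be derived from the URL of any page on that site. The helper below is a minimal sketch; the function name robots_txt_url and the sample URLs are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme, host, and port; replace the path with /robots.txt
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


from_subdir = robots_txt_url("http://www.mywebsite.com/things/page.html")
from_port = robots_txt_url("http://www.mywebsite.com:8080/things/")

print(from_subdir)
print(from_port)
```

Note that the same host on a different port counts as a different site, so it gets its own robots.txt address.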

 

The Format and Syntax of the robots.txt File

  1. The file consists of one or more records separated by one or more blank lines, each line being terminated by CR, CR/NL, or NL.
  2. Each record contains lines having the form of "<field>:<optionalspace><value><optionalspace>".
  3. Blank lines are not permitted within a single record in the "robots.txt" file.
  4. The field name is not case sensitive.
  5. Comments can be included in the file using the '#' character, which indicates that the preceding space (if any) and the remainder of the line are ignored.
  6. Lines containing only a comment are totally ignored.
  7. The record starts with one or more User-agent lines, followed by one or more Disallow lines.
  8. Unrecognized headers are ignored.
  9. A single record might look like the following:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /mystuff/
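The format rules above can be sketched as a small parser. This is only an illustration of those rules, not how any particular search engine reads the file; the function name parse_records and the sample text are assumptions.

```python
def parse_records(text):
    """Split robots.txt text into records, each a list of (field, value) pairs."""
    records = []
    current = []
    # splitlines() accepts CR, CR/NL, and NL line endings.
    for raw_line in text.splitlines():
        had_comment = "#" in raw_line
        # Drop the comment and any space around what remains.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            # A comment-only line is ignored; a truly blank line ends the record.
            if not had_comment and current:
                records.append(current)
                current = []
            continue
        field, sep, value = line.partition(":")
        if not sep:
            continue  # not in "<field>:<value>" form; skipped in this sketch
        # Field names are not case sensitive, so store them in lower case.
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records


sample = """\
# robots.txt for http://www.mywebsite.com/
User-agent: *
Disallow: /cgi-bin/   # comments can also start here
Disallow: /tmp/

User-Agent: Googlebot
disallow: /mystuff/
"""
records = parse_records(sample)
for record in records:
    print(record)
```

The sample yields two records, because the blank line separates them, and the differently capitalized field names in the second record are read the same way as those in the first.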

The Comment:
A line starting with '#' is a comment line.
A comment can also start at the end of a declaration statement.

# This would be a comment line.
# robots.txt for http://www.mywebsite.com/
Disallow: /cyberworld/map/   # Comments can also start here.
 

Special Characters:

  • The asterisk '*' is a wild card that means "any other User-agent".
  • You cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
    For example:  'Disallow: /tmp/*' is not valid.
    Rather, state it as 'Disallow: /tmp/'.
  • The forward slash "/" implies the root directory of the URL and includes all sub-directories and files below the root directory.

User-agent

  • There must be at least one "User-agent" field per record. If a record has more than one, the same rules apply to each robot named.
  • The value for this field is the name of the robot that the record is describing access for.
  • If the value is an asterisk '*', then the record describes the access rules for any robot that has not been specifically named in any of the other records.
  • There can only be one "User-agent" in the "/robots.txt" file that has the value of an asterisk.

Disallow:

  • The "Disallow" field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.
    For example,
    Disallow: /help        disallows both /help.html and /help/index.html, whereas
    Disallow: /help/       would disallow /help/index.html but allow /help.html.
  • An empty value for "Disallow", indicates that all URLs can be retrieved.
  • At least one "Disallow" field must be present in each record.
  • An empty "/robots.txt" file has no explicit instructions, so it will be treated as if it did not exist.
    Thus all robots will consider themselves welcome.
  • Do not put more than one path on a Disallow line, rather use multiple Disallow lines.
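The "starts with" rule for Disallow values can be sketched in a few lines. The function name is_disallowed is a hypothetical helper; it covers only plain prefix matching and an empty Disallow value, as described above.

```python
def is_disallowed(path, disallow_values):
    """Return True if path starts with any non-empty Disallow value."""
    # An empty value disallows nothing, so it is skipped.
    return any(value and path.startswith(value) for value in disallow_values)


# "Disallow: /help" blocks both /help.html and /help/index.html.
print(is_disallowed("/help.html", ["/help"]))
print(is_disallowed("/help/index.html", ["/help"]))

# "Disallow: /help/" blocks /help/index.html but not /help.html.
print(is_disallowed("/help/index.html", ["/help/"]))
print(is_disallowed("/help.html", ["/help/"]))

# An empty "Disallow:" value means every URL can be retrieved.
print(is_disallowed("/anything.html", [""]))
```

This is why the trailing slash matters: it is simply part of the prefix being compared.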

 

Examples

1)  A Basic "robots.txt":
This example declares that all robots (indicated by "*") are not to index any pages or directories on your website (the slash "/" implies your entire website.)
It's probably rare that you would want to use such a format.

User-agent: *
Disallow: /

REMEMBER: This declaration will only apply to well-behaved robots.  A robot could be programmed to ignore statements in your robots.txt file, or even to ignore the file altogether.  This is especially true of malware robots that scan the web for security weaknesses, and of the email address harvesters that spammers use to collect email addresses.

2)  Deny Access to a single SE Spider:
You can specify that a particular search engine spider should not access your site.  Here, we tell the Google Image bot not to crawl your website images (thus keeping them from being searchable).

User-agent: Googlebot-Image
Disallow: /

3)  Keep all search engine robots from crawling specific directories and pages:
Again, the asterisk means all search engines, and then they are told not to crawl two different areas (all URLs that begin with "/tmp/" or "/privatearea/map/"), a single page "misc.htm" in the tutorials directory and a single page "special.htm" in the root directory.

User-agent: *
Disallow: /tmp/
Disallow: /privatearea/map/
Disallow: /tutorials/misc.htm
Disallow: /special.htm

4)  Only allow certain bots to crawl specific areas of your website:
Here we first declare that crawlers should not crawl any part of the site and then go on to specify that the Google bot can crawl the entire site except for any URLs that begin with "/cgi-bin/" or "/tools/".  Thus you can see that the more specific declarations take precedence over the more general declarations.

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tools/
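One way to check that the Googlebot record takes precedence over the general one is to run the rules through Python's standard urllib.robotparser module. This is a sketch of how that one parser reads the file, not a statement about how Google's own crawler is implemented; the robot name "OtherBot" and the page URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# The two records from example 4, separated by a blank line.
robots_text = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tools/
"""

parser = RobotFileParser()
parser.parse(robots_text.splitlines())

site = "http://www.mywebsite.com"
googlebot_index = parser.can_fetch("Googlebot", site + "/index.html")
googlebot_tools = parser.can_fetch("Googlebot", site + "/tools/report.html")
otherbot_index = parser.can_fetch("OtherBot", site + "/index.html")

print("Googlebot, /index.html:", googlebot_index)
print("Googlebot, /tools/report.html:", googlebot_tools)
print("OtherBot, /index.html:", otherbot_index)
```

Googlebot is governed only by its own record, so it may fetch everything outside /cgi-bin/ and /tools/, while any robot without a record of its own falls back to the "*" record and is kept out entirely.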

5)  Only allow certain bots to crawl all of your website:
Notice how first you declare that all bots "*" are not to crawl any part of your website "/", and then you specify that a particular bot (Alexa) can crawl all of your website.  Not having anything after the second "Disallow:" means nothing is disallowed, so that means that everything is allowed for the previously specified bot.

User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:

6)  Use Allow to only allow Google to crawl your website:
Currently, "Allow" is not part of the standard robots.txt protocol, but Google states in its Webmaster FAQs that you can use "Allow" for its bot.  So if you only want to allow the Google bot to crawl your site, you could use the following statements.

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

 

A List of Search Engine Robots and their Home Page

If you are interested in seeing a list of search engine robots and where they come from (the search engines that they work for), there are many lists to be found on the web.  Here is a pretty good one:
http://www.jafsoft.com/searchengines/webbots.html#search_engine_robots_and_others

Copyright © 2007-2008   Donald Dean Websites - All Rights Reserved