Robots.txt File
Robots Overview
A robot is a program that automatically travels around the Web and retrieves documents for the search engines. It also follows the hyperlinks on the pages that it finds and thus finds more pages to retrieve. These programs are referred to as Web Spiders, Web Crawlers, Web Robots, or just plain Spiders, Crawlers, Robots, or Bots. The software that controls these robots doesn't actually send a robot to a website. Rather, these visits are simply requests for web pages from a website so that the robot can get copies of the documents.
Search engines, like Google, have spider programs that search the web constantly. As they search, they retrieve significant information from each page that they "crawl" and then copy this information into a huge database. Then, when a web surfer goes to Google and types in a query, the search engine can quickly retrieve this information from the database. The more thoroughly a Web Spider crawls your site, the more information it will pick up about it, and the more of your pages will be indexed by the search engines. This increases the chances that your pages will appear in search results.
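The retrieve-and-follow loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the PAGES dictionary stands in for the Web, and its URLs, along with the names LinkExtractor and crawl, are made up for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real robot would request
# these documents from a web server instead.
PAGES = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="products.html">Products</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/products.html": '<a href="/about.html">About</a> <a href="/contact.html">Contact</a>',
    "http://example.com/contact.html": "No links here.",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url):
    """Retrieve a page, queue the links found on it, and repeat."""
    seen = {start_url}
    queue = deque([start_url])
    retrieved = []
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # nothing to retrieve at this URL
        retrieved.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return retrieved


print(crawl("http://example.com/"))
```

Starting from one page, the sketch reaches every page that is linked from a page it has already retrieved, which is why a well-linked site gets crawled more thoroughly.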
What Is The "robots.txt" File?
For website owners, a robots.txt file provides some control over which web pages search engine robots are allowed to access and index. "Robots.txt" is a regular text file that, because of its name, has special meaning to the majority of the search engine robots on the web. I say the majority because there are some less than honorable SE robots that scour the web using their own set of rules, disregarding internet rules and website rules. These robots try to find or steal information that they should not have access to.
By defining a few rules in a robots.txt file, you can instruct robots not to crawl and index certain files and directories in your site, or not to crawl your site at all. For example, you may not want Google to crawl the /images directory of your site because there is no reason to do so. Instead, you want it to spend its precious time crawling the important pages of your website, and you don't want it using up bandwidth from your site unnecessarily.
The robots.txt file is available to the public because it must be in your site's root directory and have no access restrictions set on it. Thus, anyone can see which sections of your site you don't want robots to access.
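The /images case can be tried out with Python's standard urllib.robotparser module. This is a minimal sketch under stated assumptions: the two-line rule set, the www.example.com host, and the bot name SomeOtherBot are invented for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: ask Googlebot to stay out of the /images directory.
RULES = """\
User-agent: Googlebot
Disallow: /images/
"""


def build_parser(text):
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)

# A well-behaved Googlebot would skip the image but still fetch the page.
images_allowed = parser.can_fetch("Googlebot", "http://www.example.com/images/logo.png")
page_allowed = parser.can_fetch("Googlebot", "http://www.example.com/index.html")

# The record names only Googlebot, so other robots are not restricted by it.
other_bot_allowed = parser.can_fetch("SomeOtherBot", "http://www.example.com/images/logo.png")

print(images_allowed, page_allowed, other_bot_allowed)
```

Note that the check only reports what the rules ask for. Nothing in the file itself stops a robot that chooses to ignore it.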
Using A robots.txt File To Control SE Access To Your Site
A robots.txt file specifies restrictions for the Search Engine Robots that crawl websites. When a bot comes to your site, before it accesses any pages, it first checks whether a robots.txt file exists in the root directory. If it finds the file, it analyzes its contents to see which pages it is allowed to retrieve. You can customize the file to apply only to specific robots, and to disallow access to specific directories or files. A site is defined as an HTTP server running on a particular host and port number.
The main purpose of a robots.txt file is to prevent search engine robots from accessing specific pages of your website. Thus, you would want a robots.txt file if your site has content that you don't want search engines to index, or if you want the robots to index your important pages sooner by having them skip pages that do not need to be crawled and indexed. If you want search engines to index everything in your site, you don't need a robots.txt file, not even an empty one.
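Because a site is identified by its host and port, the location of its robots.txt file can be worked out from the address of any page on it. The helper below is a small sketch of that step. The function name robots_url and the example.com addresses are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url.

    The scheme, host, and port are kept. The path, query string, and
    fragment are replaced by /robots.txt in the root directory.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/item.html?id=3"))
print(robots_url("http://www.example.com:8080/~user/page.html"))
```

The second address shows two things: a server on a different port counts as a separate site with its own robots.txt, and a page inside a user directory still maps to the single file in the root.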
Creating A "robots.txt" File
If the robots.txt file is not created as stated above, then search engines will not be able to find it and read the instructions in the file. There can only be a single "/robots.txt" on a site, so you should not put "robots.txt" files in user directories.
The Format and Syntax of the robots.txt File
The Comment: # This would be a comment line.
Special Characters:
User-agent
Disallow:
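To make the pieces of the syntax concrete, here is a rough sketch of how a program might split a robots.txt file into field and value pairs while discarding comments. It is not the parser that any search engine uses. The function name parse_lines and the sample rules (including the /cgi-bin/ path) are assumptions for illustration.

```python
def parse_lines(text):
    """Return (field, value) pairs, skipping blank lines and # comments."""
    pairs = []
    for raw in text.splitlines():
        # Everything from "#" to the end of the line is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        # Field names are matched without regard to case.
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


# Assumed sample file, written only for this example.
SAMPLE = """\
# This would be a comment line.
User-agent: *
Disallow: /cgi-bin/   # a comment may also follow a rule
Disallow:
"""

pairs = parse_lines(SAMPLE)
for field, value in pairs:
    print(field, "=", repr(value))
```

Each User-agent line names the robot that the rules that follow apply to, and each Disallow line gives a path prefix that robot is asked not to request. A Disallow line with an empty value places no restriction.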
Examples
1) A Basic "robots.txt": User-agent: *
REMEMBER: This declaration will only apply to "Program Friendly" robots. A Robot could be programmed to ignore statements in your robots.txt file or even ignore the file altogether. This would be especially true of Malware Robots that scan the web for security weaknesses and also for email address harvester programs that are used by spammers to collect email addresses.
2) Deny Access to a single SE Spider: User-agent: Googlebot-Image
3) Keep all search engine robots from crawling specific directories and pages: User-agent: *
4) Only allow certain bots to crawl specific areas of your website: User-agent: * *
5) Only allow certain bots to crawl all of your website: User-agent: *
6) Use Allow to Only allow Google to crawl your website: User-agent: *
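Rule sets of the kinds listed in these examples can be tested before they are published. The sketch below uses Python's standard urllib.robotparser for three of the cases. The rule texts are assumed for illustration and are not the article's original listings. The paths, the www.example.com host, the bot name OtherBot, and the helper name allowed are also hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_text, agent, path):
    """Report whether the given rules let the named robot fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


# Assumed rules: deny access to a single spider.
DENY_ONE = """\
User-agent: Googlebot-Image
Disallow: /
"""

# Assumed rules: keep all robots out of specific directories and pages.
DENY_PATHS = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/notes.html
"""

# Assumed rules: use Allow so that only Googlebot may crawl the site.
ONLY_GOOGLE = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

print(allowed(DENY_ONE, "Googlebot-Image", "/photo.jpg"))
print(allowed(DENY_PATHS, "OtherBot", "/cgi-bin/search"))
print(allowed(ONLY_GOOGLE, "Googlebot", "/index.html"))
print(allowed(ONLY_GOOGLE, "OtherBot", "/index.html"))
```

In the last rule set, a blank line separates the record for Googlebot from the record for every other robot, and the parser applies the first record whose User-agent matches.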
A List of Search Engine Robots and their Home Page
If you are interested in seeing a list of search engine robots and where they come from (the search engines that they work for), there are many lists to be found on the web. Here is a pretty good one: