Docs: Add information about the robots.txt file
arthurvr committed Oct 3, 2014
1 parent 6e5d83a commit a871a72
Showing 1 changed file with 26 additions and 0 deletions.
doc/misc.md

@@ -23,3 +23,29 @@ globally ignore:

* More on global ignores: https://help.github.com/articles/ignoring-files
* Comprehensive set of ignores on GitHub: https://github.com/github/gitignore

## robots.txt

The `robots.txt` file is used to tell web robots (crawlers) which parts of the
website they are allowed to crawl.

By default, the file provided by this project includes the following two lines
(shown together below):

* `User-agent: *` - the following rules apply to all web robots
* `Disallow:` - everything on the website is allowed to be crawled
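Taken together, these two default directives form a file like this:

```
# Applies to all web robots; an empty Disallow allows everything to be crawled
User-agent: *
Disallow:
```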

If you want to disallow crawling of certain pages, specify the path in a
`Disallow` directive (e.g.: `Disallow: /path`) or, if you want to disallow
crawling of all content, use `Disallow: /`.
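For example, a minimal sketch of a `robots.txt` that blocks two hypothetical
paths (the `/private/` and `/drafts/` paths are placeholders, not part of this
project) while allowing everything else:

```
# Block the placeholder paths below for all robots; everything else stays crawlable
User-agent: *
Disallow: /private/
Disallow: /drafts/
```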

The `/robots.txt` file is not intended for access control, so don't try to
use it as such. Think of it as a "No Entry" sign, rather than a locked door.
URLs disallowed by the `robots.txt` file might still be indexed without being
crawled, and the content from within the `robots.txt` file can be viewed by
anyone, potentially disclosing the location of your private content! So, if
you want to block access to private content, use proper authentication instead.

For more information about `robots.txt`, please see:

* [robotstxt.org](http://www.robotstxt.org/)
* [How Google handles the `robots.txt` file](https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)
