Docs: Add information about the robots.txt file
arthurvr committed Oct 3, 2014
1 parent 6e5d83a commit a871a72
Showing 1 changed file with 26 additions and 0 deletions.
doc/misc.md

@@ -23,3 +23,29 @@ globally ignore:

* More on global ignores: https://help.github.com/articles/ignoring-files
* Comprehensive set of ignores on GitHub: https://github.com/github/gitignore

## robots.txt

The `robots.txt` file is used to tell web robots (crawlers) which parts of the
website they are allowed to crawl.

By default, the file provided by this project includes the following two lines
(shown together below):

* `User-agent: *` - the following rules apply to all web robots
* `Disallow:` - everything on the website is allowed to be crawled
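Taken together, these two default directives form a file like this:

```
# Applies to all web robots; an empty Disallow allows everything to be crawled
User-agent: *
Disallow:
```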

If you want to disallow crawling of certain pages, specify the path in a
`Disallow` directive (e.g.: `Disallow: /path`) or, if you want to disallow
crawling of all content, use `Disallow: /`.
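For example, a minimal sketch of a `robots.txt` that blocks two hypothetical
paths (the `/private/` and `/drafts/` paths are placeholders, not part of this
project) while allowing everything else:

```
# Block the placeholder paths below for all robots; everything else stays crawlable
User-agent: *
Disallow: /private/
Disallow: /drafts/
```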

The `/robots.txt` file is not intended for access control, so don't try to
use it as such. Think of it as a "No Entry" sign, rather than a locked door.
URLs disallowed by the `robots.txt` file might still be indexed without being
crawled, and the content from within the `robots.txt` file can be viewed by
anyone, potentially disclosing the location of your private content! So, if
you want to block access to private content, use proper authentication instead.

For more information about `robots.txt`, please see:

* [robotstxt.org](http://www.robotstxt.org/)
* [How Google handles the `robots.txt` file](https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)
