From e374f1713a7f94e1e16d81c5bb98608725d629c1 Mon Sep 17 00:00:00 2001
From: Google Code Exporter
Date: Sat, 14 Mar 2015 01:43:10 -0400
Subject: [PATCH] Migrating wiki contents from Google Code

---
 AboutHarvestMan.md         |   33 ++
 ConfigXml.md               |  108 ++++
 FAQ.md                     | 1068 ++++++++++++++++++++++++++++++++++++
 FAQ_NEW.md                 |    2 +
 HarvestMan.md              |    0
 InstallHarvestMan.md       |   62 +++
 NewDevelopersNotes.md      |   30 +
 ProjectHome.md             |    3 +
 UsingHarvestMan.md         |   99 ++++
 WorkAroundHttpForbidden.md |   46 ++
 WorldsSimplestCrawler.md   |  177 ++++++
 WritingCustomCrawlers.md   |  196 +++++++
 bot.md                     |    5 +
 13 files changed, 1829 insertions(+)
 create mode 100644 AboutHarvestMan.md
 create mode 100644 ConfigXml.md
 create mode 100644 FAQ.md
 create mode 100644 FAQ_NEW.md
 create mode 100644 HarvestMan.md
 create mode 100644 InstallHarvestMan.md
 create mode 100644 NewDevelopersNotes.md
 create mode 100644 ProjectHome.md
 create mode 100644 UsingHarvestMan.md
 create mode 100644 WorkAroundHttpForbidden.md
 create mode 100644 WorldsSimplestCrawler.md
 create mode 100644 WritingCustomCrawlers.md
 create mode 100644 bot.md

diff --git a/AboutHarvestMan.md b/AboutHarvestMan.md
new file mode 100644
index 0000000..aa91205
--- /dev/null
+++ b/AboutHarvestMan.md
@@ -0,0 +1,33 @@
# What is HarvestMan #

HarvestMan is an open source, multi-threaded, modular, extensible web crawler program/framework in pure Python.

HarvestMan can be used to download files from websites according to a number of user-specified rules. The latest version of HarvestMan supports more than 60 customization options. HarvestMan is a console (command-line) application.

HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.

# History of HarvestMan #
 1. The HarvestMan crawler was started by Anand B Pillai in June 2003 as a hobby project to develop a personal web crawler in Python, along with Nirmal Chidambaram.
 1. Nirmal wrote the original code in mid June 2003 (one module, a single-threaded crawler), which Anand improved substantially and developed into a multithreaded crawler.
 1. The first version (0.8) was released by Anand in July 2003.
 1. Released on [freshmeat](http://www.freshmeat.net/projects/harvestman) (1.3) in Dec 2003.
 1. Eight releases were done between Dec 2003 (1.3) and Dec 2004 (1.4).
 1. The project was chosen as the crawler for the [EIAO](http://www.eiao.net) web accessibility observatory in Feb 2005. EIAO chose version 1.4, which then underwent several minor releases.
 1. The most recent release is 1.4.6, released in Sep 2005.
 1. Since early 2006, HarvestMan has been undergoing development along with the EIAO project (mostly driven by EIAO feedback), but no public releases have been done.
 1. Version 1.5 started development in mid 2006, but was never released.
 1. Version 1.4.6 was accepted into Debian in March 2006.
 1. By mid 2007 the program had accumulated so many changes that the version number under development was incremented from 1.5 to 2.0. Version 2.0 has effectively been under development since mid 2006, but most code changes happened after mid 2007.
 1. Version 1.4.6 got into the Ubuntu repositories in May 2007.
 1. Development was hosted at [BerliOS](http://developer.berlios.de/projects/harvestman) till June 2008, when it was moved to Google Code.
 1. Contributors to 2.0 till June 2008 - Anand B Pillai (main), Nils Ultveit Moe (EIAO), Morten Goodwin Olsen (EIAO), John Kleven.
 1. Version 2.0 alpha package releases started in Aug 2007 on the website.
 1. HarvestMan won the FOSS India Award in April 2008.
 1. In June 2008, Lukasz Szybalski joined the team.

## Future of HarvestMan ##
> It is a brave new world out there... :-)
> Well, development currently stands at 2.0.5 beta, i.e. the 2.0 version is not
> yet complete. Development is slow and I need to take time off from a regular
> job to do this, so I can't give a final date for it, but hopefully one day
> it will be done :)
\ No newline at end of file
diff --git a/ConfigXml.md b/ConfigXml.md
new file mode 100644
index 0000000..09019ea
--- /dev/null
+++ b/ConfigXml.md
@@ -0,0 +1,108 @@
# HarvestMan config.xml #
## Configuration File Structure ##

The configuration file is split into categories that group the configuration options into different sections. At present, the configuration file has the following namespaces:

 1. **project** - This section holds the options related to the current HarvestMan project.
 1. **network** - This section holds the configuration options related to your network connection.
 1. **download** - This section holds configuration options that affect your downloads in a generic way.
 1. **control** - This section is similar to the above one, but holds options that affect your downloads in a much more specific way. It is a kind of 'tweak' section that allows you to exert more fine-grained control over your projects.
 1. **system** - This section controls the threading options, regional (locale) options and any other options related to the Python interpreter and your computer.
 1. **indexer** - This section holds variables related to how the files are processed after downloading. Right now it holds variables related to localizing links.
 1. **files** - This section holds variables that control the files created by HarvestMan, namely the error log, the message log and an optional URL log.
 1. **display** - This holds a single variable related to creating a browser page for all HarvestMan projects on your computer.
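
Put together, a config.xml file therefore has the following overall skeleton. The element names below are exactly the top-level sections listed above, as they appear in the full sample configuration reproduced in the FAQ; only the nested options are elided here:
```
<HarvestMan>
    <config version="3.0" xmlversion="1.0">
        <project>  ... </project>
        <network>  ... </network>
        <download> ... </download>
        <control>  ... </control>
        <system>   ... </system>
        <files>    ... </files>
        <indexer>  ... </indexer>
        <display>  ... </display>
    </config>
</HarvestMan>
```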

## Control Section ##
### fetchlevel ###

HarvestMan defines five fetchlevels, with values ranging from 0 to 4 inclusive. These define the rules for the download of files from servers other than the server of the starting URL. In general, **increasing the fetch level allows the program to crawl more files** on the Internet.

A fetchlevel of "0" provides the maximum constraint for the download. This limits the download of files to all paths in the starting server, only inside and below the directory of the starting URL.

For example, with a fetchlevel of zero, if your starting URL is http://www.foo.com/bar/images/images.html, the program will download only those files inside the *images* sub-directory and directories below it, and no other file.

The next level, a fetchlevel of "1", again limits the download to the starting server (and sub-domains in it, if the sub-domain variable is not set), but no longer restricts it to the directory of the starting URL; it still does not allow the program to crawl sites other than the starting server. In the above example, this will fetch all links in the server http://www.foo.com encountered in the starting page.

A fetchlevel of "2" performs a fetching of all links in the starting server encountered in the starting URL, as well as any links in outside (external) servers linked directly from pages in the starting server. It does not allow the program to crawl pages linked further away, i.e. the second-level links linked from the external servers.

A fetchlevel of "3" acts like a combination of fetchlevels "0" and "2" minus "1". That is, it gets all links under the directory of the starting URL plus first-level external links, but does not fetch links outside the directory of the starting URL.

A fetchlevel of "4" gives the user no control over the levels of fetching; the program will crawl whichever link is available to it, unless limited by other download control options such as depth control, domain filters, URL filters, file limits, maximum server limits, etc.

Place the parameter in the **control** element under the **extent** section.
Here is a sample XML element including this param (the element follows the same `value` attribute form as the rest of the configuration file):
```
<control>
  ...
  <extent>
    <fetchlevel value="0"/>
    ...
  </extent>
  ...
</control>
```

**The value can be 0, 1, 2, 3 or 4.**

See the FAQ for more explanations.

### maxbandwidth ###
MaxBandwidth controls the speed of crawling. Throttling of bandwidth is useful when we are downloading a huge amount of data from a host; it prevents the user from imposing an effective denial of service on the crawled server. By using this configuration variable you can, for example, limit your download speed to 5 KB per second. At that speed the host should have no problem serving your crawl while proceeding with its normal operations.

Place the parameter in the **control** element under the **limits** section.
Here is a sample XML element including this param (same attribute form as above):
```
<control>
  ...
  <limits>
    ...
    <maxbandwidth value="5"/>
    ...
  </limits>
  ...
</control>
```

**The value needs to be specified in KB/sec, not in bytes/sec.**

### maxbytes ###
MaxBytes controls how many bytes your crawl will download. It is useful when we are downloading a huge amount of data from a host and, in conjunction with maxbandwidth, want to limit how much data we download. By using this configuration variable together with maxbandwidth you can, for example, set your crawl to download 10 MB at 5 KB/s. With this fine-grained control of your download size and speed, the host should have no problem serving your crawl while proceeding with its normal operations.

Place the parameter in the **control** element under the **limits** section.
Here is a sample XML element including this param:
```
<control>
  ...
  <limits>
    ...
    <maxbytes value="10MB"/>
    ...
  </limits>
  ...
</control>
```

**The value accepts plain numbers (assumed to be bytes), KB, MB and GB.**
```
<maxbytes value="5000"/>  <!-- End crawl at 5000 bytes -->
<maxbytes value="10KB"/>  <!-- End crawl at 10 KB -->
<maxbytes value="50MB"/>  <!-- End crawl at 50 MB -->
<maxbytes value="1GB"/>   <!-- End crawl at 1 GB -->
```
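
To make the throttling arithmetic concrete, here is a minimal sketch (in modern Python, with hypothetical names; this is not HarvestMan's implementation) of how a download loop can hold itself to a configured bandwidth: after every chunk received, sleep long enough that the average rate stays at or below the limit.
```
import time

class Throttle:
    """Cap the average download rate at max_bytes_per_sec (illustrative)."""

    def __init__(self, max_bytes_per_sec):
        self.limit = float(max_bytes_per_sec)
        self.start = time.time()
        self.received = 0

    def account(self, nbytes):
        self.received += nbytes
        # How long this many bytes *should* take at the configured rate.
        expected = self.received / self.limit
        elapsed = time.time() - self.start
        if expected > elapsed:
            time.sleep(expected - elapsed)

throttle = Throttle(5 * 1024)   # ~5 KB/s, as in the example above
# In a download loop, call throttle.account(len(chunk)) after each chunk.
```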
\ No newline at end of file
diff --git a/FAQ.md b/FAQ.md
new file mode 100644
index 0000000..dbf965b
--- /dev/null
+++ b/FAQ.md
@@ -0,0 +1,1068 @@
## This is still a work in progress and has been lifted almost verbatim from the HarvestMan web-site. A lot of the information is out of date and needs to be updated. The FAQ also doesn't conform to wiki style, so proceed with care! ##

HarvestMan - FAQ
Version 2.0

NOTE: The FAQ is currently being modified to be in sync with HarvestMan 1.4, so you might find that some parts of the FAQ are inconsistent with the rest of it. This is because some of the FAQ has been modified, while the rest is still to be modified.

 * 1. Overview
   * 1.1. What is HarvestMan?
   * 1.2. Why do you call it HarvestMan?
   * 1.3. What can HarvestMan be used for?
   * 1.4. What HarvestMan cannot be used for...
   * 1.5. What do I need to run HarvestMan?
 * 2. Usage
   * 2.1. How do I run HarvestMan?
   * 2.2. What is the HarvestMan Configuration (config) file?
   * 2.3. Can HarvestMan be run as a command-line application?
 * 3. Architecture
   * 3.1. What are "tracker" threads and what is their function?
   * 3.2. What are "crawler" threads?
   * 3.3. What are "fetcher" threads?
   * 3.4. How do the crawlers and fetchers co-operate?
   * 3.5. How many different queues of information flow are there?
   * 3.6. What are worker (downloader) threads?
   * 3.7. How does a HarvestMan project finish?
 * 4. Protocols & File Types
   * 4.1. What are the protocols supported by HarvestMan?
   * 4.2. What kind of files can be downloaded by HarvestMan?
   * 4.3. Can HarvestMan run javascript code?
   * 4.4. Can HarvestMan run java applets?
   * 4.5. How to prevent downloading of HTML & CGI forms?
   * 4.6. Does HarvestMan download dynamically generated cgi files (server-side)?
   * 4.7. How does HarvestMan determine the filetype of dynamically generated server-side files?
   * 4.8. Does HarvestMan obey the Robots Exclusion Protocol?
   * 4.9. Can I restart a project to download links that failed (caching mechanism)?
 * 5. Network, Security & Access Rules
   * 5.1. Can HarvestMan work across proxies?
   * 5.2. Does HarvestMan support proxy authentication?
   * 5.3. Does HarvestMan work inside an intranet?
   * 5.4. Can HarvestMan crawl a site that requires HTTP authentication?
   * 5.5. Can HarvestMan crawl a site that requires HTTPS (SSL) authentication?
   * 5.6. Can I prevent the program from accessing specific domains?
   * 5.7. Can I specify download filters to prevent download of certain files or directories on a server?
   * 5.8. Is it possible to control the depth of traversal in a domain?
 * 6. Download Control - Basic
   * 6.1. Can I set a limit on the maximum number of files that are downloaded?
   * 6.2. Can I set a limit on the number of external servers crawled?
   * 6.3. Can I set a limit on the number of outside directories that are crawled?
   * 6.4. How can I prevent download of images?
   * 6.5. How can I prevent download of stylesheets?
   * 6.6. How to disable traversal of external servers?
   * 6.7. Can I specify a project timeout?
   * 6.8. Can I specify a thread timeout for worker threads?
   * 6.9. How to tell the program to retry failed links?
 * 7. Download Control - Advanced
   * 7.1. What are fetchlevels and how can I use them?
 * 8. Application development & customization
   * 8.1. I want to customize HarvestMan for a research project. Can you help out?
   * 8.2. I want to customize HarvestMan for a commercial project. Can you help out?
 * 9. Diagrams
   * 9.1. HarvestMan Class Diagram

1. Overview

1.1. What is HarvestMan?
HarvestMan (with a capital 'H' and a capital 'M') is a webcrawler program. HarvestMan belongs to a family of programs frequently referred to as webcrawlers, webbots, web-robots, offline browsers, etc.

These programs are used to crawl a distributed network of computers like the Internet and download files locally.

1.2. Why do you call it HarvestMan?
The name "HarvestMan" is derived from a kind of small spider-like arachnid found in different parts of the world, called "daddy longlegs" or Opiliones.

Since this program is a web-spider, the analogy was compelling to name it after some species of spider. Also, the process of downloading data from websites is sometimes called harvesting.

Both these similarities gave rise to the name HarvestMan.

1.3. What can HarvestMan be used for?
HarvestMan is a desktop tool for web search/data gathering. It works on the client side.

As of the most recent version, HarvestMan can be used for:

 1. Downloading a website or a part of it.
 1. Downloading certain files from a website (matching certain patterns).
 1. Searching a website for keywords & downloading the files containing them.
 1. Scanning a website for links and downloading them selectively using filters.

1.4. What HarvestMan cannot be used for...
HarvestMan is a small-to-medium size web-crawler mostly intended for personal use or for use by a small group. It cannot be used for massive data harvesting from the web. However, a project to create a large-scale, distributed web crawler based on HarvestMan is underway. It is called 'Distributed HarvestMan', or 'D-HarvestMan' in short. D-HarvestMan is currently at a prototype stage.

Projects like EIAO have been able to customize HarvestMan for medium-to-large scale data gathering from the Internet. The EIAO project uses HarvestMan to download as many as 100,000 files from European websites daily.

What HarvestMan is not:

 1. HarvestMan is not an Internet search engine.
 1. HarvestMan is not an indexer or taxonomy tool for web documents.
 1. HarvestMan is not a server-side program.

1.5. What do I need to run HarvestMan?
HarvestMan is written in a programming language called Python. Python is an interactive, interpreted, object-oriented programming language created by Guido van Rossum and maintained by a team of volunteers from all over the world. Python is a very versatile language which can be used for a variety of tasks ranging from scripting to web frameworks to developing highly complex applications.

HarvestMan is written completely in Python. It works with Python version 2.3 upward on all platforms where Python runs. However, HarvestMan has some performance optimizations that require the latest version of Python, which is Python 2.4; that is the suggested version. HarvestMan will also work with Python 2.3, but with reduced performance.

You need a machine with a rather large amount of RAM to run HarvestMan. HarvestMan tends to use system memory heavily, especially when performing large data downloads or when run with more than 10 threads. It is preferable to have a machine with 512 MB RAM and a fast CPU (Intel Pentium IV or higher) to run HarvestMan efficiently.

2. Usage

2.1. How do I run HarvestMan?
HarvestMan is a command-line application. It has no GUI.

From the 1.4 version, HarvestMan can be run by calling the main HarvestMan module as an executable script on the command line, as follows:

% harvestman.py

This works provided you have edited your PATH environment variable to include the local HarvestMan installation directory on your machine. If you have not, you can run HarvestMan by passing the harvestman.py module as an argument to the Python interpreter, as follows:

% python harvestman.py

On Win32 systems, if you have associated the ".py" extension with the appropriate python.exe, you can run HarvestMan without invoking the interpreter explicitly.

Note that this assumes you have a config file named config.xml in the directory from which you invoke HarvestMan. If you don't have a config file locally, you need to use the command-line options of HarvestMan to pass a different configuration file to the program.

2.2. What is the HarvestMan Configuration (config) file?
The standard way to run HarvestMan is to run the program with no arguments, allowing it to pick up its configuration parameters from an XML configuration file, which is named config.xml by default.

It is also possible to pass command-line options to HarvestMan. HarvestMan supports a limited set of command-line options which allow you to run the program without using a configuration file. You can learn more about the command-line options in the HarvestMan command-line options FAQ.

The HarvestMan configuration file is an XML file with the configuration options split into different elements and their hierarchies. A typical HarvestMan configuration file looks as follows:

```
<HarvestMan>
    <config version="3.0" xmlversion="1.0">
        <project>
            <url>http://www.python.org/doc/current/tut/tut.html</url>
            <name>pytut</name>
            <basedir>~/websites</basedir>
            <verbosity value="3"/>
            <timeout value="600.0"/>
        </project>

        <network>
            <proxy>
                <proxyserver></proxyserver>
                <proxyuser></proxyuser>
                <proxypasswd></proxypasswd>
                <proxyport value=""/>
            </proxy>
            <urlserver status="0">
                <urlhost>localhost</urlhost>
                <urlport value="3081"/>
            </urlserver>
        </network>

        <download>
            <types>
                <html value="1"/>
                <images value="1"/>
                <javascript value="1"/>
                <javaapplet value="1"/>
                <forms value="0"/>
                <cookies value="1"/>
            </types>
            <cache status="1">
                <datacache value="1"/>
            </cache>
            <misc>
                <retries value="1"/>
                <tidyhtml value="1"/>
            </misc>
        </download>

        <control>
            <links>
                <imagelinks value="1"/>
                <stylesheetlinks value="1"/>
            </links>
            <extent>
                <fetchlevel value="0"/>
                <extserverlinks value="0"/>
                <extpagelinks value="1"/>
                <depth value="10"/>
                <extdepth value="0"/>
                <subdomain value="0"/>
            </extent>
            <limits>
                <maxextservers value="0"/>
                <maxextdirs value="0"/>
                <maxfiles value="5000"/>
                <maxfilesize value="1048576"/>
                <connections value="5"/>
                <requests value="5"/>
                <timelimit value="-1"/>
            </limits>
            <rules>
                <robots value="1"/>
                <urlpriority></urlpriority>
                <serverpriority></serverpriority>
            </rules>
            <filters>
                <urlfilter></urlfilter>
                <serverfilter></serverfilter>
                <wordfilter></wordfilter>
                <junkfilter value="0"/>
            </filters>
        </control>

        <system>
            <workers status="1" size="10" timeout="200"/>
            <trackers value="4"/>
            <locale>american</locale>
            <fastmode value="1"/>
        </system>

        <files>
            <urllistfile></urllistfile>
            <urltreefile></urltreefile>
        </files>

        <indexer>
            <localise value="2"/>
        </indexer>

        <display>
            <browsepage value="1"/>
        </display>
    </config>
</HarvestMan>
```

The current configuration file holds more than 60 configuration options. The variables that are essential to a project are project.url, project.name and project.basedir. These determine the identity of a HarvestMan crawl and normally require unique values for each HarvestMan project.

For a more detailed discussion of the config file, see the ConfigXml page.

2.3. Can HarvestMan be run as a command-line application?
Yes, it can. For details on this, refer to the Command line FAQ.

3. Architecture

3.1. HarvestMan is a multithreaded program. What is the threading architecture of HarvestMan?
HarvestMan uses a multithreaded architecture. It assigns specific functions to each thread, which helps the program complete its downloads at a relatively fast pace.

HarvestMan is a network-bound program. This means that most of the program's time is spent waiting for network connections, fetching network data and closing the connections. HarvestMan can be considered not to be IO-bound, since we can assume that there is ample disk space for the downloads, at least in most common cases.

Whenever a program is network-bound or IO-bound, it helps to split the task across multiple threads of control, which perform their functions without affecting other threads or the main thread.

HarvestMan uses this theory to create a multithreaded system of co-operating threads, most of which gather data from the network, process the data and write the files to disk. These threads are called tracker threads. The name is derived from the fact that such a thread tracks a web-page, downloads its links and further tracks each of the pages pointed to by the links, doing this recursively for each link.

HarvestMan uses a pre-emptive threaded architecture where the trackers are launched when the program starts. They wait in turn for work, which is managed by a thread-safe queue of data. Tracker threads post data to and retrieve data from the queue. These threads die only at the end of the program; otherwise they spin in a loop, looking for data.

There are two different kinds of trackers, namely crawlers and fetchers. These are described in the sections below.

3.2. What are "crawler" threads?
Crawlers, or crawler-threads, are trackers which perform the specific function of parsing a web-page. They parse the data from a web-page, extract the links, and post the links to a URL queue.

The crawlers get their data from a data queue.

3.3. What are "fetcher" threads?
Fetchers, or fetcher-threads, are trackers which perform the function of "fetching", i.e. downloading the files pointed to by URLs. They download URLs which do not produce web-page content (HTML/XHTML) statically or dynamically, i.e. non-webpage URLs such as images, PDF files, ZIP files, etc.

The fetchers get their data from the URL queue, and they post web-page data to the data queue.

3.4. How do the crawlers and fetchers co-operate?
The design of HarvestMan forces the crawlers and fetchers to be synergic. This is because the crawlers obtain their data (web-page data) from the data queue and post their results to the URL queue. The fetchers in turn obtain their data (URLs) from the URL queue and post their results to the data queue.

The program starts off by spawning the first thread, which is a fetcher. It gets the web-page data for the starting page and posts it to the data queue. The first crawler in line gets this data, parses it, extracts the links and posts them to the URL queue. The next fetcher thread waiting on the URL queue gets this data, and the process repeats in a synergic manner till the program runs out of URLs to parse, at which point the project ends. A minimal sketch of this two-queue arrangement is shown below.
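
The following is an illustrative sketch of that two-queue pattern, written in modern Python for brevity (HarvestMan itself targeted Python 2.3/2.4). The names are hypothetical, and the real program adds download rules, duplicate checks and many more threads:
```
import re
import threading
import time
import urllib.request
from queue import Queue

url_queue = Queue()    # crawlers feed this; fetchers feed off it
data_queue = Queue()   # fetchers feed this; crawlers feed off it

def fetcher():
    while True:
        url = url_queue.get()
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
            data_queue.put(data)          # hand web-page data to a crawler
        except OSError:
            pass                          # the real program logs and retries
        finally:
            url_queue.task_done()

def crawler():
    while True:
        data = data_queue.get()
        # Parse out the links and hand them back to the fetchers.
        for link in re.findall(rb'href="(http[^"]+)"', data):
            url_queue.put(link.decode())
        data_queue.task_done()

# Spawn the "tracker" threads up front, before the crawl starts.
for func in (fetcher, crawler):
    threading.Thread(target=func, daemon=True).start()

url_queue.put("http://www.python.org/")  # the starting URL

# The main thread polls until both queues go idle (see 3.7 below).
while url_queue.unfinished_tasks or data_queue.unfinished_tasks:
    time.sleep(2)
```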

3.5. How many different queues of information flow are there?
There are two queues of data flow, the URL queue and the data queue.

The crawlers feed the URL queue and feed off the data queue. The fetchers feed the data queue and feed off the URL queue. (Here, to feed = to post data to; to feed off = to get data from.)

3.6. What are "worker" (downloader) threads?
Apart from the tracker threads, you can specify additional threads to take charge of downloading URLs. The URLs can be downloaded in these threads instead of consuming the time of the fetcher threads.

These threads are launched a priori, similar to the tracker threads, before the start of the crawl. By default, HarvestMan launches a set of 10 of these worker threads, which are managed by a thread-pool object. The fetcher threads delegate the actual job of downloading to the workers. However, if the worker threads are disabled, the fetchers will do the downloads themselves.

These threads also die only at the end of a HarvestMan crawl.

3.7. How does a HarvestMan project finish?
(Make sure that you have read items 3.1 - 3.6 before reading this.)

As mentioned before, HarvestMan works by the co-operation of the crawler and fetcher families of tracker threads, each feeding on the data provided by the other.

A project nears its end when there are no more web-pages to crawl according to the configuration of the project. This means that the fetchers have less web-page data to fetch, which in turn dries up the data source for the crawlers. The crawlers in turn go idle, posting less data to the URL queue, which again dries up the data source for the fetchers. The synergy works in this phase also, just as it does when the project is active and all tracker threads are running.

After some time, all the tracker threads go idle, as there is no more data to feed from the queues. The main thread of HarvestMan enters a loop immediately after spawning all the tracker threads, spinning and checking for this idle condition every one or two seconds. Once it detects that all threads have gone idle, it ends the threads, performs post-download operations and cleanup, and brings the program to an end.

4. Protocols & File Types

4.1. What are the protocols supported by HarvestMan?
HarvestMan supports the following protocols:

 1. HTTP
 1. FTP

Support for the HTTPS (SSL) protocol depends on the Python version you are running. Python 2.3 and later have HTTPS support built in, so HarvestMan will support the HTTPS protocol if you are running it using Python 2.3 or a higher version.

The GOPHER and FILE:// protocols should also work with HarvestMan.

4.2. What kind of files can be downloaded by HarvestMan?
HarvestMan can download **any** kind of file, as long as it is served up by a web-server using HTTP/FTP/HTTPS. There are no restrictions on the type of file or the size of a single file.

HarvestMan assumes that URLs with the following extensions are web-pages, static or dynamic:

'.htm', '.html', '.shtm', '.shtml', '.php', '.php3', '.php4', '.asp', '.aspx', '.jsp', '.psp', '.pl', '.cgi', '.stx', '.cfm', '.cfml', '.cms'

A URL with no extension is also assumed to be a web-page. However, the program has a mechanism by which it looks at the headers of the HTTP response and figures out the actual file type of the URL by doing a mimetype analysis. This happens immediately after the HTTP request is answered by the server. So if the program finds that the assumed type of a URL is different from the actual type, it sets the type correctly at this point; the sketch below outlines the idea.

You can restrict download of certain files by creating specific filters for HarvestMan. These are described in a later section.
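
In outline, the extension assumption and the mimetype correction work as sketched here (modern Python, hypothetical helper names; not HarvestMan's actual code):
```
import urllib.request

WEBPAGE_EXTNS = ('.htm', '.html', '.shtm', '.shtml', '.php', '.php3',
                 '.php4', '.asp', '.aspx', '.jsp', '.psp', '.pl', '.cgi',
                 '.stx', '.cfm', '.cfml', '.cms')

def assumed_webpage(url):
    """Guess from the URL alone: a known extension, or no extension at all."""
    last = url.rstrip('/').rsplit('/', 1)[-1]
    return '.' not in last or last.lower().endswith(WEBPAGE_EXTNS)

def actual_webpage(url):
    """Correct that guess using the server's Content-Type response header."""
    with urllib.request.urlopen(url, timeout=10) as response:
        ctype = response.headers.get_content_type()
    return ctype in ('text/html', 'application/xhtml+xml')
```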

A related question is the HTML tags supported by HarvestMan, i.e. the tags from which it extracts the links it downloads. These are listed below.

 1. Hypertext links of the form `<a href="...">`.
 1. Image links of the form `<img src="...">`.
 1. Stylesheet links of the form `<link rel="stylesheet" href="...">`.
 1. Javascript source files of the form `<script src="...">`.
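
As an illustration of how these four tag types can be picked out of a page, here is a short sketch using Python's standard HTMLParser (modern Python; this is not HarvestMan's actual parser):
```
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect link targets from the four tag types listed above."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:             # hypertext links
            self.links.append(attrs['href'])
        elif tag == 'img' and 'src' in attrs:          # image links
            self.links.append(attrs['src'])
        elif tag == 'link' and attrs.get('rel') == 'stylesheet':
            self.links.append(attrs.get('href'))       # stylesheets
        elif tag == 'script' and 'src' in attrs:       # javascript sources
            self.links.append(attrs['src'])

parser = LinkExtractor()
parser.feed('<a href="a.html"><img src="b.png">'
            '<link rel="stylesheet" href="c.css"><script src="d.js"></script>')
print(parser.links)   # ['a.html', 'b.png', 'c.css', 'd.js']
```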