diff --git a/AboutHarvestMan.md b/AboutHarvestMan.md
new file mode 100644
index 0000000..aa91205
--- /dev/null
+++ b/AboutHarvestMan.md
@@ -0,0 +1,33 @@
# What is HarvestMan #

HarvestMan is an open source, multi-threaded, modular, extensible web crawler program/framework in pure Python.

HarvestMan can be used to download files from websites, according to a number of user-specified rules. The latest version of HarvestMan supports more than 60 customization options. HarvestMan is a console (command-line) application.

HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.

# History of HarvestMan #
 1. The HarvestMan crawler was started by Anand B Pillai in June 2003 as a hobby project to develop a personal web crawler in Python, along with Nirmal Chidambaram.
 1. Nirmal wrote the original code in mid June 2003 (one module, a single-threaded crawler), which Anand improved substantially and developed into a multithreaded crawler.
 1. The first version (0.8) was released by Anand in July 2003.
 1. Released on [freshmeat](http://www.freshmeat.net/projects/harvestman) (1.3) in Dec 2003.
 1. Eight releases were done between Dec 2003 (1.3) and Dec 2004 (1.4).
 1. The project was chosen as the crawler for the [EIAO](http://www.eiao.net) web accessibility observatory in Feb 2005. EIAO chose version 1.4, which then underwent several minor releases.
 1. The most recent release is 1.4.6, released in Sep 2005.
 1. Since early 2006, HarvestMan has been undergoing development along with the EIAO project (mostly driven by EIAO feedback), but no public releases have been made.
 1. Version 1.5 started development in mid 2006, but was never released.
 1. Version 1.4.6 was accepted into Debian in March 2006.
 1. By mid 2007 the program had accumulated so many changes that the version number under development was incremented from 1.5 to 2.0. Version 2.0 has effectively been under development since mid 2006, but most code changes happened after mid 2007.
 1. Version 1.4.6 entered the Ubuntu repositories in May 2007.
 1. Development was hosted at [BerliOS](http://developer.berlios.de/projects/harvestman) till June 2008, when it was moved to Google Code.
 1. Contributors to 2.0 till June 2008 - Anand B Pillai (main), Nils Ulltveit-Moe (EIAO), Morten Goodwin Olsen (EIAO), John Kleven.
 1. Version 2.0 alpha package releases started in Aug 2007 on the website.
 1. HarvestMan won the FOSS India Award in April 2008.
 1. In June 2008, Lukasz Szybalski joined the team.

## Future of HarvestMan ##
> It is a brave new world out there... :-)
> Well, currently the development stands at 2.0.5 beta, i.e. the 2.0 version is still
> not completed. The development is slow and I need to take time off from a regular
> job to do this, so well, can't give a final date for this, but hopefully one day
> it will be done :)
\ No newline at end of file
diff --git a/ConfigXml.md b/ConfigXml.md
new file mode 100644
index 0000000..09019ea
--- /dev/null
+++ b/ConfigXml.md
@@ -0,0 +1,108 @@
# Harvestman config.xml #
## Configuration File Structure ##

The configuration file is divided into categories that split the configuration options into different sections. At present, the configuration file has the following namespaces:

 1. **project** - This section holds the options related to the current HarvestMan project.
 1. **network** - This section holds the configuration options related to your network connection.
 1. **download** - This section holds configuration options that affect your downloads in a generic way.
 1. **control** - This section is similar to the above one, but holds options that affect your downloads in a much more specific way. This is a kind of 'tweak' section that allows you to exert more fine-grained control over your projects.
 1. **system** - This section controls the threading options, regional (locale) options and any other options related to the Python interpreter and your computer.
 1. **indexer** - This section holds variables related to how the files are processed after downloading. Right now it holds variables related to localizing links.
 1. **files** - This section holds variables that control the files created by HarvestMan, namely the error log, the message log and an optional URL log.
 1. **display** - This holds a single variable related to creating a browser page for all HarvestMan projects on your computer.

## Control Section ##
### fetchlevel ###

HarvestMan defines five fetchlevels with values ranging from 0 to 4 inclusive. These define the rules for the download of files from servers other than the server of the starting URL. In general, **increasing the fetch level allows the program to crawl more files** on the Internet.

A fetchlevel of "0" provides the maximum constraint for the download. This limits the download of files to all paths in the starting server, only inside and below the directory of the starting URL.

For example, with a fetchlevel of zero, if your starting URL is http://www.foo.com/bar/images/images.html, the program will download only those files inside the `images` sub-directory and directories below it, and no other files.

The next level, a fetch level of "1", again limits the download to the starting server (and sub-domains in it, if the sub-domain variable is not set), but does not allow it to crawl sites other than the starting server. In the above example, this will fetch all links in the server http://www.foo.com encountered in the starting page.

A fetch level of "2" performs a fetching of all links in the starting server encountered in the starting URL, as well as any links in outside (external) servers linked directly from pages in the starting server. It does not allow the program to crawl pages linked further away, i.e. the second-level links linked from the external servers.

A fetch level of "3" performs a similar operation, with the main difference that it acts like a combination of fetchlevels "0" and "2" minus "1". That is, it gets all links under the directory of the starting URL and first-level external links, but does not fetch links outside the directory of the starting URL.

A fetch level of "4" gives the user no control over the levels of fetching, and the program will crawl whichever link is available to it, unless limited by other download control options like depth control, domain filters, URL filters, file limits, maximum server limits etc.

Place the parameter in the **control** element, under the **extent** section.
Here is a sample XML element including this new param.
```
<control>
  ...
  <extent>
    ...
    <fetchlevel value="0"/>
    ...
  </extent>
  ...
</control>
```

**The value can be 0, 1, 2, 3 or 4.**

See the FAQ for more explanations.

### maxbandwidth ###
MaxBandwidth controls the speed of crawling. Throttling of bandwidth is useful when you are downloading a large amount of data from a host. MaxBandwidth helps prevent the "denial of service" that an uncontrolled crawl could impose on the crawled server.
By using this configuration variable you can limit your download speed, for example to 5 kb per second. At this speed the host should not have any problems serving your crawl and will be able to proceed with its normal operations.

Place the parameter in the **control** element, under the **limits** section.
Here is a sample XML element including this new param.
```
<control>
  ...
  <limits>
    ...
    <maxbandwidth value="5"/>
    ...
  </limits>
  ...
</control>
```

**The value needs to be specified in kb/sec, not in bytes/sec.**

### maxbytes ###
MaxBytes controls how many bytes your crawl will download. It is useful when you are downloading a large amount of data from a host and, in conjunction with MaxBandwidth, want to limit how much data is downloaded. By using this configuration variable together with maxbandwidth you can, for example, set your crawl to download 10 MB at 5 kb/s. With this fine-grained control of your download size and speed, the host should not have any problems serving your crawl and will be able to proceed with its normal operations.

Place the parameter in the **control** element, under the **limits** section.
Here is a sample XML element including this new param.
```
<control>
  ...
  <limits>
    ...
    <maxbytes value="10MB"/>
    ...
  </limits>
  ...
</control>
```

**The value accepts plain numbers (assumed to be bytes), KB, MB and GB.**
```
<maxbytes value="5000"/>  - End crawl at 5000 bytes
<maxbytes value="10kb"/>  - End crawl at 10 KB
<maxbytes value="50MB"/>  - End crawl at 50 MB
<maxbytes value="1GB"/>   - End crawl at 1 GB
```
\ No newline at end of file
diff --git a/FAQ.md b/FAQ.md
new file mode 100644
index 0000000..dbf965b
--- /dev/null
+++ b/FAQ.md
@@ -0,0 +1,1068 @@
## This is still a work in progress and has been lifted verbatim from the HarvestMan web-site with little or no modification. A lot of the information is out of date and needs to be updated. Also, the FAQ doesn't conform to wiki style, so proceed with care! ##

HarvestMan - FAQ
Version 2.0
NOTE: The FAQ is currently being modified to be in sync with HarvestMan 1.4, so you might find that some parts of the FAQ are inconsistent with the rest of it. This is because some of the FAQ has been modified, while the rest is still to be modified.

  * 1. Overview
    o 1.1. What is HarvestMan?
    o 1.2. Why do you call it HarvestMan?
    o 1.3. What HarvestMan can be used for?
    o 1.4. What HarvestMan cannot be used for...
    o 1.5. What do I need to run HarvestMan?
  * 2. Usage
    o 2.1. What is the HarvestMan Configuration File?
    o 2.2. Can HarvestMan be run as a command-line application?
  * 3. Architecture
    o 3.1. What are "tracker" threads and what is their function?
    o 3.2. What are "crawler" threads?
    o 3.3. What are "fetcher" threads?
    o 3.4. How do the crawlers and fetchers co-operate?
    o 3.5. How many different Queues of information flow are there?
    o 3.6. What are worker (downloader) threads?
    o 3.7. How does a HarvestMan project finish?
  * 4. Protocols & File Types
    o 4.1. What are the protocols supported by HarvestMan?
    o 4.2. What kind of files can be downloaded by HarvestMan?
    o 4.3. Can HarvestMan run javascript code?
    o 4.4. Can HarvestMan run java applets?
    o 4.5. How to prevent downloading of HTML & CGI forms?
    o 4.6. Does HarvestMan download dynamically generated cgi files (server-side)?
    o 4.7. How does HarvestMan determine the filetype of dynamically generated server side files?
    o 4.8. Does HarvestMan obey the Robots Exclusion Protocol?
    o 4.9. Can I restart a project to download links that failed (Caching Mechanism)?
  * 5. Network, Security & Access Rules
    o 5.1. Can HarvestMan work across proxies?
    o 5.2. Does HarvestMan support proxy authentication?
    o 5.3. Does HarvestMan work inside an intranet?
    o 5.4. Can HarvestMan crawl a site that requires HTTP authentication?
    o 5.5. Can HarvestMan crawl a site that requires HTTPS (SSL) authentication?
    o 5.6. Can I prevent the program from accessing specific domains?
    o 5.7. Can I specify download filters to prevent download of certain files or directories on a server?
    o 5.8. Is it possible to control the depth of traversal in a domain?
  * 6. Download Control - Basic
    o 6.1. Can I set a limit on the maximum number of files that are downloaded?
    o 6.2. Can I set a limit on the number of external servers crawled?
    o 6.3. Can I set a limit on the number of outside directories that are crawled?
    o 6.4. How can I prevent download of images?
    o 6.5. How can I prevent download of stylesheets?
    o 6.6. How to disable traversal of external servers?
    o 6.7. Can I specify a project timeout?
    o 6.8. Can I specify a thread timeout for worker threads?
    o 6.9. How to tell the program to retry failed links?
  * 7. Download Control - Advanced
    o 7.1. What are fetchlevels and how can I use them?
  * 8. Application development & customization
    o 8.1. I want to customize HarvestMan for a research project. Can you help out?
    o 8.2. I want to customize HarvestMan for a commercial project. Can you help out?
  * 9. Diagrams
    o 9.1. HarvestMan Class Diagram

1. Overview

1.1. What is HarvestMan?
HarvestMan (with a capital 'H' and a capital 'M') is a webcrawler program. HarvestMan belongs to a family of programs frequently addressed as webcrawlers, webbots, web-robots, offline browsers etc.

These programs are used to crawl a distributed network of computers like the Internet and download files locally.

1.2. Why do you call it HarvestMan?
The name "HarvestMan" is derived from a kind of small spider found in different parts of the world, called "Daddy long legs" or Opiliones.

Since this program is a web-spider, the analogy was compelling to name it after some species of spider. The process of downloading data from websites is also sometimes called harvesting.

Both these similarities gave rise to the name HarvestMan.

1.3. What HarvestMan can be used for?
HarvestMan is a desktop tool for web search/data gathering. It works on the client side.

As of the most recent version, HarvestMan can be used to:

 1. Download a website or a part of it.
 2. Download certain files from a website (matching certain patterns).
 3. Search a website for keywords & download the files containing them.
 4. Scan a website for links and download them specifically using filters.

1.4. What HarvestMan cannot be used for...
HarvestMan is a small-to-medium size web-crawler mostly intended for personal use or for use by a small group. It cannot be used for massive data harvesting from the web. However, a project to create a large-scale, distributed web crawler based on HarvestMan is underway. It is called 'Distributed HarvestMan', or 'D-HarvestMan' in short. D-HarvestMan is currently at a prototype stage.

Projects like EIAO have been able to customize HarvestMan for medium-to-large scale data gathering from the Internet. The EIAO project uses HarvestMan to download as much as 100,000 files from European websites daily.

What HarvestMan is not:

 1. HarvestMan is not an Internet search engine.
 2. HarvestMan is not an indexer or taxonomy tool for web documents.
 3. HarvestMan is not a server-side program.

1.5. What do I need to run HarvestMan?
HarvestMan is written in a programming language called Python. Python is an interactive, interpreted, object-oriented programming language created by Guido van Rossum and maintained by a team of volunteers from all over the world. Python is a very versatile language which can be used for a variety of tasks ranging from scripting to web frameworks to developing highly complex applications.

HarvestMan is written completely in Python. It works with Python version 2.3 upward on all platforms where Python runs. However, HarvestMan has some performance optimizations that require the latest version of Python, which is Python 2.4, so that is the suggested version. HarvestMan will also work with Python 2.3, but with reduced performance.

You need a machine with a rather large amount of RAM to run HarvestMan. HarvestMan tends to use system memory heavily, especially when performing large data downloads or when run with more than 10 threads. It is preferable to have a machine with 512 MB RAM and a fast CPU (Intel Pentium IV or higher) to run HarvestMan efficiently.

2. Usage

2.1. How do I run HarvestMan?
HarvestMan is a command-line application. It has no GUI.

From the 1.4 version, HarvestMan can be run by calling the main HarvestMan module as an executable script on the command-line as follows:

% harvestman.py

This will work provided that you have edited your environment PATH variable to include the local HarvestMan installation directory on your machine. If you have not, you can run HarvestMan by passing the harvestman.py module as an argument to the Python interpreter, as follows:

% python harvestman.py

On Win32 systems, if you have associated the ".py" extension with the appropriate python.exe, you can run HarvestMan without invoking the interpreter explicitly.

Note that this assumes that you have a config file named config.xml in the directory from where you invoke HarvestMan. If you don't have a config file locally, you need to use the command-line options of HarvestMan to pass a different configuration file to the program.

2.2. What is the HarvestMan Configuration (config) file?

The standard way to run HarvestMan is to run the program with no arguments, allowing it to pick up its configuration parameters from an XML configuration file which is named config.xml by default.

It is also possible to pass command-line options to HarvestMan. HarvestMan supports a limited set of command-line options which allow you to run the program without using a configuration file. You can learn more about the command-line options in the HarvestMan command-line options FAQ.

The HarvestMan configuration is an XML file with the configuration options split into different elements and their hierarchies.
A typical HarvestMan configuration file looks as follows:

    <HarvestMan>
      <config version="3.0" xmlversion="1.0">
        <project>
          <url>http://www.python.org/doc/current/tut/tut.html</url>
          <name>pytut</name>
          <basedir>~/websites</basedir>
          <verbosity value="3"/>
          <timeout value="600.0"/>
        </project>
        <network>
          <proxy>
            <proxyserver></proxyserver>
            <proxyuser></proxyuser>
            <proxypasswd></proxypasswd>
            <proxyport value=""/>
          </proxy>
          <urlserver status="0">
            <urlhost>localhost</urlhost>
            <urlport value="3081"/>
          </urlserver>
        </network>
        <download>
          <types>
            <html value="1"/>
            <images value="1"/>
            <javascript value="1"/>
            <javaapplet value="1"/>
            <forms value="0"/>
            <cookies value="1"/>
          </types>
          <cache status="1">
            <datacache value="1"/>
          </cache>
          <misc>
            <retries value="1"/>
            <tidyhtml value="1"/>
          </misc>
        </download>
        <control>
          <links>
            <imagelinks value="1"/>
            <stylesheetlinks value="1"/>
          </links>
          <extent>
            <fetchlevel value="0"/>
            <extserverlinks value="0"/>
            <extpagelinks value="1"/>
            <depth value="10"/>
            <extdepth value="0"/>
            <subdomain value="0"/>
          </extent>
          <limits>
            <maxextservers value="0"/>
            <maxextdirs value="0"/>
            <maxfiles value="5000"/>
            <maxfilesize value="1048576"/>
            <connections value="5"/>
            <requests value="5"/>
            <timelimit value="-1"/>
          </limits>
          <rules>
            <robots value="1"/>
            <urlpriority></urlpriority>
            <serverpriority></serverpriority>
          </rules>
          <filters>
            <urlfilter></urlfilter>
            <serverfilter></serverfilter>
            <wordfilter></wordfilter>
            <junkfilter value="0"/>
          </filters>
        </control>
        <system>
          <workers status="1" size="10" timeout="200"/>
          <trackers value="4"/>
          <locale>american</locale>
          <fastmode value="1"/>
        </system>
        <files>
          <urllistfile></urllistfile>
          <urltreefile></urltreefile>
        </files>
        <indexer>
          <localise value="2"/>
        </indexer>
        <display>
          <browsepage value="1"/>
        </display>
      </config>
    </HarvestMan>

The current configuration file holds more than 60 configuration options. The variables that are essential to a project are project.url, project.name and project.basedir. These determine the identity of a HarvestMan crawl and normally require unique values for each HarvestMan project.

For a more detailed discussion of the config file, see the ConfigXml page.

2.3. Can HarvestMan be run as a command-line application?
Yes, it can. For details on this, refer to the Command line FAQ.

3. Architecture

3.1. HarvestMan is a multithreaded program. What is the threading architecture of HarvestMan?
HarvestMan uses a multithreaded architecture. It assigns each thread specific functions which help the program to complete the downloads at a relatively fast pace.

HarvestMan is a network-bound program. This means that most of the program's time is spent waiting for network connections, fetching network data and closing the connections. HarvestMan can be considered to be not IO-bound, since we can assume that there is ample disk space for the downloads, at least in most common cases.

Whenever a program is network-bound or IO-bound, it helps to split the task into multiple threads of control, which perform their function without affecting other threads or the main thread.

HarvestMan uses this theory to create a multithreaded system of co-operating threads, most of which gather data from the network, process the data and write the files to the disk. These threads are called tracker threads. The name is derived from the fact that a thread tracks a web-page, downloads its links and further tracks each of the pages pointed to by the links, doing this recursively for each link.

HarvestMan uses a pre-emptive threaded architecture where trackers are launched when the program starts. They wait in turns for work, which is managed by a thread-safe Queue of data. Tracker threads post and retrieve data from the queue. These threads die only at the end of the program, spinning in a loop otherwise, looking for data.

There are two different kinds of trackers, namely crawlers and fetchers. These are described in the sections below.

3.2. What are "crawler" threads?
Crawlers, or crawler-threads, are trackers which perform the specific function of parsing a web-page. They parse the data from a web-page, extract the links, and post the links to a url queue.

The crawlers get their data from a data queue.

3.3. What are "fetcher" threads?
Fetchers, or fetcher-threads, are trackers which perform the function of "fetching", i.e. downloading the files pointed to by urls. They download URLs which do not produce web-page content (HTML/XHTML), statically or dynamically, i.e. non-webpage URLs such as images, pdf files, zip files etc.

The fetchers get their data from the url queue and they post web-page data to the data queue.

3.4. How do the crawlers and fetchers co-operate?
The design of HarvestMan forces the crawlers and fetchers to be synergic. This is because the crawlers obtain their data (web-page data) from the data queue and post their results to the url queue. The fetchers in turn obtain their data (urls) from the url queue and post their results to the data queue.

The program starts off by spawning the first thread, which is a fetcher. It gets the web-page data for the starting page and posts it to the data queue. The first crawler in line gets this data, parses it and extracts the links, posting them to the url queue. The next fetcher thread waiting on the url queue gets this data, and the process repeats in a synergic manner, till the program runs out of urls to parse, when the project ends.

3.5. How many different Queues of information flow are there?
There are two queues of data flow, the url queue and the data queue.

The crawlers feed the url queue and feed-off the data queue.
The fetchers feed the data queue and feed-off the url queue.

(feed = post data to, feed-off = get data from)

3.6. What are "worker" (downloader) threads?
Apart from the tracker threads, you can specify additional threads to take charge of downloading urls.
The urls can be downloaded in these threads instead of consuming the time of the fetcher threads.

These threads are launched 'a priori', similar to the tracker threads, before the start of the crawl. By default, HarvestMan launches a set of 10 of these worker threads, which are managed by a thread pool object. The fetcher threads delegate the actual job of downloading to the workers. However, if the worker threads are disabled, the fetchers will do the downloads themselves.

These threads also die only at the end of a HarvestMan crawl.

3.7. How does a HarvestMan project finish?
(Make sure that you have read items 3.1 - 3.6 before reading this.)

As mentioned before, HarvestMan works by the co-operation of the crawler and fetcher families of tracker threads, each feeding on the data provided by the other.

A project nears its end when there are no more web-pages to crawl according to the configuration of the project. This means that the fetchers have less web-page data to fetch, which in turn dries up the data source for the crawlers. The crawlers in turn go idle, thus posting less data to the url queue, which again dries up the data source for the fetchers. The synergy works in this phase also, as it does when the project is active and all tracker threads are running.

After some time, all the tracker threads go idle, as there is no more data to feed from the queues. In the main thread of the HarvestMan program, there is a loop that spins continuously, checking for this event. Once all threads go idle, the loop detects it and exits; the project (and the program) comes to a halt.

The HarvestMan main thread enters this loop immediately after spawning all the tracker threads and waits in the loop till the project is done. It checks for the idle condition every 1 or 2 seconds, spinning in a loop. Once it detects that all threads have gone idle, it ends the threads, performs post-download operations, cleanup etc. and brings the program to an end.

4. Protocols & File Types

4.1. What are the protocols supported by HarvestMan?

HarvestMan supports the following protocols:

 1. HTTP
 2. FTP

Support for the HTTPS (SSL) protocol depends on the Python version you are running. Python 2.3 and later has HTTPS support built into Python, so HarvestMan will support the HTTPS protocol if you are running it using Python 2.3 or higher versions.

The GOPHER and FILE:// protocols should also work with HarvestMan.

4.2. What kind of files can be downloaded by HarvestMan?
HarvestMan can download **any** kind of file as long as it is served up by a web-server using HTTP/FTP/HTTPS. There are no restrictions on the type of file or the size of a single file.

HarvestMan assumes that URLs with the following extensions are web-pages, static or dynamic:

'.htm', '.html', '.shtm', '.shtml', '.php', '.php3', '.php4', '.asp', '.aspx', '.jsp', '.psp', '.pl', '.cgi', '.stx', '.cfm', '.cfml', '.cms'

A URL with no extension is also assumed to be a web-page. However, the program has a mechanism by which it looks at the headers of the HTTP response and figures out the actual file type of the URL by doing a mimetype analysis. This happens immediately after the HTTP request is answered by the server. So if the program finds that the assumed type of a URL is different from the actual type, it sets the type correctly at this point.

You can restrict download of certain files by creating specific filters for HarvestMan. These are described in the download control sections below.
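The two-step type detection described above (guess "web-page" from the URL extension, then correct the guess once the server's Content-Type header is known) can be sketched roughly as follows. This is an illustrative Python 2 sketch, matching the Python version the FAQ targets; the helper names used here are made up for the example and are not HarvestMan's actual API.

```
# Minimal sketch of extension-based guessing plus Content-Type correction.
# Names (WEBPAGE_EXTNS, assumed_webpage, corrected_webpage) are illustrative only.

import posixpath
from urlparse import urlparse    # Python 2.x, matching the rest of the code base

# Extensions the FAQ lists as "assumed to be web-pages"
WEBPAGE_EXTNS = ('.htm', '.html', '.shtm', '.shtml', '.php', '.php3', '.php4',
                 '.asp', '.aspx', '.jsp', '.psp', '.pl', '.cgi', '.stx',
                 '.cfm', '.cfml', '.cms')

def assumed_webpage(url):
    """ Initial guess: web-page if the extension is in the list above, or missing """
    path = urlparse(url)[2]
    ext = posixpath.splitext(path)[1].lower()
    return ext == '' or ext in WEBPAGE_EXTNS

def corrected_webpage(content_type):
    """ Once the server answers, the Content-Type header settles the question """
    return content_type.split(';')[0].strip().lower() in ('text/html',
                                                          'application/xhtml+xml')

# A '.php' URL is assumed to be a web-page; if the server actually returns an
# image, the guess is corrected before any parsing is attempted.
print assumed_webpage('http://www.foo.com/photo.php')      # True
print corrected_webpage('image/png; charset=binary')       # False
```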
A related question is the HTML tags supported by HarvestMan, using which it downloads files. These are listed below, followed by a small parsing sketch.

 1. Hypertext links of the form `<a href="...">`.
 2. Image links of the form `<img src="...">`.
 3. Stylesheets of the form `<link rel="stylesheet" href="...">`.
 4. Javascript source files of the form `<script src="...">`.
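As a rough illustration only (this is not HarvestMan's own parser, which lives in its pageparser module under harvestman/lib), the four tag/attribute pairs above can be collected with the standard HTMLParser module, the same module the datafilter plugin elsewhere in this repository uses:

```
# Illustrative sketch: collect the link-carrying attributes of the four tag types
# listed above with the standard library HTMLParser module (Python 2.x).
# HarvestMan's real parsers handle many more tags and corner cases.

from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):

    # tag -> attribute that carries the URL
    LINK_ATTRS = {'a': 'href', 'img': 'src', 'link': 'href', 'script': 'src'}

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTRS.get(tag)
        if not wanted:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append((tag, value))

p = LinkCollector()
p.feed('<a href="a.html">x</a><img src="b.png"/>'
       '<link rel="stylesheet" href="c.css"/><script src="d.js"></script>')
p.close()
print p.links
# [('a', 'a.html'), ('img', 'b.png'), ('link', 'c.css'), ('script', 'd.js')]
```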
diff --git a/HarvestMan-lite/harvestman/bugs/soskut_hu_index.html b/HarvestMan-lite/harvestman/bugs/soskut_hu_index.html
deleted file mode 100755
index 20fe7b4..0000000
--- a/HarvestMan-lite/harvestman/bugs/soskut_hu_index.html
+++ /dev/null
@@ -1,255 +0,0 @@
- Sóskút
- - - diff --git a/HarvestMan-lite/harvestman/bugs/test_808.py b/HarvestMan-lite/harvestman/bugs/test_808.py deleted file mode 100755 index 251830d..0000000 --- a/HarvestMan-lite/harvestman/bugs/test_808.py +++ /dev/null @@ -1,40 +0,0 @@ -# Demoing fix for #808. -# 808: Crawler should try and parse links in "select" options in HTML -# forms. -# Bug: http://trac.eiao.net/cgi-bin/trac.cgi/ticket/808 -import sys -sys.path.append('..') -from lib import pageparser -from lib import config -from lib import logger -from lib.common.common import * - - -SetAlias(config.HarvestManStateObject()) -SetAlias(logger.HarvestManLogger()) - -# First parse with sgmlop parser with option parsing disabled... -print 'Testing with sgmlop parser...' -p = pageparser.HarvestManSGMLOpParser() -p.feed(open('s_municipaux.htm').read()) -print 'Asserting link count with option tag disabled...' -assert(len(p.links)==18) - -# Now turn on option tag parsing -p.enable_feature('option') -p.feed(open('s_municipaux.htm').read()) -print 'Asserting link count with option tag enabled...' -assert(len(p.links)==31) - -print 'Testing with pure Python parser...' -p = pageparser.HarvestManSimpleParser() -p.disable_feature('option') -p.feed(open('s_municipaux.htm').read()) -print 'Asserting link count with option tag disabled...' -assert(len(p.links)==18) - -# Now turn on option tag parsing -p.enable_feature('option') -p.feed(open('s_municipaux.htm').read()) -print 'Asserting link count with option tag enabled...' -assert(len(p.links)==31) diff --git a/HarvestMan-lite/harvestman/bugs/test_812.py b/HarvestMan-lite/harvestman/bugs/test_812.py deleted file mode 100755 index 36cad66..0000000 --- a/HarvestMan-lite/harvestman/bugs/test_812.py +++ /dev/null @@ -1,49 +0,0 @@ -# Demoing fix for EIAO bug #812. -# 812: Crawler does not identify links with arguments containing "#". -# Bug: http://trac.eiao.net/cgi-bin/trac.cgi/ticket/812 - -import sys -sys.path.append('..') -from lib import pageparser -from lib import config -from lib import logger -from lib.common.common import * -from lib import urltypes - -class Url(str): - - def __init__(self, link): - self.url = link[1] - self.typ = link[0] - - def __eq__(self, item): - return item == self.url - -SetAlias(config.HarvestManStateObject()) -SetAlias(logger.HarvestManLogger()) - -cfg = objects.config -cfg.getquerylinks = True - -p = pageparser.HarvestManSGMLOpParser() -p.feed(open('soskut_hu_index.html').read()) - -urls = [] -for link in p.links: - urls.append(Url(link)) - -print urls - -test_urls = ['?module=municip#MIDDLE', - '?module=institutes#MIDDLE', - '?module=regulations#MIDDLE', - '?module=events#MIDDLE'] - -for turl in test_urls: - print 'Asserting',turl - assert(turl in urls) - -for url in urls: - if url in test_urls: - print 'Asserting type of',turl - assert(url.typ == urltypes.URL_TYPE_ANY and url.typ != urltypes.URL_TYPE_ANCHOR) diff --git a/HarvestMan-lite/harvestman/dev/__init__.py b/HarvestMan-lite/harvestman/dev/__init__.py deleted file mode 100755 index e69de29..0000000 diff --git a/HarvestMan-lite/harvestman/dev/filethread.py b/HarvestMan-lite/harvestman/dev/filethread.py deleted file mode 100755 index f6d1b80..0000000 --- a/HarvestMan-lite/harvestman/dev/filethread.py +++ /dev/null @@ -1,112 +0,0 @@ -# -- coding: utf-8 -""" filethread.py - File saver thread module. - This module is part of the HarvestMan program. - - Author: Anand B Pillai - - Copyright (C) 2007 Anand B Pillai. - -""" - -# Currently no code from this module is being used anywhere -# in the program. 
- -import threading -from common.common import * -from common.singleton import Singleton -import sys, os -import shutil -from Queue import Queue - -class FileQueue(Queue, Singleton): - """ File saver queue class """ - - def push(self, filename, directory, url, datastring): - self.put((filename, directory, url, datastring)) - -class HarvestManFileThread(threading.Thread): - """ File saver thread """ - - def __init__(self): - self.q = FileQueue.getInstance() - self._flag = False - self._cfg = objects.config - threading.Thread.__init__(self, None, None, 'Saver') - - def _write_url_filename(self, data, filename): - """ Write downloaded data to the passed file """ - - try: - extrainfo('Writing file ', filename) - f=open(filename, 'wb') - # print 'Data len=>',len(self._data) - f.write(data.getvalue()) - f.close() - except IOError,e: - debug('IO Exception' , str(e)) - return 0 - except ValueError, e: - return 0 - - return 1 - - def stop(self): - self._flag = True - - def run(self): - - while not self._flag: - item = self.q.get() - if item: - filename, directory, url, datastring = item - if self.create_local_directory(directory) == 0: - self._write_url_filename( datastring, filename ) - else: - extrainfo("Error in creating local directory for", url) - - def create_local_directory(self, directory): - """ Create the directories on the disk named 'directory' """ - - # new in 1.4.5 b1 - No need to create the - # directory for raw saves using the nocrawl - # option. - if self._cfg.rawsave: - return 0 - - try: - # Fix for EIAO bug #491 - # Sometimes, however had we try, certain links - # will be saved as files, whereas they might be - # in fact directories. In such cases, check if this - # is a file, then create a folder of the same name - # and move the file as index.html to it. 
- path = directory - while path: - if os.path.isfile(path): - # Rename file to file.tmp - fname = path - os.rename(fname, fname + '.tmp') - # Now make the directory - os.makedirs(path) - # If successful, move the renamed file as index.html to it - if os.path.isdir(path): - fname = fname + '.tmp' - shutil.move(fname, os.path.join(path, 'index.html')) - - path2 = os.path.dirname(path) - # If we hit the root, break - if path2 == path: break - path = path2 - - if not os.path.isdir(directory): - os.makedirs( directory ) - extrainfo("Created => ", directory) - return 0 - except OSError: - moreinfo("Error in creating directory", directory) - return -1 - - return 0 - - - diff --git a/HarvestMan-lite/harvestman/dev/sqlite_test.py b/HarvestMan-lite/harvestman/dev/sqlite_test.py deleted file mode 100755 index 58a9fe8..0000000 --- a/HarvestMan-lite/harvestman/dev/sqlite_test.py +++ /dev/null @@ -1,28 +0,0 @@ -import sqlite3 - -class Point(object): - - def __init__(self, x, y): - self.x, self.y = x, y - - def __conform__(self, protocol): - if protocol is sqlite3.PrepareProtocol: - return '%f;%f' % (self.x, self.y) - -con = sqlite3.connect("test") -c = con.cursor() - -p = Point(5.0, 3.5) - -c.execute("drop table points") -c.execute("create table points (point text)") -#cur.execute("select ?", (p,)) -#print cur.fetchone()[0] -c.execute("insert into points values (?)", (p,)) - -c.execute("select * from points") -print c.fetchall() - -c.close() - - diff --git a/HarvestMan-lite/harvestman/dev/sqlite_test2.py b/HarvestMan-lite/harvestman/dev/sqlite_test2.py deleted file mode 100755 index 5b143bd..0000000 --- a/HarvestMan-lite/harvestman/dev/sqlite_test2.py +++ /dev/null @@ -1,24 +0,0 @@ -import sqlite3 -import datetime, time - -def adapt_datetime(ts): - return time.mktime(ts.timetuple()) - -sqlite3.register_adapter(datetime.datetime, adapt_datetime) - -con = sqlite3.connect("test") -c = con.cursor() - -now = datetime.datetime.now() -c.execute("drop table if exists times") -c.execute("create table times (time real)") -#cur.execute("select ?", (p,)) -#print cur.fetchone()[0] -c.execute("insert into times values (?)", (now,)) - -c.execute("select * from times") -print c.fetchall() - -c.close() - - diff --git a/HarvestMan-lite/harvestman/dev/sqlite_test3.py b/HarvestMan-lite/harvestman/dev/sqlite_test3.py deleted file mode 100755 index 25f6d5c..0000000 --- a/HarvestMan-lite/harvestman/dev/sqlite_test3.py +++ /dev/null @@ -1,27 +0,0 @@ -import sqlite3 -import datetime, time - -def adapt_datetime(ts): - return time.mktime(ts.timetuple()) - -sqlite3.register_adapter(datetime.datetime, adapt_datetime) - -con = sqlite3.connect("test") -c = con.cursor() - -c.execute("drop table if exists projects") -c.execute("create table projects (id integer primary key autoincrement default 0, date real, project text)") -#cur.execute("select ?", (p,)) -#print cur.fetchone()[0] -c.execute("insert into projects (date, project) values (?, ?)", (datetime.datetime.now(), 'project1')) -time.sleep(1.0) -c.execute("insert into projects (date, project) values (?, ?)", (datetime.datetime.now(), 'project2')) -time.sleep(1.0) -c.execute("insert into projects (date, project) values (?, ?)", (datetime.datetime.now(), 'project3')) - -c.execute("select max(id) from projects") -print c.fetchone()[0] - -c.close() - - diff --git a/HarvestMan-lite/harvestman/dev/sqlite_test4.py b/HarvestMan-lite/harvestman/dev/sqlite_test4.py deleted file mode 100755 index 90b7690..0000000 --- a/HarvestMan-lite/harvestman/dev/sqlite_test4.py +++ /dev/null @@ -1,17 
+0,0 @@ -import sqlite3 -import datetime, time - -def adapt_datetime(ts): - return time.mktime(ts.timetuple()) - -sqlite3.register_adapter(datetime.datetime, adapt_datetime) - -con = sqlite3.connect("/home/anand/.harvestman/db/crawls.db") -c = con.cursor() - -c.execute("select * from project_stats") -print c.fetchall() - -c.close() - - diff --git a/HarvestMan-lite/harvestman/ext/__init__.py b/HarvestMan-lite/harvestman/ext/__init__.py deleted file mode 100755 index e69de29..0000000 diff --git a/HarvestMan-lite/harvestman/ext/datafilter.py b/HarvestMan-lite/harvestman/ext/datafilter.py deleted file mode 100755 index 3a47587..0000000 --- a/HarvestMan-lite/harvestman/ext/datafilter.py +++ /dev/null @@ -1,85 +0,0 @@ -# -- coding: utf-8 -""" Data filter plugin example based on the -simulator plugin for HarvestMan. This -plugin changes the behaviour of HarvestMan -to only simulate crawling without actually -downloading anything. In addition, it shows -how to get access to the data downloaded by the crawler, -to implement various kinds of data filters. - -Author: Anand B Pillai - -Created Feb 7 2007 Anand B Pillai -Modified Nov 2 2007 by: Nils Ulltveit-Moe - - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -from HTMLParser import HTMLParser - -class MyHTMLParser(HTMLParser): - # Example on a HTML parser, to filter img tags - - def handle_starttag(self, tag, attrs): - - # This just prints the image tag and its attributes - if tag=="img": - print tag,attrs - -def process_url(self, data): - """ Post process url callback test """ - # This shows how to get access to the - # downloaded HTML document that is being processed. - # Data is either HTML document or None - if data: - p = MyHTMLParser() - p.feed(data) - - return data - -def save_url(self, urlobj): - - # For simulation, we need to modify the behaviour - # of save_url function in HarvestManUrlConnector class. - # This is achieved by injecting this function as a plugin - # Note that the signatures of both functions have to - # be the same. - url = urlobj.get_full_url() - self.connect(urlobj, True, self._cfg.retryfailed) - - return 6 - -def apply_plugin(): - """ All plugin modules need to define this method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. - - cfg = objects.config - cfg.simulate = True - cfg.localise = 0 - - # Dummy function that does not really write the mirrored files. - hooks.register_plugin_function('connector:save_url_plugin', save_url) - - # Hook to get access to the downloaded data after process_url has been called. - hooks.register_post_callback_method('crawler:fetcher_process_url_callback', - process_url) - # Turn off caching, since no files are saved - cfg.pagecache = 0 - # Turn off header dumping, since no files are saved - cfg.urlheaders = 0 - logconsole('Simulation mode turned on. 
Crawl will be simulated and no files will be saved.') diff --git a/HarvestMan-lite/harvestman/ext/lucene.py b/HarvestMan-lite/harvestman/ext/lucene.py deleted file mode 100755 index d934afb..0000000 --- a/HarvestMan-lite/harvestman/ext/lucene.py +++ /dev/null @@ -1,132 +0,0 @@ -# -- coding: utf-8 -""" Lucene plugin to HarvestMan. This plugin modifies the -behaviour of HarvestMan to create an index of crawled -webpages by using PyLucene. - -Author: Anand B Pillai - -Created Aug 7 2007 Anand B Pillai - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -import PyLucene -import sys, os -import time - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -class PorterStemmerAnalyzer(object): - - def tokenStream(self, fieldName, reader): - - result = PyLucene.StandardTokenizer(reader) - result = PyLucene.StandardFilter(result) - result = PyLucene.LowerCaseFilter(result) - result = PyLucene.PorterStemFilter(result) - result = PyLucene.StopFilter(result, PyLucene.StopAnalyzer.ENGLISH_STOP_WORDS) - - return result - -def create_index(self, arg): - """ Post download setup callback for creating a lucene index """ - - moreinfo("Creating lucene index") - storeDir = "index" - if not os.path.exists(storeDir): - os.mkdir(storeDir) - - store = PyLucene.FSDirectory.getDirectory(storeDir, True) - - self.lucene_writer = PyLucene.IndexWriter(store, PyLucene.StandardAnalyzer(), True) - # Uncomment this line to enable a PorterStemmer analyzer - # self.lucene_writer = PyLucene.IndexWriter(store, PorterStemmerAnalyzer(), True) - self.lucene_writer.setMaxFieldLength(1048576) - - count = 0 - - urllist = [] - - for node in self._urldb.preorder(): - urlobj = node.get() - - # Only index if web-page or document - if not urlobj.is_webpage() and not urlobj.is_document(): continue - - filename = urlobj.get_full_filename() - url = urlobj.get_full_url() - - try: - urllist.index(urlobj.index) - continue - except ValueError: - urllist.append(urlobj.index) - - if not os.path.isfile(filename): continue - - data = '' - - moreinfo('Adding index for URL',url) - - try: - data = unicode(open(filename).read(), 'iso-8859-1') - except UnicodeDecodeError, e: - data = '' - - try: - doc = PyLucene.Document() - doc.add(PyLucene.Field("name", 'file://' + filename, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.UN_TOKENIZED)) - doc.add(PyLucene.Field("path", url, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.UN_TOKENIZED)) - if data and len(data) > 0: - doc.add(PyLucene.Field("contents", data, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.TOKENIZED)) - else: - extrainfo("warning: no content in %s" % filename) - - self.lucene_writer.addDocument(doc) - except PyLucene.JavaError, e: - print e - - count += 1 - - moreinfo('Created lucene index for %d documents' % count) - moreinfo('Optimizing lucene index') - self.lucene_writer.optimize() - self.lucene_writer.close() - -def apply_plugin(): - """ Apply the plugin - overrideable method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook/plugin function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. 
- - cfg = objects.config - - hooks.register_post_callback_method('datamgr:post_download_setup_callback', - create_index) - #logger.disableConsoleLogging() - # Turn off session-saver feature - cfg.savesessions = False - # Turn off interrupt handling - # cfg.ignoreinterrupts = True - # No need for localising - cfg.localise = 0 - # Turn off image downloading - cfg.images = 0 - # Turn off caching - cfg.pagecache = 0 diff --git a/HarvestMan-lite/harvestman/ext/lucene/IndexFiles.py b/HarvestMan-lite/harvestman/ext/lucene/IndexFiles.py deleted file mode 100755 index 789aa95..0000000 --- a/HarvestMan-lite/harvestman/ext/lucene/IndexFiles.py +++ /dev/null @@ -1,85 +0,0 @@ -# -- coding: utf-8 -#!/usr/bin/env python - -import sys, os, PyLucene, threading, time -from datetime import datetime - -""" -This class is loosely based on the Lucene (java implementation) demo class -org.apache.lucene.demo.IndexFiles. It will take a directory as an argument -and will index all of the files in that directory and downward recursively. -It will index on the file path, the file name and the file contents. The -resulting Lucene index will be placed in the current directory and called -'index'. -""" - -class Ticker(object): - - def __init__(self): - self.tick = True - - def run(self): - while self.tick: - sys.stdout.write('.') - sys.stdout.flush() - time.sleep(1.0) - -class IndexFiles(object): - """Usage: python IndexFiles """ - - def __init__(self, root, storeDir, analyzer): - - if not os.path.exists(storeDir): - os.mkdir(storeDir) - store = PyLucene.FSDirectory.getDirectory(storeDir, True) - writer = PyLucene.IndexWriter(store, analyzer, True) - writer.setMaxFieldLength(1048576) - self.indexDocs(root, writer) - ticker = Ticker() - print 'optimizing index', - threading.Thread(target=ticker.run).start() - writer.optimize() - writer.close() - ticker.tick = False - print 'done' - - def indexDocs(self, root, writer): - for root, dirnames, filenames in os.walk(root): - for filename in filenames: - #if not filename.endswith('.txt'): - # continue - print "adding", filename - try: - path = os.path.join(root, filename) - file = open(path) - contents = unicode(file.read(), 'iso-8859-1') - file.close() - doc = PyLucene.Document() - doc.add(PyLucene.Field("name", filename, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.UN_TOKENIZED)) - doc.add(PyLucene.Field("path", path, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.UN_TOKENIZED)) - if len(contents) > 0: - doc.add(PyLucene.Field("contents", contents, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.TOKENIZED)) - else: - print "warning: no content in %s" % filename - writer.addDocument(doc) - except Exception, e: - print "Failed in indexDocs:", e - -if __name__ == '__main__': - if len(sys.argv) < 2: - print IndexFiles.__doc__ - sys.exit(1) - print 'PyLucene', PyLucene.VERSION, 'Lucene', PyLucene.LUCENE_VERSION - start = datetime.now() - try: - IndexFiles(sys.argv[1], "index", PyLucene.StandardAnalyzer()) - end = datetime.now() - print end - start - except Exception, e: - print "Failed: ", e diff --git a/HarvestMan-lite/harvestman/ext/lucene/SearchFiles.py b/HarvestMan-lite/harvestman/ext/lucene/SearchFiles.py deleted file mode 100755 index a9bfa0a..0000000 --- a/HarvestMan-lite/harvestman/ext/lucene/SearchFiles.py +++ /dev/null @@ -1,40 +0,0 @@ -# -- coding: utf-8 -#!/usr/bin/env python -from PyLucene import QueryParser, IndexSearcher, StandardAnalyzer, FSDirectory -from PyLucene import VERSION, LUCENE_VERSION - -""" -This script is loosely based on the Lucene (java 
implementation) demo class -org.apache.lucene.demo.SearchFiles. It will prompt for a search query, then it -will search the Lucene index in the current directory called 'index' for the -search query entered against the 'contents' field. It will then display the -'path' and 'name' fields for each of the hits it finds in the index. Note that -search.close() is currently commented out because it causes a stack overflow in -some cases. -""" -def run(searcher, analyzer): - - while True: - print - print "Hit enter with no input to quit." - command = raw_input("Query:") - if command == '': - return - - print - print "Searching for:", command - query = QueryParser("contents", analyzer).parse(command) - hits = searcher.search(query) - print "%s total matching documents" % hits.length() - - for i, doc in hits: - print 'path:', doc.get("path"), 'name:', doc.get("name"), 100*hits.score(i) - -if __name__ == '__main__': - STORE_DIR = "index" - print 'PyLucene', VERSION, 'Lucene', LUCENE_VERSION - directory = FSDirectory.getDirectory(STORE_DIR, False) - searcher = IndexSearcher(directory) - analyzer = StandardAnalyzer() - run(searcher, analyzer) - searcher.close() diff --git a/HarvestMan-lite/harvestman/ext/simulator.py b/HarvestMan-lite/harvestman/ext/simulator.py deleted file mode 100755 index 66bac2d..0000000 --- a/HarvestMan-lite/harvestman/ext/simulator.py +++ /dev/null @@ -1,57 +0,0 @@ -# -- coding: utf-8 -""" Simulator plugin for HarvestMan. This -plugin changes the behaviour of HarvestMan -to only simulate crawling without actually -downloading anything. - -Author: Anand B Pillai - -Created Feb 7 2007 Anand B Pillai - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -from harvestman.lib import hooks -from harvestman.lib.common.common import * -from harvestman.lib.common.macros import CONNECTOR_DATA_MODE_INMEM - -def save_url(self, urlobj): - - # For simulation, we need to modify the behaviour - # of save_url function in HarvestManUrlConnector class. - # This is achieved by injecting this function as a plugin - # Note that the signatures of both functions have to - # be the same. - - url = urlobj.get_full_url() - self.connect(urlobj, True, self._cfg.retryfailed) - - return 6 - -def apply_plugin(): - """ All plugin modules need to define this method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. - - cfg = objects.config - cfg.simulate = True - cfg.localise = 0 - hooks.register_plugin_function('connector:save_url_plugin', save_url) - # Turn off caching, since no files are saved - cfg.pagecache = 0 - # Turn off header dumping, since no files are saved - cfg.urlheaders = 0 - # For simulator, we need in-mem data mode - # since files are never saved! - cfg.datamode = CONNECTOR_DATA_MODE_INMEM - logconsole('Simulation mode turned on. Crawl will be simulated and no files will be saved.') diff --git a/HarvestMan-lite/harvestman/ext/spam.py b/HarvestMan-lite/harvestman/ext/spam.py deleted file mode 100755 index 3435f8d..0000000 --- a/HarvestMan-lite/harvestman/ext/spam.py +++ /dev/null @@ -1,34 +0,0 @@ -# -- coding: utf-8 -""" Test plugin for HarvestMan. This demonstrates -how to write a simple plugin based on callbacks. 
- -Author: Anand B Pillai - -Created Feb 7 2007 Anand B Pillai - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -def func(self): - print 'Before running projects...' - -def apply_plugin(): - """ All plugin modules need to define this method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. - - hooks.register_pre_callback_method('harvestman:run_projects_callback', func) - diff --git a/HarvestMan-lite/harvestman/ext/swish-e.py b/HarvestMan-lite/harvestman/ext/swish-e.py deleted file mode 100755 index fc8c5c6..0000000 --- a/HarvestMan-lite/harvestman/ext/swish-e.py +++ /dev/null @@ -1,115 +0,0 @@ -# -- coding: utf-8 -""" Swish-e plugin to HarvestMan. This plugin modifies the -behaviour of HarvestMan to work as an external crawler program -for the swish-e search engine {http://swish-e.org} - -The data format is according to the guidelines given -at http://swish-e.org/docs/swish-run.html#indexing. - -Author: Anand B Pillai - -Created Feb 8 2007 Anand B Pillai -Modified Feb 17 2007 Anand B Pillai Modified logic to use callbacks - instead of hooks. The logic is - in a post callback registered - at context crawler:fetcher_process_url_callback. - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -import sys, os -import time -from types import StringTypes - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -urllist = [] - -def process_url(self, data): - """ Post process url callback for swish-e """ - - if (type(data) in StringTypes) and len(data): - global urllist - urllist.append(self._urlobject.get_full_url()) - - try: - data = data.encode('ascii', 'ignore') - l = len(data) - s = '' - - # Code which works for www.python.org/doc/current/tut/tut.html - # and for swish-e.org/docs - # if len(data) != len(data.strip()): - # data = data.strip() - # l = len(data) + 1 - - add = 0 - if l != len(data.strip()): - # print l, len(data.strip()), self._urlobject.get_full_url() - data = data.strip() - l = len(data) + 1 - # print l - - if self.wp.can_index: - s ="Path-Name:%s\nContent-Length:%d\n\n%s" % (self._urlobject.get_full_url(), - l, - data) - # Swish-e seems to be very sensitive to any additional - # blank lines between content and headers. So stripping - # the data of trailing and preceding newlines is important. - # print data.strip() - try: - print str(s) - except IOError, e: - # global urllist - # open('err.out','w').write('\n'.join(urllist)) - objects.queuemgr.endloop() - - return data - except UnicodeDecodeError, e: - # open('uni.out','a').write(self._urlobject.get_full_url() + '\n') - return data - - return data - - -def apply_plugin(): - """ Apply the plugin - overrideable method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook/plugin function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. 
- - cfg = objects.config - - # Makes sense to activate the callback only if swish-integration - # is enabled. - hooks.register_post_callback_method('crawler:fetcher_process_url_callback', - process_url) - # Turn off caching, since no files are saved - cfg.pagecache = 0 - # Turn off console-logging - logger = objects.logger - #logger.disableConsoleLogging() - # Turn off session-saver feature - cfg.savesessions = False - # Turn off interrupt handling - # cfg.ignoreinterrupts = True - # No need for localising - cfg.localise = 0 - # Turn off image downloading - cfg.images = 0 - # Increase sleep time - cfg.sleeptime = 1.0 - # sys.stderr = open('swish-errors.txt','wb') - # cfg.maxtrackers = 2 - cfg.usethreads = 0 diff --git a/HarvestMan-lite/harvestman/ext/swish-e/HOWTO.swish-e b/HarvestMan-lite/harvestman/ext/swish-e/HOWTO.swish-e deleted file mode 100755 index 6735e42..0000000 --- a/HarvestMan-lite/harvestman/ext/swish-e/HOWTO.swish-e +++ /dev/null @@ -1,122 +0,0 @@ -Using HarvestMan with swish-e ------------------------------ -HarvestMan can be used as an external crawler program for swish-e -indexer {http://www.swish-e.org}. The swish-e support for -HarvestMan is built into the swish-e plugin present in the plugins -folder. - -Swish-e configuration ---------------------- -In order to use swish-e with HarvestMan, an appropriate configuration -file needs to be generated. A sample configuration file is available -in this folder as swish-config.conf. Typically this configuration -file only contains two directives - -IndexDir -SwishProgParameters - -"IndexDir" is the path to the external crawler program. If HarvestMan -is installed in your machine, this would be "harvesttman". If the -PATH where HarvestMan is present is not part of the PATH environment -variable, you need to specify the full path. - -"SwishProgParameters" is the parameters required for the external -program. Here you can specify the parameters required for HarvestMan. - - -HarvestMan configuration for swish-e ------------------------------------- -In HarvestMan, there are two ways to load plugins like swish-e. -Either the plugin can be given as a command-line parameter using the --g/--plugins option, or it can be specified in the configuration file -by editing the "plugins" element and adding an appopriate plugin -element with its "enable" attribute set to 1. For more information -read the HOWTO.plugins document in the "doc" folder. - -There are also two ways to pass URL and other options. The suggested -way is to create an appropriate configuration file and put all the -options there. If the file is the default 'config.xml' present in -the current directory or the user's .harvestman directory, there is -no need to specify this file. In such case, "SwishProgParameters" -is empty and should not be specified. In this case the swish configuration -file will look like, - -IndexDir harvestman - -However, if the configuration file name is different, it has to be -passed to HarvestMan with the -C option. In order to enable swish-e, -the "enable" attribute of the swish-e plugin element should be set to -1 in this file. In this case the swish configuration file will look like, - -IndexDir harvestman -SwishProgParameters -C - -The other way is to specify a URL and other options in the command line -and pass it to HarvestMan. This typically can be used for the simplest -crawl which do not require a lot of customization. 
For example, - -IndexDir harvestman -SwishProgParameters -g swish-e http://swish-e.org/docs/ - -The last line instructs HarvestMan to crawl http://www.swish-e.org/docs . -Swish-e will in turn index the content of files contained at ths URL. - -NOTE: If you have more than three parameters to customize it is better to -use a configuration file than specifying them on the command line. - -Running directly from source ----------------------------- -In case you prefer to run HarvestMan directly from the source tree -with swish-e without installing it, the above mentioned configuration -would not work. - -In this case there are two ways of writing the configuration. The simplest -way is to make the harvestman.py module executable and use the -following configuration. - -IndexDir /harvestman.py -SwishProgParameters - -where is the relative path to where HarvestMan source code is -present. If it is the current directory, this would be '.'. - -The second way is to run harvestman.py as an argument to Python. In -this case the following configuration need to be used. - -IndexDir python -SwishProgParameters /harvestman.py - -In this case, the main program becomes Python and path to harvestman.py -is passed as the first part of SwishProgParameters param value. - -Running swish-e ---------------- -Once the appropriate swish configuration file is written, swish-e can -be run with HarvestMan as follows - -swish-e -c -S prog - -Once crawling and indexing starts, swish-e prints an output like, - -$ swish-e -c swish-config.cong -S prog - -Indexing Data Source: "External-Program" -Indexing "harvestman" -External Program found: /usr/bin/harvestman - -If everything goes well, the indexing will terminate soon after -the crawling is completed and an index summary is printed. - - - - - - - - - - - - - - diff --git a/HarvestMan-lite/harvestman/ext/swish-e/README.txt b/HarvestMan-lite/harvestman/ext/swish-e/README.txt deleted file mode 100755 index 8b37e2d..0000000 --- a/HarvestMan-lite/harvestman/ext/swish-e/README.txt +++ /dev/null @@ -1,2 +0,0 @@ -This folder contains sample files/code which demonstrates -the usage of plugins with HarvestMan. \ No newline at end of file diff --git a/HarvestMan-lite/harvestman/ext/swish-e/swish-config.conf b/HarvestMan-lite/harvestman/ext/swish-e/swish-config.conf deleted file mode 100755 index 8cb4ba5..0000000 --- a/HarvestMan-lite/harvestman/ext/swish-e/swish-config.conf +++ /dev/null @@ -1,10 +0,0 @@ -## Sample configuration file for HarvestMan integration with swish-e. -## See http://swish-e.org/docs/swish-run.html#indexing for more information. - -# Indexing program to use -IndexDir ./harvestman.py -# Parameters to pass to the Indexing program -# Change the last parameter to your own URL or configuration file. -# SwishProgParameters -g swish-e http://swish-e.org/docs -SwishProgParameters -g swish-e http://www.python.org/doc/current/ -# SwishProgParameters -g swish-e http://www.woogroups.com diff --git a/HarvestMan-lite/harvestman/ext/userbrowse.py b/HarvestMan-lite/harvestman/ext/userbrowse.py deleted file mode 100755 index 5071b36..0000000 --- a/HarvestMan-lite/harvestman/ext/userbrowse.py +++ /dev/null @@ -1,53 +0,0 @@ -# -- coding: utf-8 -""" User browse plugin. Simulate a scenario of a user -browsing a web-page. 
- -(Requested by Roy Cheeran) - -Author: Anand B Pillai - -Created Aug 13 2007 Anand B Pillai - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -# User browsing plugin approximates how a webpage -# presents itself to a user. This means a few things -# -# 1. All images and stylesheets referenced by the page are fetched. -# 2. In addition, all links directly linked from the page are -# fetched and saved to disk. Nothing further is crawled. -# -# This is done by using a fetchlevel control of 2, a depth -# control of 0, and allowing images & stylesheets to skip -# constraints. - -def apply_plugin(): - """ Apply the plugin - overrideable method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook/plugin function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. - - cfg = objects.config - # Set depth to 0 - cfg.depth = 0 - # Set fetchlevel to 2 - cfg.fetchlevel = 2 - # Images & stylesheets will skip rules - cfg.skipruletypes = ['image','stylesheet'] - # One might have to set robots to 0 - # sometimes to fetch images - uncomment this - # in such a case. - # cfg.robots = 0 diff --git a/HarvestMan-lite/harvestman/lib/__init__.py b/HarvestMan-lite/harvestman/lib/__init__.py deleted file mode 100755 index e69de29..0000000 diff --git a/HarvestMan-lite/harvestman/lib/common/__init__.py b/HarvestMan-lite/harvestman/lib/common/__init__.py deleted file mode 100755 index e69de29..0000000 diff --git a/HarvestMan-lite/harvestman/lib/common/bst.py b/HarvestMan-lite/harvestman/lib/common/bst.py deleted file mode 100755 index edf7f03..0000000 --- a/HarvestMan-lite/harvestman/lib/common/bst.py +++ /dev/null @@ -1,544 +0,0 @@ -""" -bst.py - Basic binary search tree in Python with automated disk caching at -the nodes. This is not a full-fledged implementation since it does not -implement node deletion, tree balancing etc. - -Created Anand B Pillai Feb 13 2008 -Modified Anand B Pillai Make BST use bsddb caching (experimental!) - -Copyright (C) 2008, Anand B Pillai. - -""" - -import cPickle -import math -import sys -import os -import shutil -import weakref -import bsddb - -from dictcache import DictCache - -class BSTNode(dict): - """ Node class for a BST """ - - def __init__(self, key, val, left=None, right=None, tree=None): - self.key = key - self[key] = val - self['left'] = left - self['right'] = right - # Mode flag - # 0 => mem - # 1 => disk - self.mode = 0 - # Number of gets - self.cgets = 0 - # Number of loads - self.cloads = 0 - # Link back to the tree - self.tree = weakref.proxy(tree) - - def __getitem__(self, key): - - try: - return super(BSTNode, self).__getitem__(key) - except KeyError: - return None - - def set(self, value): - self[self.key] = value - if self.mode == 1: - # Already dumped - self.mode = 0 - self.dump() - - def get(self): - - if self.mode==0: - self.cgets += 1 - return self[self.key] - else: - self.cloads += 1 - self.load() - return self[self.key] - - def is_balanced(self, level=1): - - # Return if this node is balanced - # The node balance check is done per - # level. Default is 1 which means we - # check whether this node has both left - # and right children. 
If level is 2, this - # is done at one more level, i.e for the - # child nodes also... - - # Leaf node is not balanced... - if self['left']==None and self['right']==None: - return False - - while level>0: - level -= 1 - - if self['left'] !=None and self['right'] != None: - if level: - return self['left'].is_balanced(level) and \ - self['right'].is_balanced(level) - else: - return True - else: - return False - - return False - - def load(self, recursive=False): - - # Load values from disk - try: - # Don't load if mode is 0 and value is not None - if self.mode==1 and self[self.key] == None: - self[self.key] = self.tree.from_cache(self.key) - self.mode = 0 - - if recursive: - left = self['left'] - if left: left.load(True) - right = self['right'] - if right: right.load(True) - - except Exception, e: - raise - - def dump(self, recursive=False): - - try: - if self.mode==0 and self[self.key] != None: - self.tree.to_cache(self.key, self[self.key]) - self[self.key]=None - self.mode = 1 - else: - # Dont do anything - pass - - if recursive: - left = self['left'] - if left: left.dump(True) - right = self['right'] - if right: right.dump(True) - - except Exception, e: - raise - - def clear(self): - - # Clear removes the data from memory as well as from disk - try: - del self[self.key] - except KeyError: - pass - - left = self['left'] - right = self['right'] - - if left: - left.clear() - if right: - right.clear() - - super(BSTNode, self).clear() - -class BST(object): - """ BST class with automated disk caching of node values """ - - # Increase the recursion limit for large trees - sys.setrecursionlimit(20000) - - def __init__(self, key=None, val=None): - # Size of tree - self.size = 0 - # Height of tree - self.height = 0 - # 'Hardened' flag - if the data structure - # is dumped to disk fully, the flag hard - # is set to True - self.hard = False - # Autocommit mode - self.auto = False - # Autocommit mode level - self.autolevel = 0 - # Current auto left node for autocommit - self.autocurr_l = None - # Current auto right node for autocommit - self.autocurr_r = None - # For stats - # Total number of lookups - self.nlookups = 0 - # Total number of in-mem lookups - self.ngets = 0 - # Total number of disk loads - self.nloads = 0 - self.root = None - if key: - self.root = self.insert(key, val) - self.bdir = '' - self.diskcache = None - - def __del__(self): - self.clear() - - def to_cache(self, key, val): - self.diskcache[str(key)] = cPickle.dumps(val) - self.diskcache.sync() - - def from_cache(self, key): - return cPickle.loads(self.diskcache[str(key)]) - - def addNode(self, key, val): - self.size += 1 - self.height = int(math.ceil(math.log(self.size+1, 2))) - node = BSTNode(key, val, tree=self) - - if self.auto and self.autolevel and self.size>1: - # print 'Auto-dumping...', self.size - if self.size % self.autolevel==0: - self.dump(self.autocurr_l) - # Set autocurr to this node - self.autocurr_l = node - - #if self.autocurr_l and self.autocurr_l.is_balanced(self.autolevel): - # print 'Auto-dumping...', self.autocurr_l.key - # self.dump(self.autocurr_l) - # curr = self.autocurr_l - # # Set autocurr to the children of this node - # self.autocurr_l = curr.left - # self.autocurr_r = curr.right - # print 'Left=>',self.autocurr_l - # print 'Right=>',self.autocurr_r - # print 'Root=>',self.root.key - - #if self.autocurr_r == self.autocurr_l: - # return node - - #if self.autocurr_r and self.autocurr_r.is_balanced(self.autolevel): - # print 'Auto-dumping...', self.autocurr_r.key - # self.dump(self.autocurr_r) - # curr = 
autocurr_r - # # Set autocurr to the children of this node - # self.autocurr_l = curr.left - # self.autocurr_r = curr.right - - - return node - - def __insert(self, root, key, val): - - if root==None: - return self.addNode(key, val) - - else: - if key<=root.key: - # Goes to left subtree - # print 'Inserting on left subtree...', key - root['left'] = self.__insert(root['left'], key, val) - else: - # Goes to right subtree - # print 'Inserting on right subtree...', key - root['right'] = self.__insert(root['right'], key, val) - - return root - - def __lookup(self, root, key): - - if root == None: - return None - else: - if key==root.key: - # Note we are returning the value - return root.get() - else: - if key < root.key: - return self.__lookup(root['left'], key) - else: - return self.__lookup(root['right'], key) - - def __update(self, root, key, newval): - - if root == None: - return None - else: - if key==root.key: - root.set(newval) - else: - if key < root.key: - return self.__update(root['left'], key, newval) - else: - return self.__update(root['right'], key, newval) - - def insert(self, key, val): - node = self.__insert(self.root, key, val) - - if self.root == None: - self.root = node - # Set auto node - self.autocurr_l = self.autocurr_r = self.root - - # If node is added to left of current autocurrent node.. - - return node - - def lookup(self, key): - return self.__lookup(self.root, key) - - def update(self, key, newval): - self.__update(self.root, key, newval) - - def __inorder(self, root): - - if root != None: - for node in self.__inorder(root['left']): - yield node - yield root - for node in self.__inorder(root['right']): - yield node - - def inorder(self): - # Inorder traversal, yielding the nodes - - return self.__inorder(self.root) - - def __preorder(self, root): - - if root != None: - yield root - for node in self.__preorder(root['left']): - yield node - for node in self.__preorder(root['right']): - yield node - - def preorder(self): - # Inorder traversal, yielding the nodes - return self.__preorder(self.root) - - def __postorder(self, root): - - if root != None: - for node in self.__postorder(root['left']): - yield node - for node in self.__postorder(root['right']): - yield node - yield root - - def postorder(self): - # Inorder traversal, yielding the nodes - return self.__postorder(self.root) - - def minnode(self): - # Node with the minimum key value - - root = self.root - - while (root['left'] != None): - root = root['left'] - - return root - - def minkey(self): - - node = self.minnode() - return node.key - - def maxnode(self): - # Node with the maximum key value - - root = self.root - - while (root['right'] != None): - root = root['right'] - - return root - - def maxkey(self): - - node = self.maxnode() - return node.key - - def size_lhs(self): - - # Traverse pre-order and increment counts - if self.root == None: - return 0 - - root_left = self.root['left'] - count = 0 - - for node in self.__preorder(root_left): - count += 1 - - return count - - - def size_rhs(self): - - if self.root == None: - return 0 - - # Traverse pre-order and increment counts - root_right = self.root['right'] - count = 0 - - for node in self.__preorder(root_right): - count += 1 - - return count - - def size(self): - return self.count - - def stats(self): - - d = {'gets': 0, 'loads': 0} - self.__stats(self.root, d) - return d - - def __stats(self, root, d): - - if root != None: - d['gets'] += root.cgets - d['loads'] += root.cloads - self.__stats(root['left'], d) - self.__stats(root['right'], d) - - def 
dump(self, startnode=None): - - if startnode==None: - startnode = self.root - - if startnode==None: - return None - else: - startnode.dump(True) - - self.hard = True - - def load(self): - if self.root==None: - return None - - if self.hard: - self.root.load(True) - self.hard = False - - def reset(self): - self.size = 0 - self.height = 0 - self.hard = False - # Autocommit mode - self.auto = False - self.autolevel = 0 - self.autocurr_l = None - self.autocurr_r = None - self.nlookups = 0 - self.ngets = 0 - self.nloads = 0 - self.root = None - - def clear(self): - - if self.root: - self.root.clear() - - self.reset() - if self.diskcache: - self.diskcache.clear() - - # Remvoe the directory - if self.bdir and os.path.isdir(self.bdir): - try: - shutil.rmtree(self.bdir) - except Exception, e: - print e - - def set_auto(self, level): - # Enable auto commit and set level - # If auto commit is set to true, the tree - # is flushed to disk after the existing - # autocommit node is balanced at the - # level 'level'. The starting autocommit - # node is root by default. - self.auto = True - self.autolevel = level - # Directory for file representation - self.bdir = '.bidx' + str(hash(self)) - if not os.path.isdir(self.bdir): - try: - os.makedirs(self.bdir) - except Exception, e: - raise - - self.diskcache = bsddb.btopen('cache.db','n') # DictCache(10, self.bdir) - # self.diskcache.freq = self.autolevel - -if __name__ == "__main__": - b = BST() - b.set_auto(3) - print b.root - b.insert(4,[4]) - b.insert(3,[2]) - b.insert(2,[6]) - b.insert(1, [3]) - b.insert(5,[5]) - b.insert(6,[7]) - b.insert(0,[8]) - print b.size - print b.height - print b.lookup(4) - b.dump() - # Now try to lookup item 3 - print b.lookup(3) - print b.lookup(3) - print b.lookup(3) - # Load all - b.load() - print b.size, b.height - - # Do inorder - print 'Inorder...' - for node in b.inorder(): - print node.key,'=>',node[node.key] - # Do preorder - print 'Preorder...' - for node in b.preorder(): - print node.key,'=>',node[node.key] - # Do postorder - print 'Postorder...' - for node in b.postorder(): - print node.key,'=>',node[node.key] - - print 'LHS=>',b.size_lhs() - print 'RHS=>',b.size_rhs() - - # b.clear() - print b.stats() - root = b.root - print root.is_balanced() - print root.is_balanced(2) - - del b - - b= BST() - b.insert(10,[4]) - b.insert(5,[2]) - b.insert(2,[6]) - b.insert(7, [3]) - b.insert(14,[5]) - b.insert(12,[7]) - b.insert(15,[8]) - - root = b.root - print root.is_balanced(1) - print root.is_balanced(2) - print root.is_balanced(3) - - print 'LHS=>',b.size_lhs() - print 'RHS=>',b.size_rhs() - diff --git a/HarvestMan-lite/harvestman/lib/common/bst_orig.py b/HarvestMan-lite/harvestman/lib/common/bst_orig.py deleted file mode 100755 index be0dc1c..0000000 --- a/HarvestMan-lite/harvestman/lib/common/bst_orig.py +++ /dev/null @@ -1,489 +0,0 @@ -""" -bst.py - Basic binary search tree in Python with automated disk caching at -the nodes. This is not a full-fledged implementation since it does not -implement node deletion, tree balancing etc. - -Created Anand B Pillai Feb 13 2008 - -Copyright (C) 2008, Anand B Pillai. 
- -""" - - - -import cPickle -import math -import os -import shutil - -class BSTNode(dict): - """ Node class for a BST """ - - def __init__(self, key, val, left=None, right=None): - self.key = key - self[key] = val - self[0] = left - self[1] = right - # Mode flag - # 0 => mem - # 1 => disk - self.mode = 0 - # Cached idx filename - self.fname = '' - # Number of gets - self.cgets = 0 - # Number of loads - self.cloads = 0 - - def __getitem__(self, key): - - try: - return super(BSTNode, self).__getitem__(key) - except KeyError: - return None - - def set(self, value): - self.val = value - - def get(self): - - if self.mode==0: - self.cgets += 1 - return self[self.key] - else: - self.cloads += 1 - self.load() - return self[self.key] - - def is_balanced(self, level=1): - - # Return if this node is balanced - # The node balance check is done per - # level. Default is 1 which means we - # check whether this node has both left - # and right children. If level is 2, this - # is done at one more level, i.e for the - # child nodes also... - - # Leaf node is not balanced... - if self[0]==None and self[1]==None: - return False - - while level>0: - level -= 1 - - if self[0] !=None and self[1] != None: - if level: - return self[0].is_balanced(level) and \ - self[1].is_balanced(level) - else: - return True - else: - return False - - return False - - def load(self, recursive=False): - - # Load values from disk - try: - # Don't load if mode is 0 and value is not None - if self.mode==1 and self[self.key] == None: - self[self.key] = cPickle.load(open(self.fname, 'rb')) - self.mode = 0 - - if recursive: - left = self[0] - if left: left.load(True) - right = self[1] - if right: right.load(True) - - except cPickle.UnpicklingError, e: - raise - except Exception, e: - raise - - def dump(self, bdir, recursive=False): - - try: - if self.mode==0: - self.fname = os.path.join(bdir, str(self.key)) - cPickle.dump(self[self.key], open(self.fname, 'wb')) - # If dumping was done, set val to None to - # reclaim memory... 
- del self[self.key] - self.mode = 1 - else: - # Dont do anything - pass - - if recursive: - left = self[0] - if left: left.dump(bdir, True) - right = self[1] - if right: right.dump(bdir, True) - - except cPickle.PicklingError, e: - raise - except Exception, e: - raise - - def clear(self): - - # Clear removes the data from memory as well as from disk - self.val = None - if self.fname and os.path.isfile(self.fname): - try: - os.remove(self.fname) - except Exception, e: - print e - - left = self[0] - right = self[1] - - if left: - left.clear() - if right: - right.clear() - - super(BSTNode, self).clear() - - -class BST(object): - """ BST class with automated disk caching of node values """ - - def __init__(self, key=None, val=None): - # Size of tree - self.size = 0 - # Height of tree - self.height = 0 - # 'Hardened' flag - if the data structure - # is dumped to disk fully, the flag hard - # is set to True - self.hard = False - # Autocommit mode - self.auto = False - # Autocommit mode level - self.autolevel = 0 - # Current auto left node for autocommit - self.autocurr_l = None - # Current auto right node for autocommit - self.autocurr_r = None - # For stats - # Total number of lookups - self.nlookups = 0 - # Total number of in-mem lookups - self.ngets = 0 - # Total number of disk loads - self.nloads = 0 - # Directory for file representation - self.bdir = '.bidx' + str(hash(self)) - if not os.path.isdir(self.bdir): - try: - os.makedirs(self.bdir) - except Exception, e: - raise - - self.root = None - if key: - self.root = self.insert(key, val) - - def addNode(self, key, val): - self.size += 1 - self.height = int(math.ceil(math.log(self.size+1, 2))) - node = BSTNode(key, val) - - if self.auto and self.autolevel and self.size>1: - # Check if the node has become balanced at the - # requested level... 
- - if self.auto and self.autolevel: - # print 'Auto-dumping...', self.size - if self.size % self.autolevel==0: - self.dump(self.autocurr_l) - # Set autocurr to this node - self.autocurr_l = node - - #if self.autocurr_l and self.autocurr_l.is_balanced(self.autolevel): - # print 'Auto-dumping...', self.autocurr_l.key - # self.dump(self.autocurr_l) - # curr = self.autocurr_l - # # Set autocurr to the children of this node - # self.autocurr_l = curr.left - # self.autocurr_r = curr.right - # print 'Left=>',self.autocurr_l - # print 'Right=>',self.autocurr_r - # print 'Root=>',self.root.key - - #if self.autocurr_r == self.autocurr_l: - # return node - - #if self.autocurr_r and self.autocurr_r.is_balanced(self.autolevel): - # print 'Auto-dumping...', self.autocurr_r.key - # self.dump(self.autocurr_r) - # curr = autocurr_r - # # Set autocurr to the children of this node - # self.autocurr_l = curr.left - # self.autocurr_r = curr.right - - - return node - - def __insert(self, root, key, val): - - if root==None: - return self.addNode(key, val) - - else: - if key<=root.key: - # Goes to left subtree - # print 'Inserting on left subtree...', key - root[0] = self.__insert(root[0], key, val) - else: - # Goes to right subtree - # print 'Inserting on right subtree...', key - root[1] = self.__insert(root[1], key, val) - - return root - - def __lookup(self, root, key): - - if root == None: - return None - else: - if key==root.key: - # Note we are returning the value - return root.get() - else: - if key < root.key: - return self.__lookup(root[0], key) - else: - return self.__lookup(root[1], key) - - def insert(self, key, val): - node = self.__insert(self.root, key, val) - - if self.root == None: - self.root = node - # Set auto node - self.autocurr_l = self.autocurr_r = self.root - - # If node is added to left of current autocurrent node.. 
- - return node - - def lookup(self, key): - return self.__lookup(self.root, key) - - def __inorder(self, root): - - if root != None: - for node in self.__inorder(root[0]): - yield node - yield root - for node in self.__inorder(root[1]): - yield node - - def inorder(self): - # Inorder traversal, yielding the nodes - - return self.__inorder(self.root) - - def __preorder(self, root): - - if root != None: - yield root - for node in self.__preorder(root[0]): - yield node - for node in self.__preorder(root[1]): - yield node - - def preorder(self): - # Inorder traversal, yielding the nodes - return self.__preorder(self.root) - - def __postorder(self, root): - - if root != None: - for node in self.__postorder(root[0]): - yield node - for node in self.__postorder(root[1]): - yield node - yield root - - def postorder(self): - # Inorder traversal, yielding the nodes - return self.__postorder(self.root) - - def minnode(self): - # Node with the minimum key value - - root = self.root - - while (root[0] != None): - root = root[0] - - return root - - def minkey(self): - - node = self.minnode() - return node.key - - def maxnode(self): - # Node with the maximum key value - - root = self.root - - while (root[1] != None): - root = root[1] - - return root - - def maxkey(self): - - node = self.maxnode() - return node.key - - def size_lhs(self): - - # Return the node size on the LHS (excluding root) - root = self.root - count = 0 - - while root[0] != None: - root = root[0] - count += 1 - - return count - - def size_rhs(self): - - # Return the node size on the LHS (excluding root) - root = self.root - count = 0 - - while root[1] != None: - root = root[1] - count += 1 - - return count - - def size(self): - return self.count - - def stats(self): - - d = {'gets': 0, 'loads': 0} - self.__stats(self.root, d) - return d - - def __stats(self, root, d): - - if root != None: - d['gets'] += root.cgets - d['loads'] += root.cloads - self.__stats(root[0], d) - self.__stats(root[1], d) - - def dump(self, startnode=None): - - if startnode==None: - startnode = self.root - - if startnode==None: - return None - else: - startnode.dump(self.bdir, True) - - self.hard = True - - def load(self): - if self.root==None: - return None - - if self.hard: - self.root.load(True) - self.hard = False - - def clear(self): - - if self.root: - self.root.clear() - # Remvoe the directory - if self.bdir and os.path.isdir(self.bdir): - try: - shutil.rmtree(self.bdir) - except Exception, e: - print e - - def set_auto(self, level): - # Enable auto commit and set level - # If auto commit is set to true, the tree - # is flushed to disk after the existing - # autocommit node is balanced at the - # level 'level'. The starting autocommit - # node is root by default. - self.auto = True - self.autolevel = level - - -if __name__ == "__main__": - b = BST() - b.set_auto(3) - print b.root - b.insert(4,[4]) - b.insert(3,[2]) - b.insert(2,[6]) - b.insert(1, [3]) - b.insert(5,[5]) - b.insert(6,[7]) - b.insert(0,[8]) - print b.size - print b.height - print b.lookup(4) - #b.dump() - # Now try to lookup item 3 - print b.lookup(3) - print b.lookup(3) - print b.lookup(3) - # Load all - b.load() - print b.size, b.height - - # Do inorder - print 'Inorder...' - for node in b.inorder(): - print node.key,'=>',node[node.key] - # Do preorder - print 'Preorder...' - for node in b.preorder(): - print node.key,'=>',node[node.key] - # Do postorder - print 'Postorder...' 
- for node in b.postorder(): - print node.key,'=>',node[node.key] - - print b.size_lhs() - print b.size_rhs() - - # b.clear() - print b.stats() - root = b.root - print root.is_balanced() - print root.is_balanced(2) - del b - - b= BST() - b.insert(10,[4]) - b.insert(5,[2]) - b.insert(2,[6]) - b.insert(7, [3]) - b.insert(14,[5]) - b.insert(12,[7]) - b.insert(15,[8]) - - root = b.root - print root.is_balanced(1) - print root.is_balanced(2) - print root.is_balanced(3) diff --git a/HarvestMan-lite/harvestman/lib/common/common.py b/HarvestMan-lite/harvestman/lib/common/common.py deleted file mode 100755 index 14d2e5c..0000000 --- a/HarvestMan-lite/harvestman/lib/common/common.py +++ /dev/null @@ -1,603 +0,0 @@ -# -- coding: utf-8 -""" common.py - Global functions for HarvestMan Program. - This file is part of the HarvestMan software. - For licensing information, see file LICENSE.TXT. - - Author: Anand B Pillai - - Created: Jun 10 2003 - - Aug 17 2006 Anand Modifications for the new logging - module. - - Feb 7 2007 Anand Some changes. Added logconsole - function. Split Initialize() to - InitConfig() and InitLogger(). - Feb 26 2007 Anand Replaced urlmappings dictionary - with a WeakValueDictionary. - - Copyright (C) 2004 - Anand B Pillai. - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -import weakref -import os, sys -import socket -import binascii -import copy -import threading -import shelve -import cStringIO -import traceback -import threading -import collections -import random -import cStringIO -import tokenize - -from types import * -from singleton import Singleton - -class Alias(Singleton): - def __getattr__(self, name): - try: - return super(Alias, self).__getattr__(name) - except AttributeError: - return None - pass - -class AliasError(Exception): - pass - -class GlobalData(Singleton): - def __getattr__(self, name): - try: - return super(Alias, self).__getattr__(name) - except AttributeError: - return None - -# Namespace for global unique objects - -# This varible holds each global object in HarvestMan -# If any module redefines an 'objects' variable locally, it -# is doing at its own peril! -objects = Alias() - -# Namespace for global data -globaldata = GlobalData() -globaldata.userdebug = [] - - -class SleepEvent(object): - """ A class representing a timeout event. This can be - used to passively wait for a given time-period instead of - using time.sleep(...) """ - - def __init__(self, sleeptime): - self._sleeptime = sleeptime - self.evt = threading.Event() - self.evt.set() - - def sleep(self): - self.evt.clear() - self.evt.wait(self._sleeptime) - self.evt.set() - -class RandomSleepEvent(SleepEvent): - """ A class representing a timeout event. This can be - used to passively wait for a given time-period instead of - using time.sleep(...) 
""" - - def sleep(self): - self.evt.clear() - self.evt.wait(random.random()*self._sleeptime) - self.evt.set() - -class DummyStderr(object): - """ A dummy class to imitate stderr """ - - def write(self, msg): - pass - -class CaselessDict(dict): - - def __init__(self, mapping=None): - if mapping: - if type(mapping) is dict: - for k,v in d.items(): - self.__setitem__(k, v) - elif type(mapping) in (list, tuple): - d = dict(mapping) - for k,v in d.items(): - self.__setitem__(k, v) - - # super(CaselessDict, self).__init__(d) - - def __setitem__(self, name, value): - - if type(name) in StringTypes: - super(CaselessDict, self).__setitem__(name.lower(), value) - else: - super(CaselessDict, self).__setitem__(name, value) - - def __getitem__(self, name): - if type(name) in StringTypes: - return super(CaselessDict, self).__getitem__(name.lower()) - else: - return super(CaselessDict, self).__getitem__(name) - - def __copy__(self): - pass - - -class Ldeque(collections.deque): - """ Length-limited deque """ - - def __init__(self, count=10): - self.max = count - super(Ldeque, self).__init__() - - def append(self, item): - super(Ldeque, self).append(item) - if len(self)>self.max: - # if size exceeds, pop from left - self.popleft() - - def appendleft(self, item): - super(Ldeque, self).appendleft(item) - if len(self)>self.max: - # if size exceeds, pop from right - self.pop() - - def index(self, item): - """ Return the index of an item from the deque """ - - return list(self).index(item) - - def remove(self, item): - """ Remove an item from the deque """ - - idx = self.index(item) - self.__delitem__(idx) - -def SysExceptHook(typ, val, tracebak): - """ Dummy function to replace sys.excepthook """ - pass - - -def SetAlias(obj): - """ Set unique alias for the object """ - - # Alias is another name for the object, it should be unique - # The object's class should have a field name 'alias' - if getattr(obj, 'alias') == None: - raise AliasError, "object does not define 'alias' attribute!" - - setattr(objects, obj.alias, obj) - -def SetLogFile(): - - logfile = objects.config.logfile - if logfile: - objects.logger.setLogSeverity(objects.config.verbosity) - # If simulation is turned off, add file-handle - if not objects.config.simulate: - objects.logger.addLogHandler('FileHandler',logfile) - -def SetUserDebug(message): - """ Used to store error messages related - to user settings in the config file/project file. - These will be printed at the end of the program """ - - if message: - try: - globaldata.userdebug.index(message) - except: - globaldata.userdebug.append(message) - -def SetLogSeverity(): - objects.logger.setLogSeverity(objects.config.verbosity) - -def wasOrWere(val): - """ What it says """ - - if val > 1: return 'were' - else: return 'was' - -def plural((s, val)): - """ What it says """ - - if val>1: - if s[len(s)-1] == 'y': - return s[:len(s)-1]+'ies' - else: return s + 's' - else: - return s - -# file type identification functions -# this is the precursor of a more generic file identificator -# based on the '/etc/magic' file on unices. 
- -signatures = { "gif" : [0, ("GIF87a", "GIF89a")], - "jpeg" :[6, ("JFIF",)], - "bmp" : [0, ("BM6",)] - } -aliases = { "gif" : (), # common extension aliases - "jpeg" : ("jpg", "jpe", "jfif"), - "bmp" : ("dib",) } - -def bin_crypt(data): - """ Encryption using binascii and obfuscation """ - - if data=='': - return '' - - try: - return binascii.hexlify(obfuscate(data)) - except TypeError, e: - debug('Error in encrypting data: <',data,'>', e) - return data - except ValueError, e: - debug('Error in encrypting data: <',data,'>', e) - return data - -def bin_decrypt(data): - """ Decrypttion using binascii and deobfuscation """ - - if data=='': - return '' - - try: - return unobfuscate(binascii.unhexlify(data)) - except TypeError, e: - logconsole('Error in decrypting data: <',data,'>', e) - return data - except ValueError, e: - logconsole('Error in decrypting data: <',data,'>', e) - return data - - -def obfuscate(data): - """ Obfuscate a string using repeated xor """ - - out = "" - import operator - - e0=chr(operator.xor(ord(data[0]), ord(data[1]))) - out = "".join((out, e0)) - - x=1 - eprev=e0 - for x in range(1, len(data)): - ax=ord(data[x]) - ex=chr(operator.xor(ax, ord(eprev))) - out = "".join((out,ex)) - eprev = ex - - return out - -def unobfuscate(data): - """ Unobfuscate a xor obfuscated string """ - - out = "" - x=len(data) - 1 - - import operator - - while x>1: - apos=data[x] - aprevpos=data[x-1] - epos=chr(operator.xor(ord(apos), ord(aprevpos))) - out = "".join((out, epos)) - x -= 1 - - out=str(reduce(lambda x, y: y + x, out)) - e2, a2 = data[1], data[0] - a1=chr(operator.xor(ord(a2), ord(e2))) - a1 = "".join((a1, out)) - out = a1 - e1,a1=out[0], data[0] - a0=chr(operator.xor(ord(a1), ord(e1))) - a0 = "".join((a0, out)) - out = a0 - - return out - -def send_url(data, host, port): - - cfg = objects.config - if cfg.urlserver_protocol == 'tcp': - return send_url_tcp(data, host, port) - elif cfg.urlserver_protocol == 'udp': - return send_url_udp(data, host, port) - -def send_url_tcp(data, host, port): - """ Send url to url server """ - - # Return's server response if connection - # succeeded and null string if failed. - try: - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - sock.connect((host,port)) - sock.sendall(data) - response = sock.recv(8192) - sock.close() - return response - except socket.error, e: - # print 'url server error:',e - pass - - return '' - -def send_url_udp(data, host, port): - """ Send url to url server """ - - # Return's server response if connection - # succeeded and null string if failed. - try: - sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) - sock.sendto(data,0,(host, port)) - response, addr = sock.recvfrom(8192, 0) - sock.close() - return response - except socket.error: - pass - - return '' - -def ping_urlserver(host, port): - - cfg = objects.config - - if cfg.urlserver_protocol == 'tcp': - return ping_urlserver_tcp(host, port) - elif cfg.urlserver_protocol == 'udp': - return ping_urlserver_udp(host, port) - -def ping_urlserver_tcp(host, port): - """ Ping url server to see if it is alive """ - - # Returns server's response if server is - # alive & null string if server is not alive. 
- try: - debug('Pinging server at (%s:%d)' % (host, port)) - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - sock.connect((host,port)) - # Send a small packet - sock.sendall("ping") - response = sock.recv(8192) - if response: - debug('Url server is alive') - sock.close() - return response - except socket.error: - debug('Could not connect to (%s:%d)' % (host, port)) - return '' - -def ping_urlserver_udp(host, port): - """ Ping url server to see if it is alive """ - - # Returns server's response if server is - # alive & null string if server is not alive. - try: - debug('Pinging server at (%s:%d)' % (host, port)) - sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) - # Send a small packet - sock.sendto("ping", 0, (host,port)) - response, addr = sock.recvfrom(8192,0) - if response: - debug('Url server is alive') - sock.close() - return response - except socket.error: - debug('Could not connect to (%s:%d)' % (host, port)) - return '' - -def GetTempDir(): - """ Return the temporary directory """ - - # Currently used by hget - tmpdir = max(map(lambda x: os.environ.get(x, ''), ['TEMP','TMP','TEMPDIR','TMPDIR'])) - - if tmpdir=='': - # No temp dir env variable - if os.name == 'posix': - if os.path.isdir('/tmp'): - return '/tmp' - elif os.path.isdir('/usr/tmp'): - return '/usr/tmp' - elif os.name == 'nt': - profiledir = os.environ.get('USERPROFILE','') - if profiledir: - return os.path.join(profiledir,'Local Settings','Temp') - else: - return os.path.abspath(tmpdir) - -def GetMyTempDir(): - """ Return temporary directory for HarvestMan. Also creates - it if the directory is not there """ - - # This is tempdir/HarvestMan - tmpdir = os.path.join(GetTempDir(), 'harvestman') - if not os.path.isdir(tmpdir): - try: - os.makedirs(tmpdir) - except OSError, e: - return '' - - return tmpdir - -def debug(arg, *args): - """ Log information, will log if verbosity is equal to DEBUG level """ - - objects.logger.debug(arg, *args) - -def info(arg, *args): - """ Log information, will log if verbosity is <= INFO level """ - - objects.logger.info(arg, *args) - -def extrainfo(arg, *args): - """ Log information, will log if verbosity is <= EXTRAINFO level """ - - objects.logger.extrainfo(arg, *args) - -def warning(arg, *args): - """ Log information, will log if verbosity is <= WARNING level """ - - objects.logger.warning(arg, *args) - -def error(arg, *args): - """ Log information, will log if verbosity is <= ERROR level """ - - objects.logger.error(arg, *args) - -def critical(arg, *args): - """ Log information, will log if verbosity is <= CRITICAL level """ - - objects.logger.critical(arg, *args) - -def logconsole(arg, *args): - """ Log directly to sys.stdout using print """ - - # Setting verbosity to 5 will print maximum information - # plus maximum debugging information. - objects.logger.logconsole(arg, *args) - -def logtraceback(console=False): - """ Log the most recent exception traceback. By default - the trace goes only to the log file """ - - s = cStringIO.StringIO() - traceback.print_tb(sys.exc_info()[-1], None, s) - if not console: - objects.logger.disableConsoleLogging() - # Log to logger - objects.logger.debug(s.getvalue()) - # Enable console logging again - objects.logger.enableConsoleLogging() - -def hexit(arg): - """ Exit wrapper for HarvestMan """ - - print_traceback() - sys.exit(arg) - -def print_traceback(): - print 'Printing error traceback for debugging...' 
- traceback.print_tb(sys.exc_info()[-1], None, sys.stdout) - -# Effbot's simple_eval function which is a safe replacement -# for Python's eval for tuples... - -def atom(next, token): - if token[1] == "(": - out = [] - token = next() - while token[1] != ")": - out.append(atom(next, token)) - token = next() - if token[1] == ",": - token = next() - return tuple(out) - elif token[0] is tokenize.STRING: - return token[1][1:-1].decode("string-escape") - elif token[0] is tokenize.NUMBER: - try: - return int(token[1], 0) - except ValueError: - return float(token[1]) - raise SyntaxError("malformed expression (%s)" % token[1]) - -def simple_eval(source): - src = cStringIO.StringIO(source).readline - src = tokenize.generate_tokens(src) - res = atom(src.next, src.next()) - if src.next()[0] is not tokenize.ENDMARKER: - raise SyntaxError("bogus data after expression") - return res - -def set_aliases(path=None): - - if path != None: - sys.path.append(path) - - import config - SetAlias(config.HarvestManStateObject()) - - import datamgr - import rules - import connector - import urlqueue - import logger - import event - - SetAlias(logger.HarvestManLogger()) - - # Data manager object - dmgr = datamgr.HarvestManDataManager() - dmgr.initialize() - SetAlias(dmgr) - - # Rules checker object - ruleschecker = rules.HarvestManRulesChecker() - SetAlias(ruleschecker) - - # Connector manager object - connmgr = connector.HarvestManNetworkConnector() - SetAlias(connmgr) - - # Connector factory - conn_factory = connector.HarvestManUrlConnectorFactory(objects.config.connections) - SetAlias(conn_factory) - - queuemgr = urlqueue.HarvestManCrawlerQueue() - SetAlias(queuemgr) - - SetAlias(event.HarvestManEvent()) - -def test_sgmlop(): - """ Test whether sgmlop is available and working """ - - html="""\ - < - title>Test sgmlop - -

This is a pargraph

- -
Feb 13 2008 - -Copyright (C) 2008, Anand B Pillai. - -""" - -import os -import cPickle -import time -from threading import Semaphore - -PID = os.getpid() - -class DictCache(object): - """ A dictionary like object with pickled disk caching - which allows to store large amount of data with minimal - memory costs """ - - def __init__(self, frequency, tmpdir=''): - # Frequency at which commits are done to disk - self.freq = frequency - # Total number of commit cycles - self.cycles = 0 - self.curr = 0 - # Disk cache... - self.cache = {} - # Internal temporary cache - self.d = {} - self.dmutex = Semaphore(1) - # Last loaded cache dictionary from disk - self.dcache = {} - # disk cache hits - self.dhits = 0 - # in-mem cache hits - self.mhits = 0 - # temp dict hits - self.thits = 0 - self.tmpdir = tmpdir - if self.tmpdir: - self.froot = os.path.join(self.tmpdir, '.' + str(PID) + '_' + str(abs(hash(self)))) - else: - self.froot = '.' + str(PID) + '_' + str(abs(hash(self))) - self.t = 0 - - def __setitem__(self, key, value): - - try: - self.dmutex.acquire() - try: - self.d[key] = value - self.curr += 1 - if self.curr==self.freq: - self.cycles += 1 - # Dump the cache dictionary to disk... - fname = ''.join((self.froot,'#',str(self.cycles))) - # print self.d - cPickle.dump(self.d, open(fname, 'wb')) - # We index the cache keys and associate the - # cycle number to them, since the filename - # is further associated to the cycle number, - # finding the cache file associated to a - # dictionary key is a simple dictionary look-up - # operation, costing only O(1)... - for k in self.d.iterkeys(): - self.cache[k] = self.cycles - self.d.clear() - self.curr = 0 - except Exception, e: - import traceback - print 'Exception:',e, traceback.extract_stack() - traceback.print_stack() - finally: - self.dmutex.release() - - def __len__(self): - # Return the 'virtual' length of the - # dictionary - - # Length is the temporary cache length - # plus size of disk caches. This assumes - # that all the committed caches are still - # present... - return len(self.d) + self.cycles*self.freq - - def __getitem__(self, key): - try: - item = self.d[key] - self.thits += 1 - return item - except KeyError: - try: - item = self.dcache[key] - self.mhits += 1 - return item - except KeyError: - t1 = time.time() - # Load cache from disk... - # Cache filename lookup is an O(1) operation... 
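Before the O(1) filename lookup mentioned above continues below, here is a minimal usage sketch of DictCache as a whole; the import path and values are assumed for illustration only.
```python
# Usage sketch of the DictCache class above (import path assumed).
from harvestman.lib.common.dictcache import DictCache

cache = DictCache(100, tmpdir='/tmp')   # commit a batch to disk every 100 items
for i in range(1000):
    cache[i] = 'value-%d' % i           # every 100th set pickles the batch

print len(cache)          # "virtual" length: in-memory plus committed items
print cache[5]            # transparently reloaded from the pickled batch
print cache.get_stats()   # disk/memory/temp hit counters
cache.clear()             # removes the on-disk cache files
```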
- try: - fname = ''.join((self.froot,'#',str(self.cache[key]))) - except KeyError: - return None - try: - # Always caches the last loaded entry - self.dcache = cPickle.load(open(fname,'rb')) - self.dhits += 1 - # print time.time() - t1 - self.t += time.time() - t1 - - return self.dcache[key] - except (OSError, IOError, EOFError,KeyError), e: - return None - - def clear(self): - - try: - self.dmutex.acquire() - self.d.clear() - self.dcache.clear() - - # Remove cache filenames - for k in self.cache.itervalues(): - fname = ''.join((self.froot,'#',str(k))) - if os.path.isfile(fname): - # print 'Removing file',fname - os.remove(fname) - - self.cache.clear() - # Reset counters - self.curr = 0 - self.cycles = 0 - self.clear_counters() - finally: - self.dmutex.release() - - def clear_counters(self): - self.dhits = 0 - self.thits = 0 - self.mhits = 0 - self.t = 0 - - def get_stats(self): - """ Return stats as a dictionary """ - - if len(self): - average = float(self.t)/float(len(self)) - else: - average = 0.0 - - return { 'disk_hits' : self.dhits, - 'mem_hits' : self.mhits, - 'temp_hits' : self.thits, - 'time': self.t, - 'average' : average } - - def __del__(self): - self.clear() diff --git a/HarvestMan-lite/harvestman/lib/common/feedparser.py b/HarvestMan-lite/harvestman/lib/common/feedparser.py deleted file mode 100755 index bb802df..0000000 --- a/HarvestMan-lite/harvestman/lib/common/feedparser.py +++ /dev/null @@ -1,2858 +0,0 @@ -#!/usr/bin/env python -"""Universal feed parser - -Handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds - -Visit http://feedparser.org/ for the latest version -Visit http://feedparser.org/docs/ for the latest documentation - -Required: Python 2.1 or later -Recommended: Python 2.3 or later -Recommended: CJKCodecs and iconv_codec -""" - -__version__ = "4.1"# + "$Revision: 1.92 $"[11:15] + "-cvs" -__license__ = """Copyright (c) 2002-2006, Mark Pilgrim, All rights reserved. - -Redistribution and use in source and binary forms, with or without modification, -are permitted provided that the following conditions are met: - -* Redistributions of source code must retain the above copyright notice, - this list of conditions and the following disclaimer. -* Redistributions in binary form must reproduce the above copyright notice, - this list of conditions and the following disclaimer in the documentation - and/or other materials provided with the distribution. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS' -AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE.""" -__author__ = "Mark Pilgrim " -__contributors__ = ["Jason Diamond ", - "John Beimler ", - "Fazal Majid ", - "Aaron Swartz ", - "Kevin Marks "] -_debug = 0 - -# HTTP "User-Agent" header to send to servers when downloading feeds. -# If you are embedding feedparser in a larger application, you should -# change this to your application name and URL. 
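When the module is embedded, the User-Agent override described in the comment above is typically done in one of two ways; the agent string and feed URL below are placeholders.
```python
# Illustrative only (placeholder agent string): two ways to send a custom
# User-Agent when embedding this feedparser module.
import feedparser

# Override the module-level default for all subsequent calls...
feedparser.USER_AGENT = "MyCrawler/1.0 +http://example.com/"

# ...or pass an agent string for a single parse() call.
d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml",
                     agent="MyCrawler/1.0 +http://example.com/")
print d.feed.get('title')
```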
-USER_AGENT = "UniversalFeedParser/%s +http://feedparser.org/" % __version__ - -# HTTP "Accept" header to send to servers when downloading feeds. If you don't -# want to send an Accept header, set this to None. -ACCEPT_HEADER = "application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1" - -# List of preferred XML parsers, by SAX driver name. These will be tried first, -# but if they're not installed, Python will keep searching through its own list -# of pre-installed parsers until it finds one that supports everything we need. -PREFERRED_XML_PARSERS = ["drv_libxml2"] - -# If you want feedparser to automatically run HTML markup through HTML Tidy, set -# this to 1. Requires mxTidy -# or utidylib . -TIDY_MARKUP = 0 - -# List of Python interfaces for HTML Tidy, in order of preference. Only useful -# if TIDY_MARKUP = 1 -PREFERRED_TIDY_INTERFACES = ["uTidy", "mxTidy"] - -# ---------- required modules (should come with any Python distribution) ---------- -import sgmllib, re, sys, copy, urlparse, time, rfc822, types, cgi, urllib, urllib2 -try: - from cStringIO import StringIO as _StringIO -except: - from StringIO import StringIO as _StringIO - -# ---------- optional modules (feedparser will work without these, but with reduced functionality) ---------- - -# gzip is included with most Python distributions, but may not be available if you compiled your own -try: - import gzip -except: - gzip = None -try: - import zlib -except: - zlib = None - -# If a real XML parser is available, feedparser will attempt to use it. feedparser has -# been tested with the built-in SAX parser, PyXML, and libxml2. On platforms where the -# Python distribution does not come with an XML parser (such as Mac OS X 10.2 and some -# versions of FreeBSD), feedparser will quietly fall back on regex-based parsing. -try: - import xml.sax - xml.sax.make_parser(PREFERRED_XML_PARSERS) # test for valid parsers - from xml.sax.saxutils import escape as _xmlescape - _XML_AVAILABLE = 1 -except: - _XML_AVAILABLE = 0 - def _xmlescape(data): - data = data.replace('&', '&') - data = data.replace('>', '>') - data = data.replace('<', '<') - return data - -# base64 support for Atom feeds that contain embedded binary data -try: - import base64, binascii -except: - base64 = binascii = None - -# cjkcodecs and iconv_codec provide support for more character encodings. 
-# Both are available from http://cjkpython.i18n.org/ -try: - import cjkcodecs.aliases -except: - pass -try: - import iconv_codec -except: - pass - -# chardet library auto-detects character encodings -# Download from http://chardet.feedparser.org/ -try: - import chardet - if _debug: - import chardet.constants - chardet.constants._debug = 1 -except: - chardet = None - -# ---------- don't touch these ---------- -class ThingsNobodyCaresAboutButMe(Exception): pass -class CharacterEncodingOverride(ThingsNobodyCaresAboutButMe): pass -class CharacterEncodingUnknown(ThingsNobodyCaresAboutButMe): pass -class NonXMLContentType(ThingsNobodyCaresAboutButMe): pass -class UndeclaredNamespace(Exception): pass - -sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*') -sgmllib.special = re.compile('' % (tag, ''.join([' %s="%s"' % t for t in attrs])), escape=0) - - # match namespaces - if tag.find(':') <> -1: - prefix, suffix = tag.split(':', 1) - else: - prefix, suffix = '', tag - prefix = self.namespacemap.get(prefix, prefix) - if prefix: - prefix = prefix + '_' - - # special hack for better tracking of empty textinput/image elements in illformed feeds - if (not prefix) and tag not in ('title', 'link', 'description', 'name'): - self.intextinput = 0 - if (not prefix) and tag not in ('title', 'link', 'description', 'url', 'href', 'width', 'height'): - self.inimage = 0 - - # call special handler (if defined) or default handler - methodname = '_start_' + prefix + suffix - try: - method = getattr(self, methodname) - return method(attrsD) - except AttributeError: - return self.push(prefix + suffix, 1) - - def unknown_endtag(self, tag): - if _debug: sys.stderr.write('end %s\n' % tag) - # match namespaces - if tag.find(':') <> -1: - prefix, suffix = tag.split(':', 1) - else: - prefix, suffix = '', tag - prefix = self.namespacemap.get(prefix, prefix) - if prefix: - prefix = prefix + '_' - - # call special handler (if defined) or default handler - methodname = '_end_' + prefix + suffix - try: - method = getattr(self, methodname) - method() - except AttributeError: - self.pop(prefix + suffix) - - # track inline content - if self.incontent and self.contentparams.has_key('type') and not self.contentparams.get('type', 'xml').endswith('xml'): - # element declared itself as escaped markup, but it isn't really - self.contentparams['type'] = 'application/xhtml+xml' - if self.incontent and self.contentparams.get('type') == 'application/xhtml+xml': - tag = tag.split(':')[-1] - self.handle_data('' % tag, escape=0) - - # track xml:base and xml:lang going out of scope - if self.basestack: - self.basestack.pop() - if self.basestack and self.basestack[-1]: - self.baseuri = self.basestack[-1] - if self.langstack: - self.langstack.pop() - if self.langstack: # and (self.langstack[-1] is not None): - self.lang = self.langstack[-1] - - def handle_charref(self, ref): - # called for each character reference, e.g. for ' ', ref will be '160' - if not self.elementstack: return - ref = ref.lower() - if ref in ('34', '38', '39', '60', '62', 'x22', 'x26', 'x27', 'x3c', 'x3e'): - text = '&#%s;' % ref - else: - if ref[0] == 'x': - c = int(ref[1:], 16) - else: - c = int(ref) - text = unichr(c).encode('utf-8') - self.elementstack[-1][2].append(text) - - def handle_entityref(self, ref): - # called for each entity reference, e.g. 
The remainder of this diff covers the copy of `feedparser.py` (the Universal Feed Parser module) carried in the source tree. The code in question comprises:

  1. the rest of the `_FeedParserMixin` handlers: entity and character-reference handling, CDATA parsing, namespace tracking, the `push`/`pop` element stack (base64 decoding, relative-URI resolution, entity decoding and HTML sanitisation of element content), and the per-element `_start_*`/`_end_*` methods for RSS, Atom, CDF, Dublin Core and iTunes elements (channel/feed, image, textinput, author, contributor, dates, categories and tags, cloud, link, guid, title, description, summary, enclosures, source, content);
  1. `_StrictFeedParser`, the SAX `ContentHandler` used when an XML parser is available;
  1. `_BaseHTMLProcessor` (built on `sgmllib.SGMLParser`), `_LooseFeedParser`, and `_RelativeURIResolver`/`_resolveRelativeURIs` for resolving relative URIs inside embedded markup;
  1. `_HTMLSanitizer` and `_sanitizeHTML`, which strip disallowed elements and attributes and can optionally post-process the markup with uTidy or mxTidy;
  1. `_FeedURLHandler` and `_open_resource`, which accept a URL, filename, stream or string and handle ETag/If-Modified-Since conditional requests, gzip/deflate content encoding, inline basic authentication, and the User-Agent/Referer headers;
  1. the pluggable date parsers registered through `registerDateHandler`: ISO 8601, the Korean OnBlog and Nate formats, MS SQL, Greek and Hungarian 8-bit formats, W3DTF, and RFC 822;
  1. `_getCharacterEncoding`, `_toUTF8` and `_stripDoctype`, which sniff the document's character encoding (per RFC 3023 and section F of the XML specification) and normalise the data to UTF-8;
  1. the top-level `parse()` function, which returns a `FeedParserDict` carrying `feed`, `entries`, `bozo`, `etag`, `modified`, `href`, `status` and `headers`;
  1. the module's embedded revision history.
tags in -# encoded HTML (skadz); fixed unicode handling in normalize_attrs (ChrisL); -# fixed relative URI processing for guid (skadz); added ICBM support; added -# base64 support -#2.7.5 - 1/15/2004 - MAP - added workaround for malformed DOCTYPE (seen on many -# blogspot.com sites); added _debug variable -#2.7.6 - 1/16/2004 - MAP - fixed bug with StringIO importing -#3.0b3 - 1/23/2004 - MAP - parse entire feed with real XML parser (if available); -# added several new supported namespaces; fixed bug tracking naked markup in -# description; added support for enclosure; added support for source; re-added -# support for cloud which got dropped somehow; added support for expirationDate -#3.0b4 - 1/26/2004 - MAP - fixed xml:lang inheritance; fixed multiple bugs tracking -# xml:base URI, one for documents that don't define one explicitly and one for -# documents that define an outer and an inner xml:base that goes out of scope -# before the end of the document -#3.0b5 - 1/26/2004 - MAP - fixed bug parsing multiple links at feed level -#3.0b6 - 1/27/2004 - MAP - added feed type and version detection, result['version'] -# will be one of SUPPORTED_VERSIONS.keys() or empty string if unrecognized; -# added support for creativeCommons:license and cc:license; added support for -# full Atom content model in title, tagline, info, copyright, summary; fixed bug -# with gzip encoding (not always telling server we support it when we do) -#3.0b7 - 1/28/2004 - MAP - support Atom-style author element in author_detail -# (dictionary of 'name', 'url', 'email'); map author to author_detail if author -# contains name + email address -#3.0b8 - 1/28/2004 - MAP - added support for contributor -#3.0b9 - 1/29/2004 - MAP - fixed check for presence of dict function; added -# support for summary -#3.0b10 - 1/31/2004 - MAP - incorporated ISO-8601 date parsing routines from -# xml.util.iso8601 -#3.0b11 - 2/2/2004 - MAP - added 'rights' to list of elements that can contain -# dangerous markup; fiddled with decodeEntities (not right); liberalized -# date parsing even further -#3.0b12 - 2/6/2004 - MAP - fiddled with decodeEntities (still not right); -# added support to Atom 0.2 subtitle; added support for Atom content model -# in copyright; better sanitizing of dangerous HTML elements with end tags -# (script, frameset) -#3.0b13 - 2/8/2004 - MAP - better handling of empty HTML tags (br, hr, img, -# etc.) in embedded markup, in either HTML or XHTML form (
<br>, <br/>, <br />
) -#3.0b14 - 2/8/2004 - MAP - fixed CDATA handling in non-wellformed feeds under -# Python 2.1 -#3.0b15 - 2/11/2004 - MAP - fixed bug resolving relative links in wfw:commentRSS; -# fixed bug capturing author and contributor URL; fixed bug resolving relative -# links in author and contributor URL; fixed bug resolvin relative links in -# generator URL; added support for recognizing RSS 1.0; passed Simon Fell's -# namespace tests, and included them permanently in the test suite with his -# permission; fixed namespace handling under Python 2.1 -#3.0b16 - 2/12/2004 - MAP - fixed support for RSS 0.90 (broken in b15) -#3.0b17 - 2/13/2004 - MAP - determine character encoding as per RFC 3023 -#3.0b18 - 2/17/2004 - MAP - always map description to summary_detail (Andrei); -# use libxml2 (if available) -#3.0b19 - 3/15/2004 - MAP - fixed bug exploding author information when author -# name was in parentheses; removed ultra-problematic mxTidy support; patch to -# workaround crash in PyXML/expat when encountering invalid entities -# (MarkMoraes); support for textinput/textInput -#3.0b20 - 4/7/2004 - MAP - added CDF support -#3.0b21 - 4/14/2004 - MAP - added Hot RSS support -#3.0b22 - 4/19/2004 - MAP - changed 'channel' to 'feed', 'item' to 'entries' in -# results dict; changed results dict to allow getting values with results.key -# as well as results[key]; work around embedded illformed HTML with half -# a DOCTYPE; work around malformed Content-Type header; if character encoding -# is wrong, try several common ones before falling back to regexes (if this -# works, bozo_exception is set to CharacterEncodingOverride); fixed character -# encoding issues in BaseHTMLProcessor by tracking encoding and converting -# from Unicode to raw strings before feeding data to sgmllib.SGMLParser; -# convert each value in results to Unicode (if possible), even if using -# regex-based parsing -#3.0b23 - 4/21/2004 - MAP - fixed UnicodeDecodeError for feeds that contain -# high-bit characters in attributes in embedded HTML in description (thanks -# Thijs van de Vossen); moved guid, date, and date_parsed to mapped keys in -# FeedParserDict; tweaked FeedParserDict.has_key to return True if asking -# about a mapped key -#3.0fc1 - 4/23/2004 - MAP - made results.entries[0].links[0] and -# results.entries[0].enclosures[0] into FeedParserDict; fixed typo that could -# cause the same encoding to be tried twice (even if it failed the first time); -# fixed DOCTYPE stripping when DOCTYPE contained entity declarations; -# better textinput and image tracking in illformed RSS 1.0 feeds -#3.0fc2 - 5/10/2004 - MAP - added and passed Sam's amp tests; added and passed -# my blink tag tests -#3.0fc3 - 6/18/2004 - MAP - fixed bug in _changeEncodingDeclaration that -# failed to parse utf-16 encoded feeds; made source into a FeedParserDict; -# duplicate admin:generatorAgent/@rdf:resource in generator_detail.url; -# added support for image; refactored parse() fallback logic to try other -# encodings if SAX parsing fails (previously it would only try other encodings -# if re-encoding failed); remove unichr madness in normalize_attrs now that -# we're properly tracking encoding in and out of BaseHTMLProcessor; set -# feed.language from root-level xml:lang; set entry.id from rdf:about; -# send Accept header -#3.0 - 6/21/2004 - MAP - don't try iso-8859-1 (can't distinguish between -# iso-8859-1 and windows-1252 anyway, and most incorrectly marked feeds are -# windows-1252); fixed regression that could cause the same encoding to be -# tried twice (even 
if it failed the first time) -#3.0.1 - 6/22/2004 - MAP - default to us-ascii for all text/* content types; -# recover from malformed content-type header parameter with no equals sign -# ('text/xml; charset:iso-8859-1') -#3.1 - 6/28/2004 - MAP - added and passed tests for converting HTML entities -# to Unicode equivalents in illformed feeds (aaronsw); added and -# passed tests for converting character entities to Unicode equivalents -# in illformed feeds (aaronsw); test for valid parsers when setting -# XML_AVAILABLE; make version and encoding available when server returns -# a 304; add handlers parameter to pass arbitrary urllib2 handlers (like -# digest auth or proxy support); add code to parse username/password -# out of url and send as basic authentication; expose downloading-related -# exceptions in bozo_exception (aaronsw); added __contains__ method to -# FeedParserDict (aaronsw); added publisher_detail (aaronsw) -#3.2 - 7/3/2004 - MAP - use cjkcodecs and iconv_codec if available; always -# convert feed to UTF-8 before passing to XML parser; completely revamped -# logic for determining character encoding and attempting XML parsing -# (much faster); increased default timeout to 20 seconds; test for presence -# of Location header on redirects; added tests for many alternate character -# encodings; support various EBCDIC encodings; support UTF-16BE and -# UTF16-LE with or without a BOM; support UTF-8 with a BOM; support -# UTF-32BE and UTF-32LE with or without a BOM; fixed crashing bug if no -# XML parsers are available; added support for 'Content-encoding: deflate'; -# send blank 'Accept-encoding: ' header if neither gzip nor zlib modules -# are available -#3.3 - 7/15/2004 - MAP - optimize EBCDIC to ASCII conversion; fix obscure -# problem tracking xml:base and xml:lang if element declares it, child -# doesn't, first grandchild redeclares it, and second grandchild doesn't; -# refactored date parsing; defined public registerDateHandler so callers -# can add support for additional date formats at runtime; added support -# for OnBlog, Nate, MSSQL, Greek, and Hungarian dates (ytrewq1); added -# zopeCompatibilityHack() which turns FeedParserDict into a regular -# dictionary, required for Zope compatibility, and also makes command- -# line debugging easier because pprint module formats real dictionaries -# better than dictionary-like objects; added NonXMLContentType exception, -# which is stored in bozo_exception when a feed is served with a non-XML -# media type such as 'text/plain'; respect Content-Language as default -# language if not xml:lang is present; cloud dict is now FeedParserDict; -# generator dict is now FeedParserDict; better tracking of xml:lang, -# including support for xml:lang='' to unset the current language; -# recognize RSS 1.0 feeds even when RSS 1.0 namespace is not the default -# namespace; don't overwrite final status on redirects (scenarios: -# redirecting to a URL that returns 304, redirecting to a URL that -# redirects to another URL with a different type of redirect); add -# support for HTTP 303 redirects -#4.0 - MAP - support for relative URIs in xml:base attribute; fixed -# encoding issue with mxTidy (phopkins); preliminary support for RFC 3229; -# support for Atom 1.0; support for iTunes extensions; new 'tags' for -# categories/keywords/etc. 
as array of dict -# {'term': term, 'scheme': scheme, 'label': label} to match Atom 1.0 -# terminology; parse RFC 822-style dates with no time; lots of other -# bug fixes -#4.1 - MAP - removed socket timeout; added support for chardet library diff --git a/HarvestMan-lite/harvestman/lib/common/keepalive.py b/HarvestMan-lite/harvestman/lib/common/keepalive.py deleted file mode 100755 index 675febb..0000000 --- a/HarvestMan-lite/harvestman/lib/common/keepalive.py +++ /dev/null @@ -1,650 +0,0 @@ -# -- coding: utf-8 -# keepalive.py - Module which supports HTTP/HTTPS keep-alive connections -# on the same host using a thread-safe connection pool. -# -# Created Anand B Pillai Sep 10 2007 Code borrowed from urlgrabber -# project. -# -# Original copyright follows: -#--------------Original Copyright----------------------------------- -# This library is free software; you can redistribute it and/or -# modify it under the terms of the GNU Lesser General Public -# License as published by the Free Software Foundation; either -# version 2.1 of the License, or (at your option) any later version. -# -# This library is distributed in the hope that it will be useful, -# but WITHOUT ANY WARRANTY; without even the implied warranty of -# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -# Lesser General Public License for more details. -# -# You should have received a copy of the GNU Lesser General Public -# License along with this library; if not, write to the -# Free Software Foundation, Inc., -# 59 Temple Place, Suite 330, -# Boston, MA 02111-1307 USA - -# This file is part of urlgrabber, a high-level cross-protocol url-grabber -# Copyright 2002-2004 Michael D. Stenner, Ryan Tomayko -#------------Original Copyright--------------------------------------- -# -# - -__author__ = "Anand B Pillai" -__maintainer__ = "Anand B Pillai" -__version__ = "2.0 b1" - -"""An HTTP handler for urllib2 that supports HTTP 1.1 and keepalive. - ->>> import urllib2 ->>> from keepalive import HTTPHandler ->>> keepalive_handler = HTTPHandler() ->>> opener = urllib2.build_opener(keepalive_handler) ->>> urllib2.install_opener(opener) ->>> ->>> fo = urllib2.urlopen('http://www.python.org') - -If a connection to a given host is requested, and all of the existing -connections are still in use, another connection will be opened. If -the handler tries to use an existing connection but it fails in some -way, it will be closed and removed from the pool. - -To remove the handler, simply re-run build_opener with no arguments, and -install that opener. - -You can explicitly close connections by using the close_connection() -method of the returned file-like object (described below) or you can -use the handler methods: - - close_connection(host) - close_all() - open_connections() - -NOTE: using the close_connection and close_all methods of the handler -should be done with care when using multiple threads. 
- * there is nothing that prevents another thread from creating new - connections immediately after connections are closed - * no checks are done to prevent in-use connections from being closed - ->>> keepalive_handler.close_all() - -EXTRA ATTRIBUTES AND METHODS - - Upon a status of 200, the object returned has a few additional - attributes and methods, which should not be used if you want to - remain consistent with the normal urllib2-returned objects: - - close_connection() - close the connection to the host - readlines() - you know, readlines() - status - the return status (ie 404) - reason - english translation of status (ie 'File not found') - - If you want the best of both worlds, use this inside an - AttributeError-catching try: - - >>> try: status = fo.status - >>> except AttributeError: status = None - - Unfortunately, these are ONLY there if status == 200, so it's not - easy to distinguish between non-200 responses. The reason is that - urllib2 tries to do clever things with error codes 301, 302, 401, - and 407, and it wraps the object upon return. - - For python versions earlier than 2.4, you can avoid this fancy error - handling by setting the module-level global HANDLE_ERRORS to zero. - You see, prior to 2.4, it's the HTTP Handler's job to determine what - to handle specially, and what to just pass up. HANDLE_ERRORS == 0 - means "pass everything up". In python 2.4, however, this job no - longer belongs to the HTTP Handler and is now done by a NEW handler, - HTTPErrorProcessor. Here's the bottom line: - - python version < 2.4 - HANDLE_ERRORS == 1 (default) pass up 200, treat the rest as - errors - HANDLE_ERRORS == 0 pass everything up, error processing is - left to the calling code - python version >= 2.4 - HANDLE_ERRORS == 1 pass up 200, treat the rest as errors - HANDLE_ERRORS == 0 (default) pass everything up, let the - other handlers (specifically, - HTTPErrorProcessor) decide what to do - - In practice, setting the variable either way makes little difference - in python 2.4, so for the most consistent behavior across versions, - you probably just want to use the defaults, which will give you - exceptions on errors. 
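
    A minimal sketch (added for illustration, assuming the module is
    importable as keepalive, as in the examples above) of opting out of the
    fancy error handling and checking the status yourself:

    >>> import urllib2, keepalive
    >>> keepalive.HANDLE_ERRORS = 0   # pass non-200 responses up to the
    ...                               # other handlers / calling code
    >>> opener = urllib2.build_opener(keepalive.HTTPHandler())
    >>> urllib2.install_opener(opener)
    >>> fo = urllib2.urlopen('http://www.python.org')
    >>> try: status = fo.status
    ... except AttributeError: status = None
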
- -""" - -# $Id: keepalive.py,v 1.2 2007/10/08 20:52:00 pythonhacker Exp $ - -import urllib2 -import httplib -import socket -import thread - -class FakeLogger: - def debug(self, msg, *args): print msg % args - info = warning = error = debug - -DEBUG = None - -# import sslfactory - -import sys -if sys.version_info < (2, 4): HANDLE_ERRORS = 1 -else: HANDLE_ERRORS = 0 - -class ConnectionManager: - """ - The connection manager must be able to: - * keep track of all existing - """ - def __init__(self): - self._lock = thread.allocate_lock() - self._hostmap = {} # map hosts to a list of connections - self._connmap = {} # map connections to host - self._readymap = {} # map connection to ready state - - def add(self, host, connection, ready): - self._lock.acquire() - try: - if not self._hostmap.has_key(host): self._hostmap[host] = [] - self._hostmap[host].append(connection) - self._connmap[connection] = host - self._readymap[connection] = ready - finally: - self._lock.release() - - def remove(self, connection): - self._lock.acquire() - try: - try: - host = self._connmap[connection] - except KeyError: - pass - else: - del self._connmap[connection] - del self._readymap[connection] - self._hostmap[host].remove(connection) - if not self._hostmap[host]: del self._hostmap[host] - finally: - self._lock.release() - - def set_ready(self, connection, ready): - try: self._readymap[connection] = ready - except KeyError: pass - - def get_ready_conn(self, host): - conn = None - self._lock.acquire() - try: - if self._hostmap.has_key(host): - for c in self._hostmap[host]: - if self._readymap[c]: - self._readymap[c] = 0 - conn = c - break - finally: - self._lock.release() - return conn - - def get_all(self, host=None): - if host: - return list(self._hostmap.get(host, [])) - else: - return dict(self._hostmap) - -class KeepAliveHandler: - def __init__(self): - self._cm = ConnectionManager() - - #### Connection Management - def open_connections(self): - """return a list of connected hosts and the number of connections - to each. [('foo.com:80', 2), ('bar.org', 1)]""" - return [(host, len(li)) for (host, li) in self._cm.get_all().items()] - - def close_connection(self, host): - """close connection(s) to - host is the host:port spec, as in 'www.cnn.com:8080' as passed in. - no error occurs if there is no connection to that host.""" - for h in self._cm.get_all(host): - self._cm.remove(h) - h.close() - - def close_all(self): - """close all open connections""" - for host, conns in self._cm.get_all().items(): - for h in conns: - self._cm.remove(h) - h.close() - - def _request_closed(self, request, host, connection): - """tells us that this request is now closed and the the - connection is ready for another request""" - self._cm.set_ready(connection, 1) - - def _remove_connection(self, host, connection, close=0): - if close: connection.close() - self._cm.remove(connection) - - #### Transaction Execution - def do_open(self, req): - host = req.get_host() - if not host: - raise urllib2.URLError('no host given') - - try: - h = self._cm.get_ready_conn(host) - while h: - r = self._reuse_connection(h, req, host) - - # if this response is non-None, then it worked and we're - # done. Break out, skipping the else block. - if r: break - - # connection is bad - possibly closed by server - # discard it and ask for the next free connection - h.close() - self._cm.remove(h) - h = self._cm.get_ready_conn(host) - else: - # no (working) free connections were found. Create a new one. 
- h = self._get_connection(host) - if DEBUG: DEBUG.info("creating new connection to %s (%d)" % (host, id(h))) - self._cm.add(host, h, 0) - self._start_transaction(h, req) - r = h.getresponse() - except (socket.error, httplib.HTTPException), err: - raise urllib2.URLError(err) - - # if not a persistent connection, don't try to reuse it - if r.will_close: self._cm.remove(h) - - if DEBUG: DEBUG.info("STATUS: %s, %s" % (r.status, r.reason)) - r._handler = self - r._host = host - r._url = req.get_full_url() - r._connection = h - r.code = r.status - r.headers = r.msg - r.msg = r.reason - - if r.status == 200 or not HANDLE_ERRORS: - return r - else: - return self.parent.error('http', req, r, - r.status, r.msg, r.headers) - - def _reuse_connection(self, h, req, host): - """start the transaction with a re-used connection - return a response object (r) upon success or None on failure. - This DOES not close or remove bad connections in cases where - it returns. However, if an unexpected exception occurs, it - will close and remove the connection before re-raising. - """ - try: - self._start_transaction(h, req) - r = h.getresponse() - # note: just because we got something back doesn't mean it - # worked. We'll check the version below, too. - except (socket.error, httplib.HTTPException): - r = None - except: - # adding this block just in case we've missed - # something we will still raise the exception, but - # lets try and close the connection and remove it - # first. We previously got into a nasty loop - # where an exception was uncaught, and so the - # connection stayed open. On the next try, the - # same exception was raised, etc. The tradeoff is - # that it's now possible this call will raise - # a DIFFERENT exception - if DEBUG: DEBUG.error("unexpected exception - closing " + \ - "connection to %s (%d)" % host, id(h)) - self._cm.remove(h) - h.close() - raise - - if r is None or r.version == 9: - # httplib falls back to assuming HTTP 0.9 if it gets a - # bad header back. This is most likely to happen if - # the socket has been closed by the server since we - # last used the connection. 
- if DEBUG: DEBUG.info("failed to re-use connection to %s (%d)" % (host, id(h))) - r = None - else: - if DEBUG: DEBUG.info("re-using connection to %s (%d)" % (host, id(h))) - - return r - - def _start_transaction(self, h, req): - try: - if req.has_data(): - data = req.get_data() - h.putrequest('POST', req.get_selector()) - if not req.headers.has_key('Content-type'): - h.putheader('Content-type', - 'application/x-www-form-urlencoded') - if not req.headers.has_key('Content-length'): - h.putheader('Content-length', '%d' % len(data)) - else: - h.putrequest('GET', req.get_selector()) - except (socket.error, httplib.HTTPException), err: - raise urllib2.URLError(err) - - for args in self.parent.addheaders: - h.putheader(*args) - for k, v in req.headers.items(): - h.putheader(k, v) - h.endheaders() - if req.has_data(): - h.send(data) - - def _get_connection(self, host): - return NotImplementedError - -class HTTPHandler(KeepAliveHandler, urllib2.HTTPHandler): - def __init__(self): - KeepAliveHandler.__init__(self) - - def http_open(self, req): - return self.do_open(req) - - def _get_connection(self, host): - return HTTPConnection(host) - -class HTTPSHandler(KeepAliveHandler, urllib2.HTTPSHandler): - def __init__(self, ssl_factory=None): - KeepAliveHandler.__init__(self) - #if not ssl_factory: - # ssl_factory = sslfactory.get_factory() - #self._ssl_factory = ssl_factory - - def https_open(self, req): - return self.do_open(req) - - def _get_connection(self, host): - # return self._ssl_factory.create_https_connection(host) - return HTTPSConnection(host) - -class HTTPResponse(httplib.HTTPResponse): - # we need to subclass HTTPResponse in order to - # 1) add readline() and readlines() methods - # 2) add close_connection() methods - # 3) add info() and geturl() methods - - # in order to add readline(), read must be modified to deal with a - # buffer. example: readline must read a buffer and then spit back - # one line at a time. The only real alternative is to read one - # BYTE at a time (ick). Once something has been read, it can't be - # put back (ok, maybe it can, but that's even uglier than this), - # so if you THEN do a normal read, you must first take stuff from - # the buffer. - - # the read method wraps the original to accomodate buffering, - # although read() never adds to the buffer. - # Both readline and readlines have been stolen with almost no - # modification from socket.py - - - def __init__(self, sock, debuglevel=0, strict=0, method=None): - if method: # the httplib in python 2.3 uses the method arg - httplib.HTTPResponse.__init__(self, sock, debuglevel, method) - else: # 2.2 doesn't - httplib.HTTPResponse.__init__(self, sock, debuglevel) - self.fileno = sock.fileno - self.code = None - self._rbuf = '' - self._rbufsize = 8096 - self._handler = None # inserted by the handler later - self._host = None # (same) - self._url = None # (same) - self._connection = None # (same) - - _raw_read = httplib.HTTPResponse.read - - def close(self): - if self.fp: - self.fp.close() - self.fp = None - if self._handler: - self._handler._request_closed(self, self._host, - self._connection) - - def close_connection(self): - self._handler._remove_connection(self._host, self._connection, close=1) - self.close() - - def info(self): - return self.headers - - def geturl(self): - return self._url - - def read(self, amt=None): - # the _rbuf test is only in this first if for speed. 
It's not - # logically necessary - if self._rbuf and not amt is None: - L = len(self._rbuf) - if amt > L: - amt -= L - else: - s = self._rbuf[:amt] - self._rbuf = self._rbuf[amt:] - return s - - s = self._rbuf + self._raw_read(amt) - self._rbuf = '' - return s - - def readline(self, limit=-1): - data = "" - i = self._rbuf.find('\n') - while i < 0 and not (0 < limit <= len(self._rbuf)): - new = self._raw_read(self._rbufsize) - if not new: break - i = new.find('\n') - if i >= 0: i = i + len(self._rbuf) - self._rbuf = self._rbuf + new - if i < 0: i = len(self._rbuf) - else: i = i+1 - if 0 <= limit < len(self._rbuf): i = limit - data, self._rbuf = self._rbuf[:i], self._rbuf[i:] - return data - - def readlines(self, sizehint = 0): - total = 0 - list = [] - while 1: - line = self.readline() - if not line: break - list.append(line) - total += len(line) - if sizehint and total >= sizehint: - break - return list - - -class HTTPConnection(httplib.HTTPConnection): - # use the modified response class - response_class = HTTPResponse - -class HTTPSConnection(httplib.HTTPSConnection): - response_class = HTTPResponse - - def connect(self): - import _socket - - # For fixing #503 - sock = _socket.socket(socket.AF_INET, socket.SOCK_STREAM) - sock.connect((self.host, self.port)) - # Change this to certicate paths where you have your SSL client certificates - # to be able to download URLs producing SSL errors. - ssl = socket.ssl(sock, None, None) - - self.sock = httplib.FakeSocket(sock, ssl) - - - -######################################################################### -##### TEST FUNCTIONS -######################################################################### - -def error_handler(url): - global HANDLE_ERRORS - orig = HANDLE_ERRORS - keepalive_handler = HTTPHandler() - opener = urllib2.build_opener(keepalive_handler) - urllib2.install_opener(opener) - pos = {0: 'off', 1: 'on'} - for i in (0, 1): - print " fancy error handling %s (HANDLE_ERRORS = %i)" % (pos[i], i) - HANDLE_ERRORS = i - try: - fo = urllib2.urlopen(url) - foo = fo.read() - fo.close() - try: status, reason = fo.status, fo.reason - except AttributeError: status, reason = None, None - except IOError, e: - print " EXCEPTION: %s" % e - raise - else: - print " status = %s, reason = %s" % (status, reason) - HANDLE_ERRORS = orig - hosts = keepalive_handler.open_connections() - print "open connections:", hosts - keepalive_handler.close_all() - -def continuity(url): - import md5 - format = '%25s: %s' - - # first fetch the file with the normal http handler - opener = urllib2.build_opener() - urllib2.install_opener(opener) - fo = urllib2.urlopen(url) - foo = fo.read() - fo.close() - m = md5.new(foo) - print format % ('normal urllib', m.hexdigest()) - - # now install the keepalive handler and try again - opener = urllib2.build_opener(HTTPHandler()) - urllib2.install_opener(opener) - - fo = urllib2.urlopen(url) - foo = fo.read() - fo.close() - m = md5.new(foo) - print format % ('keepalive read', m.hexdigest()) - - fo = urllib2.urlopen(url) - foo = '' - while 1: - f = fo.readline() - if f: foo = foo + f - else: break - fo.close() - m = md5.new(foo) - print format % ('keepalive readline', m.hexdigest()) - -def comp(N, url): - print ' making %i connections to:\n %s' % (N, url) - - sys.stdout.write(' first using the normal urllib handlers') - # first use normal opener - opener = urllib2.build_opener() - urllib2.install_opener(opener) - t1 = fetch(N, url) - print ' TIME: %.3f s' % t1 - - sys.stdout.write(' now using the keepalive handler ') - # now install 
the keepalive handler and try again - opener = urllib2.build_opener(HTTPHandler()) - urllib2.install_opener(opener) - t2 = fetch(N, url) - print ' TIME: %.3f s' % t2 - print ' improvement factor: %.2f' % (t1/t2, ) - -def fetch(N, url, delay=0): - import time - lens = [] - starttime = time.time() - for i in range(N): - if delay and i > 0: time.sleep(delay) - fo = urllib2.urlopen(url) - foo = fo.read() - fo.close() - lens.append(len(foo)) - diff = time.time() - starttime - - j = 0 - for i in lens[1:]: - j = j + 1 - if not i == lens[0]: - print "WARNING: inconsistent length on read %i: %i" % (j, i) - - return diff - -def test_timeout(url): - global DEBUG - dbbackup = DEBUG - class FakeLogger: - def debug(self, msg, *args): print msg % args - info = warning = error = debug - DEBUG = FakeLogger() - print " fetching the file to establish a connection" - fo = urllib2.urlopen(url) - data1 = fo.read() - fo.close() - - i = 20 - print " waiting %i seconds for the server to close the connection" % i - while i > 0: - sys.stdout.write('\r %2i' % i) - sys.stdout.flush() - time.sleep(1) - i -= 1 - sys.stderr.write('\r') - - print " fetching the file a second time" - fo = urllib2.urlopen(url) - data2 = fo.read() - fo.close() - - if data1 == data2: - print ' data are identical' - else: - print ' ERROR: DATA DIFFER' - - DEBUG = dbbackup - - -def test(url, N=10): - print "checking error hander (do this on a non-200)" - try: error_handler(url) - except IOError, e: - print "exiting - exception will prevent further tests" - sys.exit() - print - print "performing continuity test (making sure stuff isn't corrupted)" - continuity(url) - print - print "performing speed comparison" - comp(N, url) - print - print "performing dropped-connection check" - test_timeout(url) - -if __name__ == '__main__': - import time - import sys - try: - N = int(sys.argv[1]) - url = sys.argv[2] - except: - print "%s " % sys.argv[0] - else: - test(url, N) diff --git a/HarvestMan-lite/harvestman/lib/common/lrucache.py b/HarvestMan-lite/harvestman/lib/common/lrucache.py deleted file mode 100755 index bd8abc4..0000000 --- a/HarvestMan-lite/harvestman/lib/common/lrucache.py +++ /dev/null @@ -1,370 +0,0 @@ -# -- coding: utf-8 -""" -lrucache.py - Length-limited O(1) LRU cache implementation - -Author: Anand B Pillai - -Created Anand B Pillai Jun 26 2007 from ASPN Python Cookbook recipe #252524. - -{http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252524} - -Original code courtesy Josiah Carlson. - -Copyright (C) 2007, Anand B Pillai. -""" -import copy -import cPickle, os, sys -import time -import cStringIO -from threading import Semaphore -from dictcache import DictCache - -class Node(object): - # __slots__ = ['prev', 'next', 'me'] - - def __init__(self, prev, me): - self.prev = prev - self.me = me - self.next = None - - def __copy__(self): - n = Node(self.prev, self.me) - n.next = self.next - - return n - - #def __getstate__(self): - # return self - -class LRU(object): - """ - Implementation of a length-limited O(1) LRU queue. - Built for and used by PyPE: - http://pype.sourceforge.net - Copyright 2003 Josiah Carlson. 
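
    A small usage sketch (added for illustration; not part of the original
    docstring). Every get or set promotes the key to "most recently used",
    and once more than 'count' keys are stored the least recently used key
    is silently dropped:

        l = LRU(2)
        l['a'] = 1; l['b'] = 2
        l['a']                  # touching 'a' makes 'b' the oldest entry
        l['c'] = 3              # 'b' (not 'a') is evicted here
        'b' in l                # -> False
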
- """ - def __init__(self, count, pairs=[]): - self.count = max(count, 1) - self.d = {} - self.first = None - self.last = None - for key, value in pairs: - self[key] = value - - def __copy__(self): - lrucopy = LRU(self.count) - lrucopy.first = copy.copy(self.first) - lrucopy.last = copy.copy(self.last) - lrucopy.d = self.d.copy() - for key,value in self.iteritems(): - lrucopy[key] = value - - return lrucopy - - def __contains__(self, obj): - return obj in self.d - - def __getitem__(self, obj): - - a = self.d[obj].me - self[a[0]] = a[1] - return a[1] - - def __setitem__(self, obj, val): - if obj in self.d: - del self[obj] - nobj = Node(self.last, (obj, val)) - if self.first is None: - self.first = nobj - if self.last: - self.last.next = nobj - self.last = nobj - self.d[obj] = nobj - if len(self.d) > self.count: - if self.first == self.last: - self.first = None - self.last = None - return - a = self.first - if a: - if a.next: - a.next.prev = None - self.first = a.next - a.next = None - try: - del self.d[a.me[0]] - except KeyError: - pass - del a - - def __delitem__(self, obj): - nobj = self.d[obj] - if nobj.prev: - nobj.prev.next = nobj.next - else: - self.first = nobj.next - if nobj.next: - nobj.next.prev = nobj.prev - else: - self.last = nobj.prev - del self.d[obj] - - def __iter__(self): - cur = self.first - while cur != None: - cur2 = cur.next - yield cur.me[1] - cur = cur2 - - def iteritems(self): - cur = self.first - while cur != None: - cur2 = cur.next - yield cur.me - cur = cur2 - - def iterkeys(self): - return iter(self.d) - - def itervalues(self): - for i,j in self.iteritems(): - yield j - - def keys(self): - return self.d.keys() - - def clear(self): - self.d.clear() - - def __len__(self): - return len(self.d) - - -class LRU2(object): - """ - Implementation of a length-limited O(1) LRU queue - with disk caching. - """ - - # This LRU drops off items to a disk dictionary cache - # when older items are dropped. - def __init__(self, count, freq, cachedir='', pairs=[]): - self.count = max(count, 1) - self.d = {} - self.lastmutex = Semaphore(1) - self.first = None - self.last = None - for key, value in pairs: - self[key] = value - # Set the frequency to something like 1/100th of - # the expected dictionary final size to achieve - # best performance. 
- self.diskcache = DictCache(freq, cachedir) - - def __copy__(self): - lrucopy = LRU(self.count) - lrucopy.first = copy.copy(self.first) - lrucopy.last = copy.copy(self.last) - lrucopy.d = self.d.copy() - for key,value in self.iteritems(): - lrucopy[key] = value - - return lrucopy - - def __contains__(self, obj): - return obj in self.d - - def __getitem__(self, obj): - try: - a = self.d[obj].me - self[a[0]] = a[1] - return a[1] - except (KeyError,AttributeError): - return self.diskcache[obj] - - def __setitem__(self, obj, val): - if obj in self.d: - del self[obj] - nobj = Node(self.last, (obj, val)) - if self.first is None: - self.first = nobj - self.lastmutex.acquire() - try: - if self.last: - self.last.next = nobj - self.last = nobj - except: - pass - self.lastmutex.release() - self.d[obj] = nobj - if len(self.d) > self.count: - self.lastmutex.acquire() - try: - if self.first == self.last: - self.first = None - self.last = None - self.lastmutex.release() - return - except: - pass - self.lastmutex.release() - a = self.first - if a: - if a.next: - a.next.prev = None - self.first = a.next - a.next = None - try: - key, val = a.me[0], self.d[a.me[0]] - del self.d[a.me[0]] - del a - self.diskcache[key] = val.me[1] - except (KeyError,AttributeError): - pass - - def __delitem__(self, obj): - nobj = self.d[obj] - if nobj.prev: - nobj.prev.next = nobj.next - else: - self.first = nobj.next - if nobj.next: - nobj.next.prev = nobj.prev - else: - self.last = nobj.prev - del self.d[obj] - - def __iter__(self): - cur = self.first - while cur != None: - cur2 = cur.next - yield cur.me[1] - cur = cur2 - - def iteritems(self): - cur = self.first - while cur != None: - cur2 = cur.next - yield cur.me - cur = cur2 - - def iterkeys(self): - return iter(self.d) - - def itervalues(self): - for i,j in self.iteritems(): - yield j - - def keys(self): - return self.d.keys() - - def clear(self): - self.d.clear() - self.diskcache.clear() - - def __len__(self): - return len(self.d) - - def get_stats(self): - """ Return statistics as a dictionary """ - - return self.diskcache.get_stats() - - def test(self, N): - - # Test to see if the diskcache works. Pass - # the total number of items added to this - # function... - - flag = True - - for x in range(N): - if self[x] == None: - flag = False - break - - return flag - -def test_lru2(): - import random - - n1, n2 = 10000, 5000 - - l=LRU2(n1, 100) - for x in range(n1): - l[x] = x + 1 #urlparser.HarvestManUrlParser('htt://www.python.org/doc/current/tut/tut.html') - - # make use of first n2 random entries - for x in range(n2): - l[random.randint(0,n2)] - - # Add another n2 entries - # This will cause the LRU to drop - # entries and cache old entries. - for x in range(n2): - l[n1+x] = x + 1 #urlparser.HarvestManUrlParser('htt://www.python.org/doc/current/tut/tut.html') # x + 1 - - print l.test(n1+n2) - - print 'Random access test...' - # Try to access random entries - for x in range(n1+n2): - # A random access will take more time since in-mem - # cache will be emptied more often - l[random.randint(0,n1+n2-1)] - - print - print "Disk Hits",l.diskcache.dhits - print "Mem Hits",l.diskcache.mhits - print "Temp dict Hits",l.diskcache.thits - print "Time taken",l.diskcache.t - print 'Hit %=>',100*float(l.diskcache.dhits)/float(n1+n2) - print 'Time per disk cache hit=>',float(l.diskcache.t)/float(l.diskcache.dhits) - print 'Average disk access time=>',float(l.diskcache.t)/float(len(l.diskcache)) - - l.diskcache.clear_counters() - - print 'Sequential access test...' 
- - for x in range(n1+n2): - # A sequential access will be faster since in-mem cache - # will be hit sequentially... - l[x] - - print - - print "Disk Hits",l.diskcache.dhits - print "Mem Hits",l.diskcache.mhits - print "Temp dict Hits",l.diskcache.thits - print "Time taken",l.diskcache.t - print 'Hit %=>',100*float(l.diskcache.dhits)/float(n1+n2) - print 'Time per disk cache hit=>',float(l.diskcache.t)/float(l.diskcache.dhits) - print 'Average disk access time=>',float(l.diskcache.t)/float(len(l.diskcache)) - - l.clear() - -if __name__=="__main__": - test_lru2() - ## l = LRU2(10) -## for x in range(10): -## l[x] = x -## print l.keys() -## print l[3] -## print l[3] -## print l[9] -## print l[9] - -## l[12]=11 -## l[13]=12 -## l[14]=15 -## l[15]=16 -## l[16]=17 -## l[17]=18 -## l[18]=19 -## l[19]=20 -## print l.keys() -## print len(l) -## print l[0] -## print l[1] -## print l[2] -## print copy.copy(l).keys() diff --git a/HarvestMan-lite/harvestman/lib/common/macros.py b/HarvestMan-lite/harvestman/lib/common/macros.py deleted file mode 100755 index 63cd500..0000000 --- a/HarvestMan-lite/harvestman/lib/common/macros.py +++ /dev/null @@ -1,185 +0,0 @@ -# -- coding: utf-8 -""" -macros.py - Defining macro variables for use by other -modules. - -Created Anand B Pillai Oct 5 2007 - -Copyright (C) 2007, Anand B Pillai. -""" - -class HarvestManMacroVariable(type): - """ A metaclass for HarvestMan macro variables """ - - PIDX = 0 - NIDX = 0 - macrodict = {} - - def __new__(cls, name, bases=(), dct={}): - - val = dct.get('value') - if val != None: - dct['index'] = val - - elif dct.get('negate'): - cls.NIDX -= 1 - dct['index'] = cls.NIDX - else: - cls.PIDX += 1 - dct['index'] = cls.PIDX - - item = type.__new__(cls, name, bases, dct) - cls.macrodict[name] = item - return item - - def __init__(cls, name, bases=(), dct={}): - pass - def __str__(self): - return '%d' % (self.index) - - def __eq__(self, number): - # Makes it easy to do things like - # THREAD_IDLE == 0 in code. 
- return self.index == number - - def __lt__(self, number): - - return self.index < number - - def __gt__(self, number): - - return self.index > number - - def __le__(self, number): - - return self.index <= number - - def __ge__(self, number): - - return self.index >= number - - -def DEFINE_MACRO(name, val=None): - """ A factory function for defining macros """ - - if val != None: - globals()[name] = HarvestManMacroVariable(name, dct={'value': val}) - else: - globals()[name] = HarvestManMacroVariable(name) - -def DEFINE_NEGATIVE_MACRO(name, val=None): - """ A factory function for defining macros with negative values """ - - if val != None: - globals()[name] = HarvestManMacroVariable(name, dct={'value': val,'negate': True}) - else: - globals()[name] = HarvestManMacroVariable(name, dct={'negate': True}) - - -def SUCCESS(status): - return (status > 0) - -DEFINE_ERROR_MACRO = DEFINE_NEGATIVE_MACRO - -# Special (predefined) macros -DEFINE_MACRO("HARVESTMAN_OK", 1) -DEFINE_MACRO("HARVESTMAN_FAIL", -1) -DEFINE_MACRO("OPTION_TURN_OFF", 0) -DEFINE_MACRO("OPTION_TURN_ON", 1) -DEFINE_MACRO("CONNECTOR_DATA_MODE_FLUSH", 0) -DEFINE_MACRO("CONNECTOR_DATA_MODE_INMEM", 1) - -# Success macros -DEFINE_MACRO("RESTORE_STATE_OK") -DEFINE_MACRO("SAVE_STATE_OK") -DEFINE_MACRO("CONFIG_FILE_EXISTS") -DEFINE_MACRO("CONFIG_FILE_PARSE_OK") -DEFINE_MACRO("CONFIG_OPTION_SET") -DEFINE_MACRO("CONFIG_ITEM_SKIPPED") -DEFINE_MACRO("CONFIG_OPTION_NOT_DEFINED") -DEFINE_MACRO("CONFIG_ARGUMENT_OK") -DEFINE_MACRO("CONFIG_ARGUMENTS_OK") -DEFINE_MACRO("PROJECT_FILE_EXISTS", 0) -DEFINE_MACRO("CONFIGURE_PROTOCOL_OK") -DEFINE_MACRO("CONNECT_MULTIPART_DOWNLOAD") -DEFINE_MACRO("CONNECT_NO_UPTODATE") -DEFINE_MACRO("CONNECT_YES_DOWNLOADED") -DEFINE_MACRO("DOWNLOAD_YES_WITH_MODIFICATION") -DEFINE_MACRO("DOWNLOAD_NO_UPTODATE") -DEFINE_MACRO("DOWNLOAD_NO_CACHE_SYNCED") -DEFINE_MACRO("DOWNLOAD_YES_OK") -DEFINE_MACRO("URL_PUSHED_TO_POOL") -DEFINE_MACRO("CREATE_DIRECTORY_OK") -DEFINE_MACRO("URL_DOWNLOAD_OK") -DEFINE_MACRO("DATA_ALREADY_PRESENT") -DEFINE_MACRO("FILE_WRITE_OK") -DEFINE_MACRO("WRITE_URL_OK") -DEFINE_MACRO("DUMP_URL_OK") -DEFINE_MACRO("PROJECT_FILE_READ_OK") -DEFINE_MACRO("PROJECT_FILE_WRITE_OK") -DEFINE_MACRO("WRITE_URL_HEADERS_OK") -DEFINE_MACRO("BROWSE_FILE_WRITE_OK") -DEFINE_MACRO("LINK_FILTERED") -DEFINE_MACRO("LINK_NOT_FILTERED") -DEFINE_MACRO("LINK_EMPTY") -DEFINE_MACRO("ANCHOR_LINK_FOUND") -DEFINE_MACRO("SET_STATE_OK") -DEFINE_MACRO("THREAD_MIGRATION_OK") -DEFINE_MACRO("MULTIPART_DOWNLOAD_QUEUED") -DEFINE_MACRO("MULTIPART_DOWNLOAD_COMPLETED") -DEFINE_MACRO("MULTIPART_DOWNLOAD_STATUS_UNKNOWN") -DEFINE_MACRO("HGET_DOWNLOAD_OK") - -# Error macros -DEFINE_ERROR_MACRO("SAVE_STATE_NOT_OK") -DEFINE_ERROR_MACRO("RESTORE_STATE_NOT_OK") -DEFINE_ERROR_MACRO("CONFIG_FILE_DOES_NOT_EXIST") -DEFINE_ERROR_MACRO("CONFIG_FILE_PARSE_ERROR") -DEFINE_ERROR_MACRO("CONFIG_VALUE_EMPTY") -DEFINE_ERROR_MACRO("CONFIG_VALUE_MISMATCH") -DEFINE_ERROR_MACRO("CONFIG_OPTION_NOT_SET") -DEFINE_ERROR_MACRO("CONFIG_OPTION_ASSIGN_ERROR") -DEFINE_ERROR_MACRO("CONFIG_INVALID_ARGUMENT") -DEFINE_ERROR_MACRO("CONFIG_ARGUMENT_ERROR") -DEFINE_ERROR_MACRO("CONNECT_NO_RULES_VIOLATION") -DEFINE_ERROR_MACRO("CONNECT_NO_FILTERED") -DEFINE_ERROR_MACRO("CONNECT_NO_ERROR") -DEFINE_ERROR_MACRO("CONNECT_DOWNLOAD_ABORTED") -DEFINE_ERROR_MACRO("DOWNLOAD_NO_ERROR") -DEFINE_ERROR_MACRO("DOWNLOAD_NO_WRITE_FILTERED") -DEFINE_ERROR_MACRO("DOWNLOAD_NO_RULE_VIOLATION") -DEFINE_ERROR_MACRO("DOWNLOAD_NO_CACHE_SYNC_FAILED") -DEFINE_ERROR_MACRO("CREATE_DIRECTORY_NOT_OK") 
-DEFINE_ERROR_MACRO("URL_DOWNLOAD_FAILED") -DEFINE_ERROR_MACRO("DATA_DOWNLOAD_ERROR") -DEFINE_ERROR_MACRO("DATA_EMPTY_ERROR") -DEFINE_ERROR_MACRO("FILE_WRITE_ERROR") -DEFINE_ERROR_MACRO("WRITE_URL_FAILED") -DEFINE_ERROR_MACRO("NULL_URLOBJECT_ERROR") -DEFINE_ERROR_MACRO("INVALID_ARCHIVE_FORMAT") -DEFINE_ERROR_MACRO("FILE_TRUNCATE_ERROR") -DEFINE_ERROR_MACRO("DUMP_URL_ERROR") -DEFINE_ERROR_MACRO("PROJECT_FILE_READ_ERROR") -DEFINE_ERROR_MACRO("PROJECT_FILE_WRITE_ERROR") -DEFINE_ERROR_MACRO("PROJECT_FILE_REMOVE_ERROR") -DEFINE_ERROR_MACRO("WRITE_URL_HEADERS_ERROR") -DEFINE_ERROR_MACRO("BROWSE_FILE_NOT_FOUND") -DEFINE_ERROR_MACRO("BROWSE_FILE_READ_ERROR") -DEFINE_ERROR_MACRO("BROWSE_FILE_EMPTY") -DEFINE_ERROR_MACRO("BROWSE_FILE_INVALID") -DEFINE_ERROR_MACRO("BROWSE_FILE_WRITE_ERROR") -DEFINE_ERROR_MACRO("ANCHOR_LINK_NOT_FOUND") -DEFINE_ERROR_MACRO("SET_STATE_ERROR") -DEFINE_ERROR_MACRO("THREAD_MIGRATION_ERROR") -DEFINE_ERROR_MACRO("MULTIPART_DOWNLOAD_ERROR") -DEFINE_ERROR_MACRO("HGET_FATAL_ERROR") -DEFINE_ERROR_MACRO("HGET_KEYBOARD_INTERRUPT") -DEFINE_ERROR_MACRO("HGET_DOWNLOAD_ERROR") -DEFINE_ERROR_MACRO("MIRRORS_NOT_FOUND") -DEFINE_ERROR_MACRO("WRITE_URL_FILTERED") -DEFINE_ERROR_MACRO("WRITE_URL_BLOCKED") -DEFINE_ERROR_MACRO("CONTROLLER_EXIT") - -if __name__ == "__main__": - for key, val in HarvestManMacroVariable.macrodict.iteritems(): - print key,'=>',val.index diff --git a/HarvestMan-lite/harvestman/lib/common/netinfo.py b/HarvestMan-lite/harvestman/lib/common/netinfo.py deleted file mode 100755 index 080e85a..0000000 --- a/HarvestMan-lite/harvestman/lib/common/netinfo.py +++ /dev/null @@ -1,184 +0,0 @@ -""" -netinfo - Module summarizing information regarding protocols, -ports, file extensions, regular expressions for analyzing URLs etc -for HarvestMan. - -Created Anand B Pillai Feb 22 2008, moving - content from urlparser.py - -Copyright (C) 2008, Anand B Pillai. -""" - -import re - -URLSEP = '/' # URL separator character -PROTOSEP = '//' # String which separates a protocol string from the rest of URL -DOTDOT = '..' # A convenient name for the .. string -DOT = '.' # A convenient name for the . 
string -PORTSEP = ':' # Protocol separator character, character which separates the protocol - # string from rest of URL -BACKSLASH = '\\' # A convenient name for the backslash character - -# Mapping popular protocols to most widely used port numbers -protocol_map = { "http://" : 80, - "ftp://" : 21, - "https://" : 443, - "file://": 0, - "file:": 0 - } - -# Popular image types file extensions -image_extns = ('.bmp', '.dib', '.dcx', '.emf', '.fpx', '.gif', '.ico', '.img', - '.jp2', '.jpc', '.j2k', '.jpf', '.jpg', '.jpeg', '.jpe', - '.mng', '.pbm', '.pcd', '.pcx', '.pgm', '.png', '.ppm', - '.psd', '.ras', '.rgb', '.tga', '.tif', '.tiff', '.wbmp', - '.xbm', '.xpm') - -# Popular video types file extensions -movie_extns = ('.3gp', '.avi', '.asf','.asx', '.avs', '.bay', - '.bik', '.bsf', '.dat', '.dv' ,'.dvr-ms', 'flc', - '.flv', '.ivf', '.m1v', '.m2ts', '.m2v', '.m4v', - '.mgv', '.mkv', '.mov', '.mp2v', '.mp4', '.mpe', - '.mpeg', '.mpg', '.ogm', '.qt', '.ratDVD', '.rm', - '.smi', '.vob', '.wm', '.wmv', '.xvid' ) - -# Popular audio types file extensions -sound_extns = ('.aac', '.aif', '.aiff', '.aifc', '.aifr', '.amr', - '.ape' ,'.asf', '.au', '.aud', '.aup', '.bwf', - '.cda', '.dct', '.dss', '.dts', '.dvf', '.esu', - '.eta', '.flac', '.gsm', '.jam', '.m4a', '.m4p', - '.mdi', '.mid', '.midi', '.mka', '.mod', '.mp1', '.mp2', - '.mp3', '.mpa', '.mpc', '.mpega', '.msv', '.mus', - '.nrj', '.nwc', '.nwp', '.ogg', '.psb', '.psm', '.ra', - '.ram', '.rel', '.sab', '.shn', '.smf', '.snd', '.speex', - '.tta', '.vox', '.vy3', '.wav', '.wave', '.wma', - '.wpk', '.wv', '.wvc') - -# Most common web page url file extensions -# including dynamic server pages & cgi scripts. -webpage_extns = ('', '.htm', '.html', '.shtm', '.shtml', '.php', - '.php3','.php4','.asp', '.aspx', '.jsp','.psp','.pl', - '.cgi', '.stx', '.cfm', '.cfml', '.cms', '.ars') - - -# Document extensions -document_extns = ('.doc','.rtf','.odt','.odp','.ott','.sxw','.stw', - '.sdw','.vor','.pdf','.ps') - -# Extensions for flash/flash source code/flash action script -flash_extns = ('.swf', '.fla', '.mxml', '.as', '.abc') - -# Web-page extensions which automatically default to directories -# These are special web-page types which are web-pages as well -# as directories. Most common example is the .ars file extension -# of arstechnica.com. -default_directory_extns = ('.ars',) - -# Most common stylesheet url file extensions -stylesheet_extns = ( '.css', ) - -# Regular expression for matching -# urls which contain white spaces -wspacere = re.compile(r'\s+\S+', re.LOCALE|re.UNICODE) - -# Regular expression for anchor tags -anchore = re.compile(r'\#+') - -# jkleven: Regex if we still don't recognize a URL address as HTML. Only -# to be used if we've looked at everything else and URL still isn't -# a known type. This regex is similar to one in pageparser.py but -# we changed a few '*' to '+' to get one or more matches. -# form_re = re.compile(r'[-.:_a-zA-Z0-9]+\?[-.:_a-zA-Z0-9]+=[-.a:_-zA-Z0-9]*', re.UNICODE) - -# Made this more generic and lenient. -form_re = re.compile(r'([^&=\?]*\?)([^&=\?]*=[^&=\?])*', re.UNICODE) - -# Junk chars which cannot be part of valid filenames -junk_chars = ('?','*','"','<','>','!',':','/','\\') - -# Replacement chars -junk_chars_repl = ('',)*len(junk_chars) - -# Dirty chars which need to be hex encoded in URLs (apart from white-space) -# We are assuming that there won't be many idiots who would put a backslash in a URL... 
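# (Added note: each character below maps positionally to its hex escape in
# dirty_chars_repl, e.g. '<' -> '%3C' and '|' -> '%7C'.)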
-dirty_chars = ('<','>','(',')','{','}','[',']','^','`','|') - -# These are replaced with their hex counterparts -dirty_chars_repl = ('%3C','%3E','%28','%29','%7B','%7D','%5B','%5D','%5E','%60','%7C') - -# %xx char replacement regexp -percent_repl = re.compile(r'\%[a-f0-9][a-f0-9]', re.IGNORECASE) -# params_re = re.compile(r'([-.:_a-zA-Z0-9]*=[-.a:_-zA-Z0-9]*)+', re.UNICODE) -# params_re = re.compile(r'([-.:_a-zA-Z0-9]*=[^\&]*)+', re.UNICODE) - -# Regexp which extracts params from query URLs, most generic -params_re = re.compile(r'([^&=\?]*=[^&=\?]*)', re.UNICODE) -# Regular expression for validating a query param group (such as "lang=en") -param_re = re.compile(r'([^&=\?]+=[^&=\?\s]+)', re.UNICODE) - -# ampersand regular expression at URL end -ampersand_re = re.compile(r'\&+$') -# question mark regular expression at URL end -question_re = re.compile(r'\?+$') -# Regular expression for www prefixes at front of a string -www_re = re.compile(r'^www(\d*)\.') -# Regular expression for www prefixes anywhere -www2_re = re.compile(r'www(\d*)\.') - -# List of TLD (top-level domain) name endings from http://data.iana.org/TLD/tlds-alpha-by-domain.txt - -tlds = ['ac', 'ad', 'ae', 'aero', 'af', 'ag', 'ai', 'al', 'am', 'an', 'ao', 'aq', - 'ar', 'arpa', 'as', 'asia', 'at', 'au', 'aw','ax', 'az', 'ba', 'bb', 'bd', - 'be', 'bf', 'bg', 'bh', 'bi', 'biz', 'bj', 'bm', 'bn', 'bo', 'br', 'bs', - 'bt', 'bv', 'bw', 'by', 'bz', 'ca', 'cat', 'cc', 'cd', 'cf', 'cg', 'ch', - 'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'com', 'coop', 'cr', 'cu', 'cv', 'cx', - 'cy', 'cz', 'de', 'dj', 'dk', 'dm', 'do', 'dz', 'ec', 'edu', 'ee', 'eg', - 'er', 'es', 'et', 'eu', 'fi', 'fj', 'fk', 'fm', 'fo', 'fr', 'ga', 'gb', - 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gov', 'gp', 'gq', - 'gr', 'gs', 'gt', 'gu', 'gw', 'gy', 'hk', 'hm', 'hn', 'hr', 'ht', 'hu', - 'id', 'ie', 'il', 'im', 'in', 'info', 'int', 'io', 'iq', 'ir', 'is', - 'it', 'je', 'jm', 'jo', 'jobs', 'jp', 'ke', 'kg', 'kh', 'ki', 'km', 'kn', - 'kp', 'kr', 'kw', 'ky', 'kz', 'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', - 'lt', 'lu', 'lv', 'ly', 'ma', 'mc', 'md', 'me', 'mg', 'mh', 'mil', 'mk', - 'ml', 'mm', 'mn', 'mo', 'mobi', 'mp', 'mq', 'mr', 'ms', 'mt', 'mu', - 'museum', 'mv', 'mw', 'mx', 'my', 'mz', 'na', 'name', 'nc', 'ne', 'net', - 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om', 'org', 'pa', - 'pe', 'pf', 'pg', 'ph', 'pk', 'pl', 'pm', 'pn', 'pr', 'pro', 'ps', 'pt', - 'pw', 'py', 'qa', 're', 'ro', 'rs', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd', - 'se', 'sg', 'sh', 'si', 'sj', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', - 'su', 'sv', 'sy', 'sz', 'tc', 'td', 'tel', 'tf', 'tg', 'th', 'tj', 'tk', - 'tl', 'tm', 'tn', 'to', 'tp', 'tr', 'travel', 'tt', 'tv', 'tw', 'tz', - 'ua', 'ug', 'uk', 'um', 'us', 'uy', 'uz', 'va', 'vc', 've', 'vg', 'vi', - 'vn', 'vu', 'wf', 'ws', 'xn--0zwm56d', 'xn--11b5bs3a9aj6g', 'xn--80akhbyknj4f', - 'xn--9t4b11yi5a', 'xn--deba0ad', 'xn--g6w251d', 'xn--hgbk6aj7f53bba', - 'xn--hlcj6aya9esc7a', 'xn--jxalpdlp', 'xn--kgbechtv', 'xn--zckzah', - 'ye', 'yt', 'yu', 'za', 'zm', 'zw'] - -def get_base_server(server): - """ Return the base server name of the passed - server (domain) name """ - - # If the server name is of the form say bar.foo.com - # or vodka.bar.foo.com, i.e there are more than one - # '.' in the name, then we need to return the - # last string containing a dot in the middle. 
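    # (Added examples, for illustration -- they follow from the comments
    # above and the tlds list:)
    #   get_base_server('vodka.bar.foo.com')         -> 'foo.com'
    #   get_base_server('games.mobileworld.mobi.uk') -> 'mobileworld.mobi.uk'
    #   get_base_server('foo.com')                   -> 'foo.com' (no change)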
- if server.count('.') > 1: - dotstrings = server.split('.') - # now the list is of the form => [vodka, bar, foo, com] - - # Skip the list for skipping over tld domain name endings - # such as .org.uk, .mobi.uk etc. For example, if the - # server is games.mobileworld.mobi.uk, then we - # need to return mobileworld.mobi.uk, not mobi.uk - dotstrings.reverse() - idx = 0 - - for item in dotstrings: - if item.lower() in tlds: - idx += 1 - - return '.'.join(dotstrings[idx::-1]) - else: - # The server is of the form foo.com or just "foo" - # so return it straight away - return server diff --git a/HarvestMan-lite/harvestman/lib/common/optionparser.py b/HarvestMan-lite/harvestman/lib/common/optionparser.py deleted file mode 100755 index 6c6cea7..0000000 --- a/HarvestMan-lite/harvestman/lib/common/optionparser.py +++ /dev/null @@ -1,286 +0,0 @@ -# -- coding: utf-8 -""" -optionparser.py - Generic option parser class. This class -can be used to write code that will parse command line options -for an application by invoking one of the standard Python -library command argument parser modules optparse or -getopt. - -The class first tries to use optparse. It it is not there -(< Python 2.3), it invokes getopt. However, this is -transparent to the application which uses the class. - -The class requires a list with tuple entries of the following -form for each command line option. - -('option_var', 'short=','long=', -'help=', 'meta=','default=', -'type=
links. - # So in order to filer images fully, we need to check the wp.links list also. - # Sample site: http://www.sheppeyseacadets.co.uk/gallery_2.htm - if self._configobj.images: - links += self.wp.images - else: - # Filter any links with image extensions out from links - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.image_extns] - - #for typ, link in links: - # print 'Link=>',link - - self.wp.reset() - - # Filter like that for video, flash & audio - if not self._configobj.movies: - # Filter any links with video extension out from links... - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.movie_extns] - - if not self._configobj.flash: - # Filter any links with flash extension out from links... - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.flash_extns] - - - if not self._configobj.sounds: - # Filter any links with audio extension out from links... - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.sound_extns] - - if not self._configobj.documents: - # Filter any links with popular documents extension out from links... - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.document_extns] - - links = self.offset_links(links) - # print "Filtered links",links - - # Create collection object - coll = HarvestManAutoUrlCollection(url_obj) - - children = [] - for typ, url in links: - - is_cgi, is_php = False, False - - # Not sure of the logical validity of the following 2 lines anymore...! - # This is old code... - if url.find('php?') != -1: is_php = True - if typ == 'form' or is_php: is_cgi = True - - if not url or len(url)==0: continue - # print 'URL=>',url,url_obj.get_full_url() - - try: - child_urlobj = urlparser.HarvestManUrl(url, - typ, - is_cgi, - url_obj) - - # print url, child_urlobj.get_full_url() - - if objects.datamgr.check_exists(child_urlobj): - continue - else: - objects.datamgr.add_url(child_urlobj) - coll.addURL(child_urlobj) - children.append(child_urlobj) - - except urlparser.HarvestManUrlError, e: - error('URL Error:', e) - continue - - # objects.queuemgr.endloop(True) - - # Update the document again... - for child in children: - document.add_child(child) - - if not objects.queuemgr.push((url_obj.priority, coll, document), 'fetcher'): - if self._pushflag: self.buffer.append((url_obj.priority, coll, document)) - - # Update links called here - objects.datamgr.update_links(url_obj, coll) - - - return data - - elif self.url.is_stylesheet() and data: - - # Parse stylesheet to find all contained URLs - # including imported stylesheets, if any. - - # Create a document and keep updating it -this is useful to provide - # information to events... - document = url_obj.make_document(data, [], '', []) - - sp = pageparser.HarvestManCSSParser() - sp.feed(data) - - links = self.offset_links(sp.links) - - # Filter the CSS URLs also w.r.t rules - # Filter any links with image extensions out from links - if not self._configobj.images: - links = [link for link in links if link[link.rfind('.'):].lower() not in netinfo.image_extns] - - children = [] - - # Create collection object - coll = HarvestManAutoUrlCollection(self.url) - - # Add these links to the queue - for url in links: - if not url: continue - - # There is no type information - so look at the - # extension of the URL. 
If ending with .css then - # add as stylesheet type, else as generic type. - - if url.lower().endswith('.css'): - urltyp = URL_TYPE_STYLESHEET - else: - urltyp = URL_TYPE_ANY - - try: - child_urlobj = urlparser.HarvestManUrl(url, - urltyp, - False, - self.url) - - - if objects.datamgr.check_exists(child_urlobj): - continue - else: - objects.datamgr.add_url(child_urlobj) - coll.addURL(child_urlobj) - children.append(child_urlobj) - - except urlparser.HarvestManUrlError: - continue - - # Update the document... - for child in children: - document.add_child(child) - - if not objects.queuemgr.push((self.url.priority, coll, document), 'fetcher'): - if self._pushflag: self.buffer.append((self.url.priority, coll, document)) - - # Update links called here - objects.datamgr.update_links(self.url, coll) - - # Successful return returns data - return data - else: - # Dont do anything - return None - - -class HarvestManUrlDownloader(HarvestManUrlFetcher, HarvestManUrlCrawler): - """ This is a mixin class which does both the jobs of crawling webpages - and download urls """ - - def __init__(self, index, url_obj = None, isThread=True): - HarvestManUrlFetcher.__init__(self, index, url_obj, isThread) - self.set_url_object(url_obj) - - def _initialize(self): - HarvestManUrlFetcher._initialize(self) - HarvestManUrlCrawler._initialize(self) - self._role = 'downloader' - - def set_url_object(self, obj): - HarvestManUrlFetcher.set_url_object(self, obj) - - def set_url_object2(self, obj): - HarvestManUrlCrawler.set_url_object(self, obj) - - def exit_condition(self, caller): - - # Exit condition for single thread case - if caller=='crawler': - return (objects.queuemgr.data_q.qsize()==0) - elif caller=='fetcher': - return (objects.queuemgr.url_q.qsize()==0) - - return False - - def is_exit_condition(self): - - return (self.exit_condition('crawler') and self.exit_condition('fetcher')) - - def action(self): - - if self._isThread: - self._loops = 0 - - while not self._endflag: - obj = objects.queuemgr.get_url_data("downloader") - if not obj: continue - - self.set_url_object(obj) - - self.process_url() - self.crawl_url() - - self._loops += 1 - self.sleep() - else: - while True: - self.process_url() - - obj = objects.queuemgr.get_url_data( "crawler" ) - if obj: self.set_url_object2(obj) - - if self.url.is_webpage(): - self.crawl_url() - - obj = objects.queuemgr.get_url_data("fetcher" ) - self.set_url_object(obj) - - if self.is_exit_condition(): - break - - def process_url(self): - - # First process urls using fetcher's algorithm - HarvestManUrlFetcher.process_url(self) - - def crawl_url(self): - HarvestManUrlCrawler.crawl_url(self) - - - diff --git a/HarvestMan-lite/harvestman/lib/datamgr.py b/HarvestMan-lite/harvestman/lib/datamgr.py deleted file mode 100755 index 216a035..0000000 --- a/HarvestMan-lite/harvestman/lib/datamgr.py +++ /dev/null @@ -1,1383 +0,0 @@ -# -- coding: utf-8 -""" datamgr.py - Data manager module for HarvestMan. - This module is part of the HarvestMan program. - - Author: Anand B Pillai - - Oct 13 2006 Anand Removed data lock since it is not required - Python GIL - automatically locks byte operations. - - Feb 2 2007 Anand Re-added function parse_style_sheet which went missing. - - Feb 26 2007 Anand Fixed bug in check_duplicate_download for stylesheets. - Also rewrote logic. - - Mar 05 2007 Anand Added method get_last_modified_time_and_data to support - server-side cache checking using HTTP 304. Fixed a small - bug in css url handling. - Apr 19 2007 Anand Made to work with URL collections. 
Moved url mapping - dictionary here. Moved CSS parsing logic to pageparser - module. - Feb 13 2008 Anand Replaced URL dictionary with disk caching binary search - tree. Other changes done later -> Got rid of many - redundant lists which were wasting memory. Need to trim - this further. - - Feb 14 2008 Anand Many changes. Replaced/removed datastructures. Merged - cache updating functions. Details in doc/Datastructures.txt . - - April 4 2008 Anand Added update_url method and corresponding update method - in bst.py to update state of URLs after download. Added - statement to print broken links information at end. - - Jan 13 2008 Anand Better check for thread download in download_url method. - Added method 'parseable' in urlparser.py for the same. - - Copyright (C) 2004 Anand B Pillai. - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -import os, sys -import shutil -import time -import math -import re -import sha -import copy -import random -import shelve -import tarfile -import zlib - -import threading -# Utils -from harvestman.lib import utils -from harvestman.lib import urlparser - -from harvestman.lib.mirrors import HarvestManMirrorManager -from harvestman.lib.db import HarvestManDbManager - -from harvestman.lib.urlthread import HarvestManUrlThreadPool -from harvestman.lib.connector import * -from harvestman.lib.methodwrapper import MethodWrapperMetaClass - -from harvestman.lib.common.common import * -from harvestman.lib.common.macros import * -from harvestman.lib.common.bst import BST -from harvestman.lib.common.pydblite import Base - - -# Defining pluggable functions -__plugins__ = { 'download_url_plugin': 'HarvestManDataManager:download_url', - 'post_download_setup_plugin': 'HarvestManDataManager:post_download_setup', - 'print_project_info_plugin': 'HarvestManDataManager:print_project_info', - 'dump_url_tree_plugin': 'HarvestManDataManager:dump_url_tree'} - -# Defining functions with callbacks -__callbacks__ = { 'download_url_callback': 'HarvestManDataManager:download_url', - 'post_download_setup_callback' : 'HarvestManDataManager:post_download_setup' } - -class HarvestManDataManager(object): - """ The data manager cum indexer class """ - - # For supporting callbacks - __metaclass__ = MethodWrapperMetaClass - alias = 'datamgr' - - def __init__(self): - self.reset() - - def reset(self): - # URLs which failed with any error - self._numfailed = 0 - # URLs which failed even after a re-download - self._numfailed2 = 0 - # URLs which were retried - self._numretried = 0 - self.cache = None - self.savedfiles = 0 - self.reposfiles = 0 - self.cachefiles = 0 - self.filteredfiles = 0 - # Config object - self._cfg = objects.config - # Dictionary of servers crawled and - # their meta-data. Meta-data is - # a dictionary which currently - # has only one entry. - # i.e accept-ranges. 
- self._serversdict = {} - # byte count - self.bytes = 0L - # saved bytes count - self.savedbytes = 0L - # Redownload flag - self._redownload = False - # Mirror manager - self.mirrormgr = HarvestManMirrorManager.getInstance() - # Condition object for synchronization - self.cond = threading.Condition(threading.Lock()) - self._urldb = None - self.collections = None - - def initialize(self): - """ Do initializations per project """ - - # Url thread group class for multithreaded downloads - if self._cfg.usethreads: - self._urlThreadPool = HarvestManUrlThreadPool() - self._urlThreadPool.spawn_threads() - else: - self._urlThreadPool = None - - # URL database, a BST with disk-caching - self._urldb = BST() - # Collections database, a BST with disk-caching - self.collections = BST() - # For testing, don't set this otherwise we might - # be left with many orphaned .bidx... folders! - if not self._cfg.testing: - self._urldb.set_auto(2) - self.collections.set_auto(2) - - # Load any mirrors - self.mirrormgr.load_mirrors(self._cfg.mirrorfile) - # Set mirror search flag - self.mirrormgr.mirrorsearch = self._cfg.mirrorsearch - - def get_urldb(self): - return self._urldb - - def add_url(self, urlobj): - """ Add urlobject urlobj to the local dictionary """ - - # print 'Adding %s with index %d' % (urlobj.get_full_url(), urlobj.index) - self._urldb.insert(urlobj.index, urlobj) - - def update_url(self, urlobj): - """ Update urlobject urlobj in the local dictionary """ - - # print 'Adding %s with index %d' % (urlobj.get_full_url(), urlobj.index) - self._urldb.update(urlobj.index, urlobj) - - def get_url(self, index): - - # return self._urldict[str(index)] - return self._urldb.lookup(index) - - def get_original_url(self, urlobj): - - # Return the original URL object for - # duplicate URLs. This is useful for - # processing URL objects obtained from - # the collection object, because many - # of them might be duplicate and would - # not have any post-download information - # such a headers etc. - if urlobj.refindex != -1: - return self.get_url(urlobj.refindex) - else: - # Return the same URL object to avoid - # an check on the caller - return urlobj - - def get_proj_cache_filename(self): - """ Return the cache filename for the current project """ - - # Note that this function does not actually build the cache directory. - # Get the cache file path - if self._cfg.projdir and self._cfg.project: - cachedir = os.path.join(self._cfg.projdir, "hm-cache") - cachefilename = os.path.join(cachedir, 'cache') - - return cachefilename - else: - return '' - - def get_proj_cache_directory(self): - """ Return the cache directory for the current project """ - - # Note that this function does not actually build the cache directory. 
- # Get the cache file path - if self._cfg.projdir and self._cfg.project: - return os.path.join(self._cfg.projdir, "hm-cache") - else: - return '' - - def get_server_dictionary(self): - return self._serversdict - - def supports_range_requests(self, urlobj): - """ Check whether the given url object - supports range requests """ - - # Look up its server in the dictionary - server = urlobj.get_full_domain() - if server in self._serversdict: - d = self._serversdict[server] - return d.get('accept-ranges', False) - - return False - - def read_project_cache(self): - """ Try to read the project cache file """ - - # Get cache filename - info('Reading Project Cache...') - cachereader = utils.HarvestManCacheReaderWriter(self.get_proj_cache_directory()) - obj, found = cachereader.read_project_cache() - self._cfg.cachefound = found - self.cache = obj - if not found: - # Fresh cache - create structure... - self.cache.create('url','last_modified','etag', 'updated','location','checksum', - 'content_length','data','headers') - - # Create an index on URL - self.cache.create_index('url') - else: - pass - - def write_file_from_cache(self, urlobj): - """ Write file from url cache. This - works only if the cache dictionary of this - url has a key named 'data' """ - - ret = False - - # print 'Inside write_file_from_cache...' - url = urlobj.get_full_url() - content = self.cache._url[url] - - if len(content): - # Value itself is a dictionary - item = content[0] - if not item.has_key('data'): - return ret - else: - urldata = item['data'] - if urldata: - fileloc = item['location'] - # Write file - extrainfo("Updating file from cache=>", fileloc) - try: - if SUCCESS(self.create_local_directory(os.path.dirname(fileloc))): - f=open(fileloc, 'wb') - f.write(zlib.decompress(urldata)) - f.close() - ret = True - except (IOError, zlib.error), e: - error("Error:",e) - - return ret - - def update_cache_for_url(self, urlobj, filename, urldata, contentlen, lastmodified, tag): - """ Method to update the cache information for the URL 'url' - associated to file 'filename' on the disk """ - - # if page caching is disabled, skip this... - if not objects.config.pagecache: - return - - url = urlobj.get_full_url() - if urldata: - csum = sha.new(urldata).hexdigest() - else: - csum = '' - - # Update all cache keys - content = self.cache._url[url] - if content: - rec = content[0] - self.cache.update(rec, checksum=csum, location=filename,content_length=contentlen, - last_modified=lastmodified,etag=tag, updated=True) - if self._cfg.datacache: - self.cache.update(rec,data=zlib.compress(urldata)) - else: - # Insert as new values - if self._cfg.datacache: - self.cache.insert(url=url, checksum=csum, location=filename,content_length=contentlen,last_modified=lastmodified, - etag=tag, updated=True,data=zlib.compress(urldata)) - else: - self.cache.insert(url=url, checksum=csum, location=filename,content_length=contentlen, last_modified=lastmodified, - etag=tag, updated=True) - - - def get_url_cache_data(self, urlobj): - """ Get cached data for the URL from disk """ - - # This is returned as Unix time, i.e number of - # seconds since Epoch. - - # This will be called from connector to avoid downloading - # URL data using HTTP 304. However, we support this only - # if we have data for the URL. 
- if (not self._cfg.pagecache) or (not self._cfg.datacache): - return '' - - url = urlobj.get_full_url() - - content = self.cache._url[url] - if content: - item = content[0] - # Check if we have the data for the URL - data = item.get('data','') - if data: - try: - return zlib.decompress(data) - except zlib.error, e: - error('Error:',e) - return '' - - return '' - - def get_last_modified_time(self, urlobj): - """ Return last-modified-time and data of the given URL if it - was found in the cache """ - - # This is returned as Unix time, i.e number of - # seconds since Epoch. - - # This will be called from connector to avoid downloading - # URL data using HTTP 304. - if (not self._cfg.pagecache): - return '' - - url = urlobj.get_full_url() - - content = self.cache._url[url] - if content: - return content[0].get('last_modified', '') - else: - return '' - - def get_etag(self, urlobj): - """ Return the etag of the given URL if it was found in the cache """ - - # This will be called from connector to avoid downloading - # URL data using HTTP 304. - if (not self._cfg.pagecache): - return '' - - url = urlobj.get_full_url() - - content = self.cache._url[url] - if content: - return content[0].get('etag', '') - else: - return '' - - def is_url_cache_uptodate(self, urlobj, filename, urldata, contentlen=0, last_modified=0, etag=''): - """ Check with project cache and find out if the - content needs update """ - - # If page caching is not enabled, return False - # straightaway! - - # print 'Inside is_url_cache_uptodate...' - - if not self._cfg.pagecache: - return (False, False) - - # Return True if cache is uptodate(no update needed) - # and False if cache is out-of-date(update needed) - # NOTE: We are using an comparison of the sha checksum of - # the file's data with the sha checksum of the cache file. - - # Assume that cache is not uptodate apriori - uptodate, fileverified = False, False - - url = urlobj.get_full_url() - content = self.cache._url[url] - - if content: - cachekey = content[0] - cachekey['updated']=False - - fileloc = cachekey['location'] - if os.path.exists(fileloc) and os.path.abspath(fileloc) == os.path.abspath(filename): - fileverified=True - - # Use a cascading logic - if last_modified is available use it first - if last_modified: - if cachekey['last_modified']: - # Get current modified time - cmt = cachekey['last_modified'] - # print cmt,'=>',lmt - # If the latest page has a modified time greater than this - # page is out of date, otherwise it is uptodate - if last_modified<=cmt: - uptodate=True - - # Else if etag is available use it... - elif etag: - if cachekey['etag']: - tag = cachekey['etag'] - if etag==tag: - uptodate = True - # Finally use a checksum of actual data if everything else fails - elif urldata: - if cachekey['checksum']: - cachesha = cachekey['checksum'] - digest = sha.new(urldata).hexdigest() - - if cachesha == digest: - uptodate=True - - if not uptodate: - # Modified this logic - Anand Jan 10 06 - self.update_cache_for_url(urlobj, filename, urldata, contentlen, last_modified, etag) - - return (uptodate, fileverified) - - def conditional_cache_set(self): - """ A utility function to conditionally enable/disable - the cache mechanism """ - - # If already page cache is disabled, do not do anything - if not self._cfg.pagecache: - return - - # If the cache file exists for this project, disable - # cache, else enable it. 
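The freshness test in is_url_cache_uptodate above cascades through three pieces of evidence in order: the Last-Modified time, then the ETag, then a checksum of the downloaded data. A minimal standalone sketch of that ordering (hashlib stands in for the older sha module, and the sample values are invented):
```
import hashlib

def is_uptodate(cache_entry, last_modified=None, etag=None, data=None):
    # cache_entry mirrors the cached fields used above:
    # 'last_modified', 'etag' and 'checksum'
    uptodate = False
    if last_modified:
        if cache_entry.get('last_modified'):
            # cached copy is current if the server copy is not newer
            uptodate = last_modified <= cache_entry['last_modified']
    elif etag:
        if cache_entry.get('etag'):
            uptodate = etag == cache_entry['etag']
    elif data:
        if cache_entry.get('checksum'):
            uptodate = hashlib.sha1(data).hexdigest() == cache_entry['checksum']
    return uptodate

entry = {'last_modified': 1199145600, 'etag': '"abc123"', 'checksum': ''}
print(is_uptodate(entry, last_modified=1199145599))   # True - not newer than cache
```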
- cachefilename = self.get_proj_cache_filename() - - if os.path.exists(cachefilename) and os.path.getsize(cachefilename): - self._cfg.writecache = False - else: - self._cfg.writecache = True - - def post_download_setup(self): - """ Actions to perform after project is complete """ - - # Loop through URL db, one by one and then for those - # URLs which were downloaded but did not succeed, try again. - # But make sure we don't download links which were not-modified - # on server-side (HTTP 304) and hence were skipped. - failed = [] - # Broken links (404) - nbroken = 0 - - for node in self._urldb.preorder(): - urlobj = node.get() - # print 'URL=>',urlobj.get_full_url() - - if urlobj.status == 404: - # print 'BROKEN', urlobj.get_full_url() - nbroken += 1 - elif urlobj.qstatus == urlparser.URL_DONE_DOWNLOAD and \ - urlobj.status != 0 and urlobj.status != 304: - failed.append(urlobj) - - self._numfailed = len(failed) - # print 'BROKEN=>', nbroken - - if self._cfg.retryfailed: - info(' ') - - # try downloading again - if self._numfailed: - info('Redownloading failed links...',) - self._redownload=True - - for urlobj in failed: - if urlobj.fatal or urlobj.starturl: continue - extrainfo('Re-downloading',urlobj.get_full_url()) - self._numretried += 1 - self.thread_download(urlobj) - - # Wait for the downloads to complete... - if self._numretried: - extrainfo("Waiting for the re-downloads to complete...") - self._urlThreadPool.wait(10.0, self._cfg.timeout) - - worked = 0 - # Let us calculate the failed rate again... - for urlobj in failed: - if urlobj.status == 0: - # Download was done - worked += 1 - - self._numfailed2 = self._numfailed - worked - - # Stop the url thread pool - # Stop worker threads - self._urlThreadPool.stop_all_threads() - - # bugfix: Moved the time calculation code here. - t2=time.time() - - self._cfg.endtime = t2 - - # Write cache file - if self._cfg.pagecache and self._cfg.writecache: - cachewriter = utils.HarvestManCacheReaderWriter(self.get_proj_cache_directory()) - self.add_headers_to_cache() - cachewriter.write_project_cache(self.cache) - - # If url header dump is enabled, dump it - if self._cfg.urlheaders: - self.dump_headers() - - if self._cfg.localise: - self.localise_links() - - # Write archive file... 
- if self._cfg.archive: - self.archive_project() - - # dump url tree (dependency tree) to a file - if self._cfg.urltreefile: - self.dump_urltree() - - if not self._cfg.project: return - - nlinks = self._urldb.size - # print stats of the project - nservers, ndirs, nfiltered = objects.rulesmgr.get_stats() - nfailed = self._numfailed - numstillfailed = self._numfailed2 - - numfiles = self.savedfiles - numfilesinrepos = self.reposfiles - numfilesincache = self.cachefiles - - numretried = self._numretried - - fetchtime = self._cfg.endtime-self._cfg.starttime - - statsd = { 'links' : nlinks, - 'filtered': nfiltered, - 'processed': nlinks - nfiltered, - 'broken': nbroken, - 'extservers' : nservers, - 'extdirs' : ndirs, - 'failed' : nfailed, - 'fatal' : numstillfailed, - 'files' : numfiles, - 'filesinrepos' : numfilesinrepos, - 'filesincache' : numfilesincache, - 'retries' : numretried, - 'bytes': self.bytes, - 'fetchtime' : fetchtime, - } - - self.print_project_info(statsd) - objects.eventmgr.raise_event('post_crawl_complete', None) - - def check_exists(self, urlobj): - - # Check if this URL object exits (is a duplicate) - return self._urldb.lookup(urlobj.index) - - def update_bytes(self, count): - """ Update the global byte count """ - - self.bytes += count - - def update_saved_bytes(self, count): - """ Update the saved byte count """ - - self.savedbytes += count - - def update_file_stats(self, urlObject, status): - """ Add the passed information to the saved file list """ - - if not urlObject: return NULL_URLOBJECT_ERROR - - filename = urlObject.get_full_filename() - - if status == DOWNLOAD_YES_OK: - self.savedfiles += 1 - elif status == DOWNLOAD_NO_UPTODATE: - self.reposfiles += 1 - elif status == DOWNLOAD_NO_CACHE_SYNCED: - self.cachefiles += 1 - elif status == DOWNLOAD_NO_WRITE_FILTERED: - self.filteredfiles += 1 - - return HARVESTMAN_OK - - def update_links(self, source, collection): - """ Update the links dictionary for this collection """ - - self.collections.insert(source.index, collection) - - def thread_download(self, url): - """ Schedule download of this web document in a separate thread """ - - # Add this task to the url thread pool - if self._urlThreadPool: - url.qstatus = urlparser.URL_QUEUED - self._urlThreadPool.push(url) - - def has_download_threads(self): - """ Return true if there are any download sub-threads - running, else return false """ - - if self._urlThreadPool: - num_threads = self._urlThreadPool.has_busy_threads() - if num_threads: - return True - - return False - - def last_download_thread_report_time(self): - """ Get the time stamp of the last completed - download (sub) thread """ - - if self._urlThreadPool: - return self._urlThreadPool.last_thread_report_time() - else: - return 0 - - def kill_download_threads(self): - """ Terminate all the download threads """ - - if self._urlThreadPool: - self._urlThreadPool.end_all_threads() - - def create_local_directory(self, directory): - """ Create the directories on the disk named 'directory' """ - - # new in 1.4.5 b1 - No need to create the - # directory for raw saves using the nocrawl - # option. - if self._cfg.rawsave: - return CREATE_DIRECTORY_OK - - try: - # Fix for EIAO bug #491 - # Sometimes, however had we try, certain links - # will be saved as files, whereas they might be - # in fact directories. In such cases, check if this - # is a file, then create a folder of the same name - # and move the file as index.html to it. 
- path = directory - while path: - if os.path.isfile(path): - # Rename file to file.tmp - fname = path - os.rename(fname, fname + '.tmp') - # Now make the directory - os.makedirs(path) - # If successful, move the renamed file as index.html to it - if os.path.isdir(path): - fname = fname + '.tmp' - shutil.move(fname, os.path.join(path, 'index.html')) - - path2 = os.path.dirname(path) - # If we hit the root, break - if path2 == path: break - path = path2 - - if not os.path.isdir(directory): - os.makedirs( directory ) - extrainfo("Created => ", directory) - return CREATE_DIRECTORY_OK - except OSError, e: - error("Error in creating directory", directory) - error(str(e)) - return CREATE_DIRECTORY_NOT_OK - - return CREATE_DIRECTORY_OK - - def download_multipart_url(self, urlobj, clength): - """ Download a URL using HTTP/1.1 multipart download - using range headers """ - - # First add entry of this domain in - # dictionary, if not there - domain = urlobj.get_full_domain() - orig_url = urlobj.get_full_url() - old_urlobj = urlobj.get_original_state() - - domain_changed_a_lot = False - - # If this was a re-directed URL, check if there is a - # considerable change in the domains. If there is, - # there is a very good chance that the original URL - # is redirecting to mirrors, so we can split on - # the original URL and this would automatically - # produce a split-mirror download without us having - # to do any extra work! - if urlobj.redirected and old_urlobj != None: - old_domain = old_urlobj.get_domain() - if old_domain != domain: - # Check if it is somewhat similar - # if domain.find(old_domain) == -1: - domain_changed_a_lot = True - - try: - self._serversdict[domain] - except KeyError: - self._serversdict[domain] = {'accept-ranges': True} - - if self.mirrormgr.mirrors_available(urlobj): - return self.mirrormgr.download_multipart_url(urlobj, clength, self._cfg.numparts, self._urlThreadPool) - else: - if domain_changed_a_lot: - urlobj = old_urlobj - # Set a flag to indicate this - urlobj.redirected_old = True - - parts = self._cfg.numparts - # Calculate size of each piece - piecesz = clength/parts - - # Calculate size of each piece - pcsizes = [piecesz]*parts - # For last URL add the reminder - pcsizes[-1] += clength % parts - # Create a URL object for each and set range - urlobjects = [] - for x in range(parts): - urlobjects.append(copy.copy(urlobj)) - - prev = 0 - for x in range(parts): - curr = pcsizes[x] - next = curr + prev - urlobject = urlobjects[x] - # Set mirror_url attribute - urlobject.mirror_url = urlobj - urlobject.trymultipart = True - urlobject.clength = clength - urlobject.range = (prev, next-1) - urlobject.mindex = x - prev = next - self._urlThreadPool.push(urlobject) - - # Push this URL objects to the pool - return URL_PUSHED_TO_POOL - - def download_url(self, caller, url): - - no_threads = (not self._cfg.usethreads) or \ - url.parseable() - - data="" - if no_threads: - # This call will block if we exceed the number of connections - url.qstatus = urlparser.URL_QUEUED - conn = objects.connfactory.create_connector() - - # Set status to queued - url.qstatus = urlparser.URL_IN_QUEUE - res = conn.save_url( url ) - - objects.connfactory.remove_connector(conn) - - filename = url.get_full_filename() - if res != CONNECT_NO_ERROR: - filename = url.get_full_filename() - - self.update_file_stats( url, res ) - - if res==DOWNLOAD_YES_OK: - info("Saved",filename) - - if url.is_webpage(): - if self._cfg.datamode==CONNECTOR_DATA_MODE_INMEM: - data = conn.get_data() - elif os.path.isfile(filename): 
- # Need to read data from the file... - data = open(filename, 'rb').read() - - else: - fetchurl = url.get_full_url() - extrainfo( "Failed to download url", fetchurl) - - self._urldb.update(url.index, url) - - else: - # Set status to queued - self.thread_download( url ) - - return data - - def clean_up(self): - """ Purge data for a project by cleaning up - lists, dictionaries and resetting other member items""" - - # Reset byte count - if self._urldb and self._urldb.size: - del self._urldb - if self.collections and self.collections.size: - del self.collections - self.reset() - - def archive_project(self): - """ Archive project files into a tar archive file. - The archive will be further compressed in gz or bz2 - format """ - - extrainfo("Archiving project files...") - # Get project directory - projdir = self._cfg.projdir - # Get archive format - if self._cfg.archformat=='bzip': - format='bz2' - elif self._cfg.archformat=='gzip': - format='gz' - else: - error("Archive Error: Archive format not recognized") - return INVALID_ARCHIVE_FORMAT - - # Create tarfile name - ptarf = os.path.join(self._cfg.basedir, "".join((self._cfg.project,'.tar.',format))) - cwd = os.getcwd() - os.chdir(self._cfg.basedir) - - # Create tarfile object - tf = tarfile.open(ptarf,'w:'+format) - # Projdir base name - pbname = os.path.basename(projdir) - - # Add directories - for item in os.listdir(projdir): - # Skip cache directory, if any - if item=='hm-cache': - continue - # Add directory - fullpath = os.path.join(projdir,item) - if os.path.isdir(fullpath): - tf.add(os.path.join(pbname,item)) - # Dump the tarfile - tf.close() - - os.chdir(cwd) - # Check whether writing was done - if os.path.isfile(ptarf): - extrainfo("Wrote archive file",ptarf) - return FILE_WRITE_OK - else: - error("Error in writing archive file",ptarf) - return FILE_WRITE_ERROR - - def add_headers_to_cache(self): - """ Add original URL headers of urls downloaded - as an entry to the cache file """ - - # Navigate in pre-order, i.e in the order of insertion... - for node in self.collections.preorder(): - coll = node.get() - - # Get list of links for this collection - for urlobjidx in coll.getAllURLs(): - urlobj = self.get_url(urlobjidx) - if urlobj==None: continue - - url = urlobj.get_full_url() - # Get headers - headers = urlobj.get_url_content_info() - - if headers: - content = self.cache._url[url] - if content: - urldict = content[0] - urldict['headers'] = headers - - - def dump_headers(self): - """ Dump the headers of the web pages - downloaded, into a DBM file """ - - # print dbmfile - extrainfo("Writing url headers database") - - headersdict = {} - for node in self.collections.preorder(): - coll = node.get() - - for urlobjidx in coll.getAllURLs(): - urlobj = self.get_url(urlobjidx) - - if urlobj: - url = urlobj.get_full_url() - # Get headers - headers = urlobj.get_url_content_info() - if headers: - headersdict[url] = str(headers) - - cache = utils.HarvestManCacheReaderWriter(self.get_proj_cache_directory()) - return cache.write_url_headers(headersdict) - - def localise_links(self): - """ Localise all links (urls) of the downloaded html pages """ - - # Dont confuse 'localising' with language localization. - # This means just converting the outward (Internet) pointing - # URLs in files to local files. 
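As the comment above says, localising just means rewriting the outward-pointing URLs in a saved page so that they refer to the copies on disk. A toy illustration of the idea (the URL and file name are invented; this is not the HarvestMan code path, which also handles anchors, images and relative/absolute modes as shown below):
```
import re

html = '<a href="http://www.foo.com/bar/about.html">About</a>'
saved = {'http://www.foo.com/bar/about.html': 'about.html'}  # remote URL -> local file

def localise(match):
    url = match.group(1)
    return 'href="%s"' % saved.get(url, url)

print(re.sub(r'href="([^"]+)"', localise, html))
# <a href="about.html">About</a>
```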
- - info('Localising links of downloaded web pages...',) - - count = 0 - localized = [] - - for node in self.collections.preorder(): - coll = node.get() - - sourceurl = self.get_url(coll.getSourceURL()) - childurls = [self.get_url(index) for index in coll.getAllURLs()] - filename = sourceurl.get_full_filename() - - if (not filename in localized) and os.path.exists(filename): - extrainfo('Localizing links for',filename) - if SUCCESS(self.localise_file_links(filename, childurls)): - count += 1 - localized.append(filename) - - info('Localised links of',count,'web pages.') - - def localise_file_links(self, filename, links): - """ Localise links for this file """ - - data='' - - try: - fw=open(filename, 'r+') - data=fw.read() - fw.seek(0) - fw.truncate(0) - except (OSError, IOError),e: - return FILE_TRUNCATE_ERROR - - # Regex1 to replace ( at the end - r1 = re.compile(r'\)+$') - r2 = re.compile(r'\(+$') - - # MOD: Replace any line - basehrefre = re.compile(r'', re.IGNORECASE) - if basehrefre.search(data): - data = re.sub(basehrefre, '', data) - - for u in links: - if not u: continue - - url_object = u - typ = url_object.get_type() - - if url_object.is_image(): - http_str="src" - else: - http_str="href" - - v = url_object.get_original_url() - if v == '/': continue - - # Somehow, some urls seem to have an - # unbalanced parantheses at the end. - # Remove it. Otherwise it will crash - # the regular expressions below. - v = r1.sub('', v) - v2 = r2.sub('', v) - - # Bug fix, dont localize cgi links - if typ != 'base': - if url_object.is_cgi(): - continue - - fullfilename = os.path.abspath( url_object.get_full_filename() ) - urlfilename='' - - # Modification: localisation w.r.t relative pathnames - if self._cfg.localise==2: - urlfilename = url_object.get_relative_filename() - elif self._cfg.localise==1: - urlfilename = fullfilename - - # replace '\\' with '/' - urlfilename = urlfilename.replace('\\','/') - - newurl='' - oldurl='' - - # If we cannot get the filenames, replace - # relative url paths will full url paths so that - # the user can connect to them. 
- if not os.path.exists(fullfilename): - # for relative links, replace it with the - # full url path - fullurlpath = url_object.get_full_url_sans_port() - newurl = "href=\"" + fullurlpath + "\"" - else: - # replace url with urlfilename - if typ == 'anchor': - anchor_part = url_object.get_anchor() - urlfilename = "".join((urlfilename, anchor_part)) - # v = "".join((v, anchor_part)) - - if self._cfg.localise == 1: - newurl= "".join((http_str, "=\"", "file://", urlfilename, "\"")) - else: - newurl= "".join((http_str, "=\"", urlfilename, "\"")) - - else: - newurl="".join((http_str,"=\"","\"")) - - if typ != 'img': - oldurl = "".join((http_str, "=\"", v, "\"")) - try: - oldurlre = re.compile("".join((http_str,'=','\\"?',v,'\\"?'))) - except Exception, e: - debug("Error:",str(e)) - continue - - # Get the location of the link in the file - try: - if oldurl != newurl: - data = re.sub(oldurlre, newurl, data,1) - except Exception, e: - debug("Error:",str(e)) - continue - else: - try: - oldurlre1 = "".join((http_str,'=','\\"?',v,'\\"?')) - oldurlre2 = "".join(('href','=','\\"?',v,'\\"?')) - oldurlre = re.compile("".join(('(',oldurlre1,'|',oldurlre2,')'))) - except Exception, e: - debug("Error:",str(e)) - continue - - http_strs=('href','src') - - for item in http_strs: - try: - oldurl = "".join((item, "=\"", v, "\"")) - if oldurl != newurl: - data = re.sub(oldurlre, newurl, data,1) - except: - pass - - try: - fw.write(data) - fw.close() - except IOError, e: - logconsole(e) - return HARVESTMAN_FAIL - - return HARVESTMAN_OK - - def print_project_info(self, statsd): - """ Print project information """ - - nlinks = statsd['links'] - nservers = statsd['extservers'] + 1 - nfiles = statsd['files'] - ndirs = statsd['extdirs'] + 1 - numfailed = statsd['failed'] - nretried = statsd['retries'] - fatal = statsd['fatal'] - fetchtime = statsd['fetchtime'] - nfilesincache = statsd['filesincache'] - nfilesinrepos = statsd['filesinrepos'] - nbroken = statsd['broken'] - - # Bug fix, download time to be calculated - # precisely... 
- - dnldtime = fetchtime - - strings = [('link', nlinks), ('server', nservers), - ('file', nfiles), ('file', nfilesinrepos), - ('directory', ndirs), ('link', numfailed), ('link', fatal), - ('link', nretried), ('file', nfilesincache), ('link', nbroken) ] - - fns = map(plural, strings) - info(' ') - - bytes = self.bytes - savedbytes = self.savedbytes - - ratespec='KB/sec' - if bytes and dnldtime: - bps = float(bytes/dnldtime)/1024.0 - if bps<1.0: - bps *= 1000.0 - ratespec='bytes/sec' - bps = '%.2f' % bps - else: - bps = '0.0' - - fetchtime = float((math.modf(fetchtime*100.0)[1])/100.0) - - if self._cfg.simulate: - info("HarvestMan crawl simulation of",self._cfg.project,"completed in",fetchtime,"seconds.") - else: - info('HarvestMan mirror',self._cfg.project,'completed in',fetchtime,'seconds.') - - if nlinks: info(nlinks,fns[0],'scanned in',nservers,fns[1],'.') - else: info('No links parsed.') - if nfiles: info(nfiles,fns[2],'written.') - else:info('No file written.') - - if nfilesinrepos: - info(nfilesinrepos,fns[3],wasOrWere(nfilesinrepos),'already uptodate in the repository for this project and',wasOrWere(nfilesinrepos),'not updated.') - if nfilesincache: - info(nfilesincache,fns[8],wasOrWere(nfilesincache),'updated from the project cache.') - - if nbroken: info(nbroken,fns[9],wasOrWere(nbroken),'were broken.') - if fatal: info(fatal,fns[5],'had fatal errors and failed to download.') - if bytes: info(bytes,' bytes received at the rate of',bps,ratespec,'.') - if savedbytes: info(savedbytes,' bytes were written to disk.\n') - - info('*** Log Completed ***\n') - - # get current time stamp - s=time.localtime() - - tz=(time.tzname)[0] - - format='%b %d %Y '+tz+' %H:%M:%S' - tstamp=time.strftime(format, s) - - if not self._cfg.simulate: - # Write statistics to the crawl database - HarvestManDbManager.add_stats_record(statsd) - logconsole('Done.') - - # No longer writing a stats file... - # Write stats to a stats file - #statsfile = self._cfg.project + '.hst' - #statsfile = os.path.abspath(os.path.join(self._cfg.projdir, statsfile)) - #logconsole('Writing stats file ', statsfile , '...') - # Append to files contents - #sf=open(statsfile, 'a') - - # Write url, file count, links count, time taken, - # files per second, failed file count & time stamp - #infostr='url:'+self._cfg.url+',' - #infostr +='files:'+str(nfiles)+',' - #infostr +='links:'+str(nlinks)+',' - #infostr +='dirs:'+str(ndirs)+',' - #infostr +='failed:'+str(numfailed)+',' - #infostr +='refetched:'+str(nretried)+',' - #infostr +='fatal:'+str(fatal)+',' - #infostr +='elapsed:'+str(fetchtime)+',' - #infostr +='fps:'+str(fps)+',' - #infostr +='kbps:'+str(bps)+',' - #infostr +='timestamp:'+tstamp - #infostr +='\n' - - #sf.write(infostr) - #sf.close() - - def dump_urltree(self): - """ Dump url tree to a file """ - - # This creats an html file with - # each url and its children below - # it. Each url is a hyperlink to - # itself on the net if the file - # is an html file. 
- - # urltreefile is /urls.html - urlfile = os.path.join(self._cfg.projdir, 'urltree.html') - - try: - if os.path.exists(urlfile): - os.remove(urlfile) - except OSError, e: - logconsole(e) - - info('Dumping url tree to file', urlfile) - fextn = ((os.path.splitext(urlfile))[1]).lower() - - try: - f=open(urlfile, 'w') - if fextn in ('', '.txt'): - self.dump_urltree_textmode(f) - elif fextn in ('.htm', '.html'): - self.dump_urltree_htmlmode(f) - f.close() - except Exception, e: - logconsole(e) - return DUMP_URL_ERROR - - debug("Done.") - - return DUMP_URL_OK - - def dump_urltree_textmode(self, stream): - """ Dump urls in text mode """ - - for node in self.collections.preorder(): - coll = node.get() - - idx = 0 - links = [self.get_url(index) for index in coll.getAllURLs()] - children = [] - - for link in links: - if not link: continue - - # Get base link, only for first - # child url, since base url will - # be same for all child urls. - if idx==0: - children = [] - base_url = link.get_parent_url().get_full_url() - stream.write(base_url + '\n') - - childurl = link.get_full_url() - if childurl and childurl not in children: - stream.write("".join(('\t',childurl,'\n'))) - children.append(childurl) - - idx += 1 - - - def dump_urltree_htmlmode(self, stream): - """ Dump urls in html mode """ - - # Write html header - stream.write('\n') - stream.write('') - stream.write('Url tree generated by HarvestMan - Project %s' - % self._cfg.project) - stream.write('\n') - - stream.write('\n') - - stream.write('

<p>\n') - stream.write('<ol>\n') - - for node in self.collections.preorder(): - coll = node.get() - - idx = 0 - links = [self.get_url(index) for index in coll.getAllURLs()] - - children = [] - for link in links: - if not link: continue - - # Get base link, only for first - # child url, since base url will - # be same for all child urls. - if idx==0: - children = [] - base_url = link.get_parent_url().get_full_url() - stream.write('<li>') - stream.write("".join(('<a href="',base_url,'">',base_url,'</a>'))) - stream.write('</li>\n') - stream.write('<p>\n') - stream.write('<ul>\n') - - childurl = link.get_full_url() - if childurl and childurl not in children: - stream.write('<li>') - stream.write("".join(('<a href="',childurl,'">',childurl,'</a>'))) - stream.write('</li>\n') - children.append(childurl) - - idx += 1 - - - # Close the child list - stream.write('</ul>\n') - stream.write('</p>\n') - - # Close top level list - stream.write('</ol>\n') - stream.write('</p>

\n') - stream.write('\n') - stream.write('\n') - - def get_url_threadpool(self): - """ Return the URL thread-pool object """ - - return self._urlThreadPool - -class HarvestManController(threading.Thread): - """ A controller class for managing exceptional - conditions such as file limits. Right now this - is written with the sole aim of managing file - & time limits, but could get extended in future - releases. """ - - def __init__(self): - self._dmgr = objects.datamgr - self._tq = objects.queuemgr - self._cfg = objects.config - self._exitflag = False - self._starttime = 0 - threading.Thread.__init__(self, None, None, 'HarvestMan Control Class') - - def run(self): - """ Run in a loop looking for - exceptional conditions """ - - while not self._exitflag: - # Wake up every half second and look - # for exceptional conditions - time.sleep(1.0) - if self._cfg.timelimit != -1: - if self._manage_time_limits()==CONTROLLER_EXIT: - break - if self._cfg.maxfiles: - if self._manage_file_limits()==CONTROLLER_EXIT: - break - if self._cfg.maxbytes: - if self._manage_maxbytes_limits()==CONTROLLER_EXIT: - break - - def stop(self): - """ Stop this thread """ - - self._exitflag = True - - def terminator(self): - """ The function which terminates the program - in case of an exceptional condition """ - - # This somehow got deleted in HarvestMan 1.4.5 - self._tq.endloop(True) - - def _manage_time_limits(self): - """ Manage limits on time for the project """ - - t2=time.time() - - timediff = float((math.modf((t2-self._cfg.starttime)*100.0)[1])/100.0) - timemax = self._cfg.timelimit - - if timediff >= timemax -1: - info('Specified time limit of',timemax ,'seconds reached!') - self.terminator() - return CONTROLLER_EXIT - - return HARVESTMAN_OK - - def _manage_file_limits(self): - """ Manage limits on maximum file count """ - - lsaved = self._dmgr.savedfiles - lmax = self._cfg.maxfiles - - if lsaved >= lmax: - info('Specified file limit of',lmax ,'reached!') - self.terminator() - return CONTROLLER_EXIT - - return HARVESTMAN_OK - - def _manage_maxbytes_limits(self): - """ Manage limits on maximum bytes a crawler should download in total per job. """ - - lsaved = self._dmgr.savedbytes - lmax = self._cfg.maxbytes - - # Let us check for a closer hit of 90%... - if (lsaved >=0.90*lmax): - info('Specified maxbytes limit of',lmax ,'reached!') - self.terminator() - return CONTROLLER_EXIT - - return HARVESTMAN_OK - - diff --git a/HarvestMan-lite/harvestman/lib/db.py b/HarvestMan-lite/harvestman/lib/db.py deleted file mode 100755 index 2235121..0000000 --- a/HarvestMan-lite/harvestman/lib/db.py +++ /dev/null @@ -1,133 +0,0 @@ -# -- coding: utf-8 -""" -db.py - Provides HarvestManDbManager class which takes care -of creating and managing the user's crawl database. The -crawl database is an sqlite database created as -$HOME/.harvestman/db/crawls.db where $HOME is the home -directory of the user. The crawls database is updated with -meta-data of every crawl after a crawl is completed. - -Created by Anand B Pillai Mar 26 2008 - -Copyright (C) 2008 Anand B Pillai. 
- -""" - -import os, sys -import time - -from harvestman.lib.common.common import objects, extrainfo, logconsole - -def adapt_datetime(ts): - return time.mktime(ts.timetuple()) - -class HarvestManDbManager(object): - """ Class performing the creation/management of crawl databases """ - - projid = 0 - - @classmethod - def try_import(cls): - try: - import sqlite3 - return sqlite3 - except ImportError, e: - pass - - @classmethod - def create_user_database(cls): - - sqlite3 = cls.try_import() - - if sqlite3 is None: - return - - logconsole("Creating user's crawl database file in %s..." % objects.config.userdbdir) - - dbfile = os.path.join(objects.config.userdbdir, "crawls.db") - conn = sqlite3.connect(dbfile) - c = conn.cursor() - - # Create table for projects - # This line is causing a problem in darwin - # c.execute("drop table if exists projects") - c.execute("""create table projects (id integer primary key autoincrement default 0, time real, name text, url str, config str)""") - # Create table for project statistics - # We are storing the information for - # 1. number of urls scanned - # 2. number of urls processed (fetched/crawled) - # 3. number of URLs which were crawl-filtered - # 4. number of urls failed to fetch - # 5. number of urls with 404 errors - # 6. number of URLs which hit the cache - # 7. number of servers scanned - # 8. number of unique directories scanned - # 9. number of files saved - # 10. Amount of data fetched in bytes - # 11. the total time for the crawl. - - # This line is causing a problem in darwin - # c.execute("drop table project_stats") - c.execute("""create table project_stats (project_id integer primary key, urls integer, procurls integer, filteredurls integer, failedurls integer, brokenurls integer, cacheurls integer, servers integer, directories integer, files integer, data real, duration text)""") - - c.close() - - @classmethod - def add_project_record(cls): - - sqlite3 = cls.try_import() - if sqlite3 is None: - return - - extrainfo('Writing project record to crawls database...') - dbfile = os.path.join(objects.config.userdbdir, "crawls.db") - - # Get the configuration as a pickled string - cfg = objects.config.copy() - - conn = sqlite3.connect(dbfile) - c = conn.cursor() - c.execute("insert into projects (time, name, url, config) values(?,?,?,?)", - (time.time(),cfg.project,cfg.url, repr(cfg))) - conn.commit() - - # Fetch the most recent project id and save it as projid - c.execute("select max(id) from projects") - cls.projid = c.fetchone()[0] - # print 'project id=>',cls.projid - c.close() - extrainfo("Done.") - - @classmethod - def add_stats_record(cls, statsd): - - sqlite3 = cls.try_import() - if sqlite3 is None: - return - - logconsole('Writing project statistics to crawl database...') - dbfile = os.path.join(objects.config.userdbdir, "crawls.db") - conn = sqlite3.connect(dbfile) - c = conn.cursor() - t = (cls.projid, - statsd['links'], - statsd['processed'], - statsd['filtered'], - statsd['fatal'], - statsd['broken'], - statsd['filesinrepos'], - statsd['extservers'] + 1, - statsd['extdirs'] + 1, - statsd['files'], - statsd['bytes'], - '%.2f' % statsd['fetchtime']) - - c.execute("insert into project_stats values(?,?,?,?,?,?,?,?,?,?,?,?)", t) - conn.commit() - c.close() - pass - -if __name__ == "__main__": - HarvestManDbManager.create_user_database() - pass - diff --git a/HarvestMan-lite/harvestman/lib/document.py b/HarvestMan-lite/harvestman/lib/document.py deleted file mode 100755 index 0f982fc..0000000 --- a/HarvestMan-lite/harvestman/lib/document.py 
+++ /dev/null @@ -1,74 +0,0 @@ -# -- coding: utf-8 -""" -document.py - Provides HarvestManDocument class which provides -an abstraction over a webpage with attributes such as URL, -content, child URLs, HTTP headers, lastmodified value and -other attributes. - -Created by Anand B Pillai Feb 26 2008 - -Copyright (C) 2008 Anand B Pillai. -""" - -import bz2 -from harvestman.lib.common.common import * - -class HarvestManDocument(object): - """ Web document class """ - - def __init__(self, urlobj): - # Store only index for conserving memory - self.urlindex = urlobj.index - # Also, list of children is actually list of - # child indices to save memory... - self.children = [] - self.content = '' - self.content_hash = '' - self.headers = {} - # Only valid for webpages - self.description = '' - # Only valid for webpages - self.keywords = [] - # Only valid for webpages - self.title = '' - self.lastmodified = '' - self.etag = '' - #self.httpstatus = '' - #self.httpreason = '' - self.content_type = '' - self.content_encoding = '' - self.error = None - - def get_url(self): - return objects.datamgr.get_url(self.urlindex) - - def set_url(self, urlobj): - self.urlindex = urlobj.index - - def add_child(self, urlobj): - self.children.append(urlobj.index) - - def get_links(self): - # Links are already "normalized" - return [objects.datamgr.get_url(index) for index in self.children] - - def get_content(self): - return self.content - - def set_content(self, data): - self.content = data - - def get_content_hash(self): - return self.content_hash - - def get_zipped_content(self): - # Return the content, gzipped - pass - - def get_bzipped_content(self): - return bz2.compress(self.content) - - - - - diff --git a/HarvestMan-lite/harvestman/lib/event.py b/HarvestMan-lite/harvestman/lib/event.py deleted file mode 100755 index 86f653e..0000000 --- a/HarvestMan-lite/harvestman/lib/event.py +++ /dev/null @@ -1,63 +0,0 @@ -# -- coding: utf-8 -"""event.py - Module defining an event notification framework -associated with the data flow in HarvestMan. - -Created Anand B Pillai Feb 28 2008 - -Copyright (C) 2008 Anand B Pillai. -""" - -from harvestman.lib.common.common import * -from harvestman.lib.common.singleton import Singleton - -class Event(object): - """ Event class for HarvestMan """ - - def __init__(self): - self.name = '' - self.config = objects.config - self.url = None - self.document = None - -class HarvestManEvent(Singleton): - """ Event manager class for HarvestMan """ - - alias = 'eventmgr' - - def __init__(self): - self.events = {} - - def bind(self, event, funktion, *args): - """ Register for a function 'funktion' to be bound to a certain event. - The return value of the function will be used to determine the behaviour - of the original function which raises the event in cases of events - which are called before the original function bound to the event. For - events which are raised after the original function is called, the - behavior of the original function is not changed """ - - # An event is a string, you can bind only one function to an event - # The function should accept a default first argument which is the - # event object. The event object will provide 4 attributes, namely - # the event name, the url associated to the event (should be valid), - # the document associated to the event (could be None) and the configuration - # object of the system. - self.events[event] = funktion - # print self.events - - def raise_event(self, event, url, document=None, *args, **kwargs): - """ Raise a certain event. 
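Going by the bind docstring above, a handler is an ordinary callable whose first argument is the event object carrying name, url, document and config. A hedged sketch of registering one; 'post_crawl_complete' is an event name raised elsewhere in this code, but the registration call site shown is only an assumption:
```
def on_crawl_complete(event, *args, **kwargs):
    # event exposes .name, .url, .document and .config as described in bind()
    print('event %s raised for %s' % (event.name, event.url))

# Registration would go through the event manager singleton, for example:
# objects.eventmgr.bind('post_crawl_complete', on_crawl_complete)
```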
This automatically calls back on any function - registered for the event and returns the return value of that function. This - is an internal method """ - - try: - funktion = self.events[event] - eventobj = Event() - eventobj.name = event - eventobj.url = url - eventobj.document = document - # Other keyword arguments - return funktion(eventobj, *args, **kwargs) - except KeyError: - pass - - diff --git a/HarvestMan-lite/harvestman/lib/filters.py b/HarvestMan-lite/harvestman/lib/filters.py deleted file mode 100755 index 588a88d..0000000 --- a/HarvestMan-lite/harvestman/lib/filters.py +++ /dev/null @@ -1,788 +0,0 @@ -# -- coding: utf-8 -""" -filters.py - Module which holds class definitions for -classes which define filters for filtering out URLs -and web pages based on regualr expression and other kinds -of filters. - - Author: Anand B Pillai - - Modification History - -------------------- - - Jul 23 2008 Anand Creation - Nov 17 2008 Anand Completed URL filters class implementation - and integrated with HarvestMan. - Jan 13 2009 Anand Added text filter class. Modified - junk filter class to follow the filter - class interface. - - Copyright (C) 2003-2008 Anand B Pillai. - -""" -import re -from harvestman.lib.common.common import * - -class HarvestManBaseFilter(object): - """ Base class for all HarvestMan filter classes """ - - def __init__(self): - self.context = None - - def filter(self, url): - raise NotImplementedError - - def make_regex(self, pattern, casing, flags): - - flag = 0 - if not casing: - flag |= re.IGNORECASE - if flags: - flag |= eval(flags) - - return re.compile(pattern, flag) - -class HarvestManUrlFilter(HarvestManBaseFilter): - """ Filter class for filtering out web pages based on the URL path string """ - - def __init__(self, pathfilters=[], extnfilters=[], regexfilters=[]): - # Filter pattern strings - self.regexfilterpatterns = regexfilters - self.pathfilterpatterns = pathfilters - self.extnfilterpatterns = extnfilters - # Intermediate patterns, dictionaries - # with keys 'include' and 'exclude' - self.regexpatterns = [] - self.pathpatterns = { 'include': [], 'exclude': [] } - self.extnpatterns = { 'include': [], 'exclude': [] } - # Actual filters - self.inclfilters = [] - self.exclfilters = [] - self.compile_filters() - - def parse_filter(self, filterstring): - """ Parse a filter pattern string and return a two - tuple of (, ) pattern string - lists """ - - fstr = filterstring - # First replace any ''' with '' - fstr=fstr.replace("'",'') - # regular expressions to include - include=[] - # regular expressions to exclude - exclude=[] - - index=0 - previndex=-1 - fstr += '+' - for c in fstr: - if c in ('+','-'): - previndex=index - index+=1 - - l=fstr.split('+') - - for s in l: - l2=s.split('-') - for x in xrange(len(l2)): - s=l2[x] - if s=='': continue - if x==0: - include.append(s) - else: - exclude.append(s) - - return (include, exclude) - - def create_filter(self, plainstr, extn=False): - """ Create a python regular expression based on - the list of filter strings provided as input """ - - # Final filter string - fstr = '' - s = plainstr - - # First replace any ''' with '' - s=s.replace("'",'') - # Then we remove the asteriks - s=s.replace('*','.*') - fstr = s - - if extn: - fstr = '\.' + fstr + '$' - - return fstr - - def make_path_filter(self, filterstring, casing=0, flags=''): - """ Creates a URL path filter. A URL path is specified - as an include/exclude filter. 
Wildcards are specified by - using asteriks """ - - include, exclude = self.parse_filter(filterstring) - - for pattern in include: - self.pathpatterns['include'].append((self.create_filter(pattern), casing, flags)) - for pattern in exclude: - self.pathpatterns['exclude'].append((self.create_filter(pattern), casing, flags)) - - def make_extn_filter(self, filterstring, casing=0, flags=''): - """ Creates a file extension filter. A file extension filter - is specified by concatenating file extensions with a + or - in - front of them to specify include/exclude respectively """ - - include, exclude = self.parse_filter(filterstring) - - for pattern in include: - self.extnpatterns['include'].append((self.create_filter(pattern, True), casing, flags)) - for pattern in exclude: - self.extnpatterns['exclude'].append((self.create_filter(pattern, True), casing, flags)) - - def make_regex_filter(self, filterstring, casing=0, flags=''): - """ Creates a regular expression filter. This is nothing but a Python - regular expression string which is compiled directly into a regex """ - - # Direct use - no processing required - self.regexpatterns.append((filterstring, casing, flags)) - - def compile_filters(self): - """ Compile all filter strings and create regular - expression objects """ - - for pattern, casing, flags in self.pathfilterpatterns: - self.make_path_filter(pattern, casing, flags) - - for pattern, casing, flags in self.extnfilterpatterns: - self.make_extn_filter(pattern, casing, flags) - - for pattern, casing, flags in self.regexfilterpatterns: - self.make_regex_filter(pattern, casing, flags) - - # Now, compile each to regular expressions and - # append to include & exclude regex filter list - for urlfilter in self.pathpatterns['include'] + self.extnpatterns['include']: - regexp = self.make_regex(urlfilter[0], urlfilter[1], urlfilter[2]) - self.inclfilters.append(regexp) - - for urlfilter in self.pathpatterns['exclude'] + self.extnpatterns['exclude']: - regexp = self.make_regex(urlfilter[0], urlfilter[1], urlfilter[2]) - self.exclfilters.append(regexp) - - for urlfilter in self.regexpatterns: - regexp = self.make_regex(urlfilter[0], urlfilter[1], urlfilter[2]) - self.exclfilters.append(regexp) - - def filter(self, urlobj): - """ Apply all URL filters on the passed URL object 'urlobj'. - Return True if filtered and False if not filtered """ - - # The logic of this is simple - The URL is checked - # against all inclusion filters first, if any. If - # anything matches, then we don't do exclusion filter - # check since inclusion (+) has preference over exclusion (-) - # In that case, False is returned. - - # Otherwise, the URL is checked against all exclusion - # filters and if any match, True is returned. - - # Finally, if none match, False is returned. 
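The precedence spelled out in the comments above (an inclusion match wins, otherwise any exclusion match filters the URL, otherwise it passes) boils down to a few lines. The patterns below are invented for the example and are not HarvestMan's filter-string syntax:
```
import re

include = [re.compile(r'\.html$', re.IGNORECASE)]
exclude = [re.compile(r'/ads/')]

def is_filtered(url):
    if any(f.search(url) for f in include):
        return False   # inclusion (+) has preference over exclusion (-)
    if any(f.search(url) for f in exclude):
        return True    # matched an exclusion filter
    return False       # matched nothing - not filtered

print(is_filtered('http://foo.com/ads/page.html'))   # False: include wins
print(is_filtered('http://foo.com/ads/banner.swf'))  # True: excluded
```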
- - url = urlobj.get_full_url() - matchincl, matchexcl = False, False - - for urlfilter in self.inclfilters: - m = urlfilter.search(url) - if m: - debug("Inclusion filter for URL %s found", url) - matchincl = True - break - - if matchincl: - return False - - for urlfilter in self.exclfilters: - m = urlfilter.search(url) - if m: - debug("Exclusion filter for URL %s found", url) - matchexcl = True - break - - if matchexcl: - return True - - return False - -class HarvestManTextFilter(HarvestManBaseFilter): - """ Filter class for filtering out web pages based on the URL path string """ - - def __init__(self, contentfilters=[], metafilters=[]): - # Filter pattern strings - self.contentpatterns = contentfilters - self.metapatterns = metafilters - # print 'Content=>',self.contentpatterns - # print 'Meta=>',self.metapatterns - # Actual filters - # Text filters are always exclude filters, so - # no need of separate include & exclude keys - self.contentfilter = [] - # Meta filters - self.keywordfilter = [] - self.titlefilter = [] - self.descfilter = [] - # Parse and compile the filters - self.compile_filters() - - def compile_filters(self): - - # Content filter is straight forward - for pattern, casing, flags in self.contentpatterns: - self.contentfilter.append(self.make_regex(pattern, casing, flags)) - - # Some pre-processing is involved in meta-filters - for pattern,casing,flags,tags in self.metapatterns: - regex = self.make_regex(pattern, casing, flags) - if tags=='all': - # Append to all filters - self.keywordfilter.append(regex) - self.titlefilter.append(regex) - self.descfilter.append(regex) - else: - # Split and see which all tags are specified - tagslist = tags.split(',') - if 'title' in tagslist: - self.titlefilter.append(regex) - if 'keywords' in tagslist: - self.keywordfilter.append(regex) - if 'description' in tagslist: - self.descfilter.append(regex) - - - def filter(self, urldoc, urlobj): - """ Apply all URL filters on the passed URL document object - Return True if filtered and False if not filtered """ - - filterurl = False - - # Apply content filter - for cfilter in self.contentfilter: - m = cfilter.search(urldoc.content) - if m: - debug("Content filter for URL %s found" % urlobj) - self.context='Content' - return True - - # Apply meta filters - for tfilter in self.titlefilter: - m = tfilter.search(urldoc.title) - if m: - debug("Title filter for URL %s found" % urlobj) - self.context='Title' - return True - - for dfilter in self.descfilter: - m = dfilter.search(urldoc.description) - if m: - debug("Description filter for URL %s found" % urlobj) - self.context='Description' - return True - - for kfilter in self.keywordfilter: - matches = [kfilter.search(keyword) for keyword in urldoc.keywords] - if len(matches): - debug("Keyword filter for URL %s found" % urlobj) - self.context='Keyword' - return True - - return False - -class HarvestManJunkFilter(HarvestManBaseFilter): - """ Junk filter class. Filter out junk urls such - as ads, banners, flash files etc """ - - # Domain specific blocking - List courtesy - # junkbuster proxy. 
- block_domains =[ '1ad.prolinks.de', - '1st-fuss.com', - '247media.com', - 'admaximize.com', - 'adbureau.net', - 'adsolution.de', - 'adwisdom.com', - 'advertising.com', - 'atwola.com', - 'aladin.de', - 'annonce.insite.dk', - 'a.tribalfusion.com', - 'avenuea.com', - 'bannercommunity.de', - 'banerswap.com', - 'bizad.nikkeibp.co.jp', - 'bluestreak.com', - 'bs.gsanet.com', - 'cash-for-clicks.de', - 'cashformel.com', - 'cash4banner.de', - 'cgi.tietovalta.fi', - 'cgicounter.puretec.de', - 'click-fr.com', - 'click.egroups.com', - 'commonwealth.riddler.com', - 'comtrack.comclick.com', - 'customad.cnn.com', - 'cybereps.com:8000', - 'cyberclick.net', - 'dino.mainz.ibm.de', - 'dinoadserver1.roka.net', - 'disneystoreaffiliates.com', - 'dn.adzerver.com', - 'doubleclick.net', - 'ds.austriaonline.at', - 'einets.com', - 'emap.admedia.net', - 'eu-adcenter.net', - 'eurosponser.de', - 'fastcounter.linkexchange.com', - 'findcommerce.com', - 'flycast.com', - 'focalink.com', - 'fp.buy.com', - 'globaltrack.com', - 'globaltrak.net', - 'gsanet.com', - 'hitbox.com', - 'hurra.de', - 'hyperbanner.net', - 'iadnet.com', - 'image.click2net.com', - 'image.linkexchange.com', - 'imageserv.adtech.de', - 'imagine-inc.com', - 'img.getstats.com', - 'img.web.de', - 'imgis.com', - 'james.adbutler.de', - 'jmcms.cydoor.com', - 'leader.linkexchange.com', - 'linkexchange.com', - 'link4ads.com', - 'link4link.com', - 'linktrader.com', - 'media.fastclick.net', - 'media.interadnet.com', - 'media.priceline.com', - 'mediaplex.com', - 'members.sexroulette.com', - 'newsads.cmpnet.com', - 'ngadcenter.net', - 'nol.at:81', - 'nrsite.com', - 'offers.egroups.com', - 'omdispatch.co.uk', - 'orientserve.com', - 'pagecount.com', - 'preferences.com', - 'promotions.yahoo.com', - 'pub.chez.com', - 'pub.nomade.fr', - 'qa.ecoupons.com', - 'qkimg.net', - 'resource-marketing.com', - 'revenue.infi.net', - 'sam.songline.com', - 'sally.songline.com', - 'sextracker.com', - 'smartage.com', - 'smartclicks.com', - 'spinbox1.filez.com', - 'spinbox.versiontracker.com', - 'stat.onestat.com', - 'stats.surfaid.ihost.com', - 'stats.webtrendslive.com', - 'swiftad.com', - 'tm.intervu.net', - 'tracker.tradedoubler.com', - 'ultra.multimania.com', - 'ultra1.socomm.net', - 'uproar.com', - 'usads.imdb.com', - 'valueclick.com', - 'valueclick.net', - 'victory.cnn.com', - 'videoserver.kpix.com', - 'view.atdmt.com', - 'webcounter.goweb.de', - 'websitesponser.de', - 'werbung.guj.de', - 'wvolante.com', - 'www.ad-up.com', - 'www.adclub.net', - 'www.americanpassage.com', - 'www.bannerland.de', - 'www.bannermania.nom.pl', - 'www.bizlink.ru', - 'www.cash4banner.com', - 'www.clickagents.com', - 'www.clickthrough.ca', - 'www.commision-junction.com', - 'www.eads.com', - 'www.flashbanner.no', - 'www.mediashower.com', - 'www.popupad.net', - 'www.smartadserver.com', - 'www.smartclicks.com:81', - 'www.spinbox.com', - 'www.sponsorpool.net', - 'www.ugo.net', - 'www.valueclick.com', - 'www.virtual-hideout.net', - 'www.web-stat.com', - 'www.webpeep.com', - 'www.zserver.com', - 'www3.exn.net:80', - 'xb.xoom.com', - 'yimg.com' ] - - # Common block patterns. These are created - # in the Python regular expression syntax. - # Original list courtesy junkbuster proxy. 
- block_patterns = [ r'/*.*/(.*[-_.])?ads?[0-9]?(/|[-_.].*|\.(gif|jpe?g))', - r'/*.*/(.*[-_.])?count(er)?(\.cgi|\.dll|\.exe|[?/])', - r'/*.*/(.*[-_.].*)?maino(kset|nta|s).*(/|\.(gif|html?|jpe?g|png))', - r'/*.*/(ilm(oitus)?|kampanja)(hallinta|kuvat?)(/|\.(gif|html?|jpe?g|png))', - r'/*.*/(ng)?adclient\.cgi', - r'/*.*/(plain|live|rotate)[-_.]?ads?/', - r'/*.*/(sponsor|banner)s?[0-9]?/', - r'/*.*/*preferences.com*', - r'/*.*/.*banner([-_]?[a-z0-9]+)?\.(gif|jpg)', - r'/*.*/.*bannr\.gif', - r'/*.*/.*counter\.pl', - r'/*.*/.*pb_ihtml\.gif', - r'/*.*/Advertenties/', - r'/*.*/Image/BannerAdvertising/', - r'/*.*/[?]adserv', - r'/*.*/_?(plain|live)?ads?(-banners)?/', - r'/*.*/abanners/', - r'/*.*/ad(sdna_image|gifs?)/', - r'/*.*/ad(server|stream|juggler)\.(cgi|pl|dll|exe)', - r'/*.*/adbanner*', - r'/*.*/adfinity', - r'/*.*/adgraphic*', - r'/*.*/adimg/', - r'/*.*/adjuggler', - r'/*.*/adlib/server\.cgi', - r'/*.*/ads\\', - r'/*.*/adserver', - r'/*.*/adstream\.cgi', - r'/*.*/adv((er)?ts?|ertis(ing|ements?))?/', - r'/*.*/advanbar\.(gif|jpg)', - r'/*.*/advanbtn\.(gif|jpg)', - r'/*.*/advantage\.(gif|jpg)', - r'/*.*/amazon([a-zA-Z0-9]+)\.(gif|jpg)', - r'/*.*/ana2ad\.gif', - r'/*.*/anzei(gen)?/?', - r'/*.*/ban[-_]cgi/', - r'/*.*/banner_?ads/', - r'/*.*/banner_?anzeigen', - r'/*.*/bannerimage/', - r'/*.*/banners?/', - r'/*.*/banners?\.cgi/', - r'/*.*/bizgrphx/', - r'/*.*/biznetsmall\.(gif|jpg)', - r'/*.*/bnlogo.(gif|jpg)', - r'/*.*/buynow([a-zA-Z0-9]+)\.(gif|jpg)', - r'/*.*/cgi-bin/centralad/getimage', - r'/*.*/drwebster.gif', - r'/*.*/epipo\.(gif|jpg)', - r'/*.*/gsa_bs/gsa_bs.cmdl', - r'/*.*/images/addver\.gif', - r'/*.*/images/advert\.gif', - r'/*.*/images/marketing/.*\.(gif|jpe?g)', - r'/*.*/images/na/us/brand/', - r'/*.*/images/topics/topicgimp\.gif', - r'/*.*/phpAds/phpads.php', - r'/*.*/phpAds/viewbanner.php', - r'/*.*/place-ads', - r'/*.*/popupads/', - r'/*.*/promobar.*', - r'/*.*/publicite/', - r'/*.*/randomads/.*\.(gif|jpe?g)', - r'/*.*/reklaam/.*\.(gif|jpe?g)', - r'/*.*/reklama/.*\.(gif|jpe?g)', - r'/*.*/reklame/.*\.(gif|jpe?g)', - r'/*.*/servfu.pl', - r'/*.*/siteads/', - r'/*.*/smallad2\.gif', - r'/*.*/spin_html/', - r'/*.*/sponsor.*\.gif', - r'/*.*/sponsors?[0-9]?/', - r'/*.*/ucbandeimg/', - r'/*.*/utopiad\.(gif|jpg)', - r'/*.*/werb\..*', - r'/*.*/werbebanner/', - r'/*.*/werbung/.*\.(gif|jpe?g)', - r'/*ad.*.doubleclick.net', - r'/.*(ms)?backoff(ice)?.*\.(gif|jpe?g)', - r'/.*./Adverteerders/', - r'/.*/?FPCreated\.gif', - r'/.*/?va_banner.html', - r'/.*/adv\.', - r'/.*/advert[0-9]+\.jpg', - r'/.*/favicon\.ico', - r'/.*/ie_?(buttonlogo|static?|anim.*)?\.(gif|jpe?g)', - r'/.*/ie_horiz\.gif', - r'/.*/ie_logo\.gif', - r'/.*/ns4\.gif', - r'/.*/opera13\.gif', - r'/.*/opera35\.gif', - r'/.*/opera_b\.gif', - r'/.*/v3sban\.gif', - r'/.*Ad00\.gif', - r'/.*activex.*(gif|jpe?g)', - r'/.*add_active\.gif', - r'/.*addchannel\.gif', - r'/.*adddesktop\.gif', - r'/.*bann\.gif', - r'/.*barnes_logo\.gif', - r'/.*book.search\.gif', - r'/.*by/main\.gif', - r'/.*cnnpostopinionhome.\.gif', - r'/.*cnnstore\.gif', - r'/.*custom_feature\.gif', - r'/.*exc_ms\.gif', - r'/.*explore.anim.*gif', - r'/.*explorer?.(gif|jpe?g)', - r'/.*freeie\.(gif|jpe?g)', - r'/.*gutter117\.gif', - r'/.*ie4_animated\.gif', - r'/.*ie4get_animated\.gif', - r'/.*ie_sm\.(gif|jpe?g)', - r'/.*ieget\.gif', - r'/.*images/cnnfn_infoseek\.gif', - r'/.*images/pathfinder_btn2\.gif', - r'/.*img/gen/fosz_front_em_abc\.gif', - r'/.*img/promos/bnsearch\.gif', - r'/.*infoseek\.gif', - r'/.*logo_msnhm_*', - r'/.*mcsp2\.gif', - r'/.*microdell\.gif', - 
r'/.*msie(30)?\.(gif|jpe?g)', - r'/.*msn2\.gif', - r'/.*msnlogo\.(gif|jpe?g)', - r'/.*n_iemap\.gif', - r'/.*n_msnmap\.gif', - r'/.*navbars/nav_partner_logos\.gif', - r'/.*nbclogo\.gif', - r'/.*office97_ad1\.(gif|jpe?g)', - r'/.*pathnet.warner\.gif', - r'/.*pbbobansm\.(gif|jpe?g)', - r'/.*powrbybo\.(gif|jpe?g)', - r'/.*s_msn\.gif', - r'/.*secureit\.gif', - r'/.*sqlbans\.(gif|jpe?g)', - r'/BannerImages/' - r'/BarnesandNoble/images/bn.recommend.box.*', - r'/Media/Images/Adds/', - r'/SmartBanner/', - r'/US/AD/', - r'/_banner/', - r'/ad[-_]container/', - r'/adcycle.cgi', - r'/adcycle/', - r'/adgenius/', - r'/adimages/', - r'/adproof/', - r'/adserve/', - r'/affiliate_banners/', - r'/annonser?/', - r'/anz/pics/', - r'/autoads/', - r'/av/gifs/av_logo\.gif', - r'/av/gifs/av_map\.gif', - r'/av/gifs/new/ns\.gif', - r'/bando/', - r'/bannerad/', - r'/bannerfarm/', - r'/bin/getimage.cgi/...\?AD', - r'/cgi-bin/centralad/', - r'/cgi-bin/getimage.cgi/....\?GROUP=', - r'/cgi-bin/nph-adclick.exe/', - r'/cgi-bin/nph-load', - r'/cgi-bin/webad.dll/ad', - r'/cgi/banners.cgi', - r'/cwmail/acc\.gif', - r'/cwmail/amzn-bm1\.gif', - r'/db_area/banrgifs/', - r'/digitaljam/images/digital_ban\.gif', - r'/free2try/', - r'/gfx/bannerdir/', - r'/gif/buttons/banner_.*', - r'/gif/buttons/cd_shop_.*', - r'/gif/cd_shop/cd_shop_ani_.*', - r'/gif/teasere/', - r'/grafikk/annonse/', - r'/graphics/advert', - r'/graphics/defaultAd/', - r'/grf/annonif', - r'/hotstories/companies/images/companies_banner\.gif', - r'/htmlad/', - r'/image\.ng/AdType', - r'/image\.ng/transactionID', - r'/images/.*/.*_anim\.gif', - r'/images/adds/', - r'/images/getareal2\.gif', - r'/images/locallogo.gif', - r'/img/special/chatpromo\.gif', - r'/include/watermark/v2/', - r'/ip_img/.*\.(gif|jpe?g)', - r'/ltbs/cgi-bin/click.cgi', - r'/marketpl*/', - r'/markets/images/markets_banner\.gif', - r'/minibanners/', - r'/ows-img/bnoble\.gif', - r'/ows-img/nb_Infoseek\.gif', - r'/p/d/publicid', - r'/pics/amzn-b5\.gif', - r'/pics/getareal1\.gif', - r'/pics/gotlx1\.gif', - r'/promotions/', - r'/rotads/', - r'/rotations/', - r'/torget/jobline/.*\.gif' - r'/viewad/' - r'/we_ba/', - r'/werbung/', - r'/world-banners/', - r'/worldnet/ad\.cgi', - r'/zhp/auktion/img/' ] - - - def __init__(self): - self.msg = '' - self.match = '' - # Compile pattern list for performance - self.patterns = map(re.compile, self.block_patterns) - # Create base domains list from domains list - self.base_domains = map(self.base_domain, self.block_domains) - - def reset_msg(self): - self.msg = '' - - def reset_match(self): - self.msg = '' - - def filter(self, urlobj): - """ Apply Junk filter on the passed URL object. Return True - if filtered and False if not filtered """ - - self.reset_msg() - self.reset_match() - - # Check domain first - ret = self._check_domain(urlobj) - if ret: - return ret - - # Check pattern next - return self._check_pattern(urlobj) - - def base_domain(self, domain): - - if domain.count(".") > 1: - strings = domain.split(".") - return "".join((strings[-2], strings[-1])) - else: - return domain - - def _check_domain(self, url_obj): - """ Check whether the url belongs to a junk - domain. 
Return true if url is O.K (NOT a junk - domain) and False otherwise """ - - # Get base server of the domain with port - base_domain_port = url_obj.get_base_domain_with_port() - # Get domain with port - domain_port = url_obj.get_domain_with_port() - - # First check for domain - if domain_port in self.block_domains: - self.msg = '' - return True - # Then check for base domain - else: - if base_domain_port in self.base_domains: - self.msg = '' - return True - - return False - - def _check_pattern(self, url_obj): - """ Check whether the url matches a junk pattern. - Return true if url is O.K (not a junk pattern) and - false otherwise """ - - url = url_obj.get_full_url() - - indx=0 - for p in self.patterns: - # Do a search, not match - if p.search(url): - self.msg = '' - self.match = self.block_patterns[indx] - return True - - indx += 1 - - return False - - def get_error_msg(self): - return self.msg - - def get_match(self): - return self.match - -if __name__=="__main__": - import urlparser - - # Test filter class - filter = HarvestManJunkFilter() - - # Violates, should return False - # The first two are direct domain matches, the - # next two are base domain matches. - u = urlparser.HarvestManUrl("http://a.tribalfusion.com/images/1.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - u = urlparser.HarvestManUrl("http://stats.webtrendslive.com/cgi-bin/stats.pl") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - u = urlparser.HarvestManUrl("http://stats.cyberclick.net/cgi-bin/stats.pl") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - u = urlparser.HarvestManUrl("http://m.doubleclick.net/images/anim.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - - # The next are pattern matches - u = urlparser.HarvestManUrl("http://www.foo.com/popupads/ad.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://www.foo.com/htmlad/1.html") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://www.foo.com/logos/nbclogo.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://www.foo.com/bar/siteads/1.ad") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://www.foo.com/banners/world-banners/banner.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://ads.foo.com/") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - - - # This one should not match - u = urlparser.HarvestManUrl("http://www.foo.com/doc/logo.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - # This also... - u = urlparser.HarvestManUrl("http://www.foo.org/bar/vodka/pattern.html") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - diff --git a/HarvestMan-lite/harvestman/lib/gui.py b/HarvestMan-lite/harvestman/lib/gui.py deleted file mode 100755 index 16b0e88..0000000 --- a/HarvestMan-lite/harvestman/lib/gui.py +++ /dev/null @@ -1,664 +0,0 @@ -""" -gui.py - Module which provides a browser based UI -mode to HarvestMan using web.py. This module is part -of the HarvestMan program. 
- -Created Anand B Pillai Jun 01 2008 - -Copyright (C) 2008, Anand B Pillai. -""" - -import sys, os -import web -import webbrowser -import time - -from web import form, net #, request - -def get_templates_location(): - # Templates are located at harvestman/ui/templates folder... - top = os.path.dirname(os.path.dirname(os.path.abspath(globals()['__file__']))) - template_dir = os.path.join(top, 'ui','templates') - return template_dir - -# Global render object -g_render = web.template.render(get_templates_location()) - -CONFIG_HTML_TEMPLATE="""\ -HarvestMan Configuration File Generator -%s - - -%s - - -""" - -PLUG_TEMPLATE="""\ - -""" - -PLUGINS_TEMPLATE="""\ - - %s - -""" - - -CONFIG_XML_TEMPLATE="""\ - - - - - - - - - - - %(url)s - %(projname)s - - %(basedir)s - - - - - - - %(proxy)s - %(puser)s - %(ppasswd)s - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - %(urlpriority)s - %(serverpriority)s - - - %(urlfilter)s - %(serverfilter)s - %(wordfilter)s - - - %(PLUGIN)s - - - - - - - - - - - - - - - - - - - - - - - - - - - %(urltreefile)s - - - - - - - - - - - - -""" - -def render_stylesheet(): - css = """\ - - """ - - return css - - -def render_tabs(): - - content ="""\ - HarvestMan Web Console - - - - - - - - -

[Tag-stripped HTML omitted in this export. The recoverable content: the template body builds the "HarvestMan Web Console" page with four tabs and their menu entries:
    Configuration  - User configuration, System configuration, New configuration
    Projects       - Project History, Current Project, New Project
    Documentation  - Release Notes, Change History, API Documentation, HOWTOs & Tutorials
    About          - "HarvestMan - Web crawling application/framework written in pure Python", plus a "HarvestMan on the Web" link]
- - - """ - - return content - - - - -############## Start web.py custom widgets #################################################### - -class SizedTextbox(form.Textbox): - """ A GUI class for a textbox which accepts an argument for - its size """ - - def __init__(self, name, size, title='', *validators, **attrs): - super(SizedTextbox, self).__init__(name, *validators, **attrs) - self.size = size - self.val = self.value - self.title = title - - def render(self): - x = 'No messages, 5=>Maximum messages", - [0,1,2,3,4,5], value=2), - Label("Network Configuration", True), - SizedTextbox("Proxy Server", 50, "Proxy server address for your network, if any"), - SizedTextbox("Proxy Server Port",10, "Port number for the proxy server", - value=80), - SizedTextbox("Proxy Server Username", 20, - "Username for authenticating the proxy server (leave blank for unauthenticated proxies)"), - SizedTextbox("Proxy Server Password", 20, - "Password for authenticating the proxy server (leave blank for unauthenticated proxies)"), - Label("Download Types/Caching/Protocol Configuration", True), - MyDropbox("HTML", 'Save HTML pages ?', ["Yes","No"]), - MyDropbox("Images",'Save images in pages ?',["Yes","No"]), - MyDropbox("Video",'Save video URLs (movies) ?',["No","Yes"]), - MyDropbox("Flash",'Save Adobe Flash URLs ?',["No","Yes"]), - MyDropbox("Audio",'Save audio URLs (sounds) ?',["No","Yes"]), - MyDropbox("Documents",'Save Microsoft Office, Openoffice, PDF and Postscript files ?', - ["Yes","No"]), - MyDropbox("Javascript",'Save server-side javascript URLs ?',["Yes","No"]), - MyDropbox("Javaapplet",'Save java applet class files ?',["Yes","No"]), - MyDropbox("Query Links",'Save links of the form "http://www.foo.com/query?param=val" ?', - ["Yes","No"]), - MyDropbox("Caching",'Enable URL caching in HarvestMan ?', - ["Yes","No"]), - MyDropbox("Data Caching",'Enable caching of URL data in the cache (requires more space) ?', - ["No","Yes"]), - MyDropbox("HTTP Compression",'Accept gzip compressed data from web servers ?', - ["Yes","No"]), - SizedTextbox("Retry Attempts", 10, - 'Number of additional download tries for URLs which produce errors', - value=1), - Label("Download Limits/Extent Configuration", True), - MyDropbox("Fetch Level", - 'Fetch level for the crawl (see FAQ)',[0,1,2,3,4]), - MyDropbox("Crawl Sub-domains", - 'Crawls "http://bar.foo.com" when starting URL belongs to "http://foo.com"', - ["No","Yes"]), - SizedTextbox("Maximum Files Limit",10, - 'Stops crawl when number of files downloaded reaches this limit', - value=5000), - SizedTextbox("Maximum File Size Limit",10, - 'Ignore URLs whose size is larger than this limit', - value=5242880), - SizedTextbox("Maximum Connections Limit",10, - 'Maximum number of simultaneously open HTTP connections', - value=5), - SizedTextbox("Maximum Bandwidth Limit(kb)",10, - 'Maximum number of bandwidth used for given HTTP connections', - value=0), - SizedTextbox("Crawl Time Limit",10, - 'Stops crawl after the crawl duration reaches this limit', - value=-1), - Label("Download Rules/Filters Configuration", True), - MyDropbox("Robots Rules", - 'Obey robots.txt and META ROBOTS rules ?', - ["Yes","No"]), - SizedTextbox("URL Filter String",100,'A filter string for URLs (see FAQ)'), - # SizedTextbox("Server Filter String",100, 'A filter string for servers (see FAQ)'), - SizedTextbox("Word Filter String",100, - 'A generic word filter based on regular expressions to filter web pages'), - MyDropbox("JunkFilter",'Enable the advertisement/banner/other junk URL filter ?', - ["Yes","No"]), 
- Label("Download Plugins Configuration", True), - Label("Add up-to 5 valid plugins in the boxes below",italic=True), - SizedTextbox("Plugin 1",20,'Enter the name of your plugin module here, without the .py* suffix'), - SizedTextbox("Plugin 2",20,'Enter the name of your plugin module here, without the .py* suffix'), - SizedTextbox("Plugin 3",20,'Enter the name of your plugin module here, without the .py* suffix'), - SizedTextbox("Plugin 4",20,'Enter the name of your plugin module here, without the .py* suffix'), - SizedTextbox("Plugin 5",20,'Enter the name of your plugin module here, without the .py* suffix'), - Label("Files Configuration", True), - SizedTextbox("Url Tree File", 20, - 'A filename which will capture parent/child relationship of all processed URLs', - value=''), - MyDropbox("Archive Saved Files", 'Archive all saved files to a single tar archive file ?', - ["No","Yes"]), - MyDropbox("Archive Format",'Archive format (tar.bz2 or tar.gz)',["bzip","gzip"]), - MyDropbox("Serialize URL Headers",'Serialize all URL headers to a file (urlheaders.db) ?', - ["Yes","No"]), - MyDropbox("Localise Links",'Convert outward (web) pointing links to disk pointing links ?', - ["No","Yes"]), - Label("Misc Configuration", True), - MyDropbox("Create Project Browse Page",'Create an HTML page which summarizes all crawled projects ?', - ["No","Yes"]), - Label("Advanced Configuration Settings", True), - Label('These are configuration parameters which are useful only for advanced tweaking. Most users can ignore the following settings and use the defaults',italic=True), - Label("Download Limits/Extent/Filters/Rules Configuration", True, True), - MyDropbox("Fetch Image Links Always", - 'Ignore download rules when fetching images ?',["Yes","No"]), - MyDropbox("Fetch Stylesheet Links Always", - 'Ignore download rules when fetching stylesheets ?',["Yes","No"]), - SizedTextbox("Links Offset Start", 10, - 'Offset of child links measured from zero (useful for crawling web directories)', - value=0), - SizedTextbox("Links Offset End", 10, - 'Offset of child links measured from end (useful for crawling web directories)', - value=-1), - MyDropbox("URL Depth", 'Maximum depth of a URL in relation to the starting URL', - [10,9,8,7,6,5,4,3,2,1,0]), - MyDropbox("External URL Depth", - 'Maximum depth of an external URL in relation to its server root (useful for only fetchlevels >1)', - [0,1,2,3,4,5,6,7,8,9,10]), - MyDropbox("Ignore TLDs (Top level domains)", - 'Consider http://foo.com and http://foo.org as the same server (dangerous)', - ["No","Yes"]), - SizedTextbox("URL Priority String",100,'A priority string for URLs (see FAQ)'), - # SizedTextbox("Server Priority String",100, 'A priority string for servers (see FAQ)'), - Label("Parser Configuration", True, True), - - Label("Enable/Disable parsing of the tags shown below",italic=True), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag
", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag