diff --git a/AboutHarvestMan.md b/AboutHarvestMan.md
new file mode 100644
index 0000000..aa91205
--- /dev/null
+++ b/AboutHarvestMan.md
@@ -0,0 +1,33 @@
# What is HarvestMan #

HarvestMan is an open source, multi-threaded, modular, extensible web crawler program/framework in pure Python.

HarvestMan can be used to download files from websites, according to a number of user-specified rules. The latest version of HarvestMan supports more than 60 customization options. HarvestMan is a console (command-line) application.

HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.

# History of HarvestMan #
 1. The HarvestMan crawler was started by Anand B Pillai in June 2003 as a hobby project to develop a personal web crawler in Python, along with Nirmal Chidambaram.
 1. Nirmal wrote the original code in mid June 2003 (one module, a single-threaded crawler), which Anand improved substantially and developed into a multithreaded crawler.
 1. The first version (0.8) was released by Anand in July 2003.
 1. Released on [freshmeat](http://www.freshmeat.net/projects/harvestman) (1.3) in Dec 2003.
 1. Eight releases were done between Dec 2003 (1.3) and Dec 2004 (1.4).
 1. The project was chosen as the crawler for the [EIAO](http://www.eiao.net) web accessibility observatory in Feb 2005. EIAO chose version 1.4, which then underwent several minor releases.
 1. The most recent release is 1.4.6, released in Sep 2005.
 1. Since early 2006, HarvestMan has been undergoing development along with the EIAO project (mostly driven by EIAO feedback), but no public releases have been made.
 1. Version 1.5 started development in mid 2006, but was never released.
 1. Version 1.4.6 was accepted into Debian in March 2006.
 1. By mid 2007 the program had accumulated so many changes that the version number under development was incremented from 1.5 to 2.0. Version 2.0 has effectively been under development since mid 2006, but most code changes happened after mid 2007.
 1. Version 1.4.6 entered the Ubuntu repositories in May 2007.
 1. Development was hosted at [BerliOS](http://developer.berlios.de/projects/harvestman) till June 2008, when it was moved to Google Code.
 1. Contributors to 2.0 till June 2008 - Anand B Pillai (main), Nils Ulltveit-Moe (EIAO), Morten Goodwin Olsen (EIAO), John Kleven.
 1. Version 2.0 alpha package releases started in Aug 2007 on the website.
 1. HarvestMan won the FOSS India Award in April 2008.
 1. In June 2008, Lukasz Szybalski joined the team.

## Future of HarvestMan ##
> It is a brave new world out there... :-)
> Well, currently the development stands at 2.0.5 beta, i.e. the 2.0 version is still
> not completed. The development is slow and I need to take time off from a regular
> job to do this, so well, can't give a final date for this, but hopefully one day
> it will be done :)
\ No newline at end of file
diff --git a/ConfigXml.md b/ConfigXml.md
new file mode 100644
index 0000000..09019ea
--- /dev/null
+++ b/ConfigXml.md
@@ -0,0 +1,108 @@
# Harvestman config.xml #
## Configuration File Structure ##

The configuration file is divided into categories that split the configuration options into different sections. At present, the configuration file has the following namespaces:

 1. **project** - This section holds the options related to the current HarvestMan project.
 1. **network** - This section holds the configuration options related to your network connection.
 1. **download** - This section holds configuration options that affect your downloads in a generic way.
 1. **control** - This section is similar to the above one, but holds options that affect your downloads in a much more specific way. This is a kind of 'tweak' section that allows you to exert more fine-grained control over your projects.
 1. **system** - This section controls the threading options, regional (locale) options and any other options related to the Python interpreter and your computer.
 1. **indexer** - This section holds variables related to how the files are processed after downloading. Right now it holds variables related to localizing links.
 1. **files** - This section holds variables that control the files created by HarvestMan, namely the error log, the message log and an optional URL log.
 1. **display** - This holds a single variable related to creating a browser page for all HarvestMan projects on your computer.

## Control Section ##
### fetchlevel ###

HarvestMan defines five fetchlevels with values ranging from 0 to 4 inclusive. These define the rules for the download of files from servers other than the server of the starting URL. In general, **increasing the fetch level allows the program to crawl more files** on the Internet.

A fetchlevel of "0" provides the maximum constraint for the download. This limits the download of files to all paths in the starting server, only inside and below the directory of the starting URL.

For example, with a fetchlevel of zero, if your starting URL is http://www.foo.com/bar/images/images.html, the program will download only those files inside the `images` sub-directory and directories below it, and no other files.

The next level, a fetch level of "1", again limits the download to the starting server (and sub-domains in it, if the sub-domain variable is not set), but does not allow it to crawl sites other than the starting server. In the above example, this will fetch all links in the server http://www.foo.com encountered in the starting page.

A fetch level of "2" performs a fetching of all links in the starting server encountered in the starting URL, as well as any links in outside (external) servers linked directly from pages in the starting server. It does not allow the program to crawl pages linked further away, i.e. the second-level links linked from the external servers.

A fetch level of "3" performs a similar operation, with the main difference that it acts like a combination of fetchlevels "0" and "2" minus "1". That is, it gets all links under the directory of the starting URL and first-level external links, but does not fetch links outside the directory of the starting URL.

A fetch level of "4" gives the user no control over the levels of fetching, and the program will crawl whichever link is available to it, unless limited by other download control options like depth control, domain filters, URL filters, file limits, maximum server limits etc.

Place the parameter in the **control** element, under the **extent** section.
Here is a sample XML element including this new param.
```
<control>
  ...
  <extent>
    ...
    <fetchlevel value="0"/>
    ...
  </extent>
  ...
</control>
```

**The value can be 0, 1, 2, 3 or 4.**

See the FAQ for more explanations.

### maxbandwidth ###
MaxBandwidth controls the speed of crawling. Throttling of bandwidth is useful when you are downloading a large amount of data from a host. MaxBandwidth helps prevent the "denial of service" that an uncontrolled crawl could impose on the crawled server.
By using this configuration variable you can limit your download speed, for example to 5 kb per second. At this speed the host should not have any problems serving your crawl and will be able to proceed with its normal operations.

Place the parameter in the **control** element, under the **limits** section.
Here is a sample XML element including this new param.
```
<control>
  ...
  <limits>
    ...
    <maxbandwidth value="5"/>
    ...
  </limits>
  ...
</control>
```

**The value needs to be specified in kb/sec, not in bytes/sec.**

### maxbytes ###
MaxBytes controls how many bytes your crawl will download. It is useful when you are downloading a large amount of data from a host and, in conjunction with MaxBandwidth, want to limit how much data is downloaded. By using this configuration variable together with maxbandwidth you can, for example, set your crawl to download 10 MB at 5 kb/s. With this fine-grained control of your download size and speed, the host should not have any problems serving your crawl and will be able to proceed with its normal operations.

Place the parameter in the **control** element, under the **limits** section.
Here is a sample XML element including this new param.
```
<control>
  ...
  <limits>
    ...
    <maxbytes value="10MB"/>
    ...
  </limits>
  ...
</control>
```

**The value accepts plain numbers (assumed to be bytes), KB, MB and GB.**
```
<maxbytes value="5000"/>  - End crawl at 5000 bytes
<maxbytes value="10kb"/>  - End crawl at 10 KB
<maxbytes value="50MB"/>  - End crawl at 50 MB
<maxbytes value="1GB"/>   - End crawl at 1 GB
```
\ No newline at end of file
diff --git a/FAQ.md b/FAQ.md
new file mode 100644
index 0000000..dbf965b
--- /dev/null
+++ b/FAQ.md
@@ -0,0 +1,1068 @@
## This is still a work in progress and has been lifted verbatim from the HarvestMan web-site with little or no modification. A lot of the information is out of date and needs to be updated. Also, the FAQ doesn't conform to wiki style, so proceed with care! ##

HarvestMan - FAQ
Version 2.0
NOTE: The FAQ is currently being modified to be in sync with HarvestMan 1.4, so you might find that some parts of the FAQ are inconsistent with the rest of it. This is because some of the FAQ has been modified, while the rest is still to be modified.

  * 1. Overview
    o 1.1. What is HarvestMan?
    o 1.2. Why do you call it HarvestMan?
    o 1.3. What HarvestMan can be used for?
    o 1.4. What HarvestMan cannot be used for...
    o 1.5. What do I need to run HarvestMan?
  * 2. Usage
    o 2.1. What is the HarvestMan Configuration File?
    o 2.2. Can HarvestMan be run as a command-line application?
  * 3. Architecture
    o 3.1. What are "tracker" threads and what is their function?
    o 3.2. What are "crawler" threads?
    o 3.3. What are "fetcher" threads?
    o 3.4. How do the crawlers and fetchers co-operate?
    o 3.5. How many different Queues of information flow are there?
    o 3.6. What are worker (downloader) threads?
    o 3.7. How does a HarvestMan project finish?
  * 4. Protocols & File Types
    o 4.1. What are the protocols supported by HarvestMan?
    o 4.2. What kind of files can be downloaded by HarvestMan?
    o 4.3. Can HarvestMan run javascript code?
    o 4.4. Can HarvestMan run java applets?
    o 4.5. How to prevent downloading of HTML & CGI forms?
    o 4.6. Does HarvestMan download dynamically generated cgi files (server-side)?
    o 4.7. How does HarvestMan determine the filetype of dynamically generated server side files?
    o 4.8. Does HarvestMan obey the Robots Exclusion Protocol?
    o 4.9. Can I restart a project to download links that failed (Caching Mechanism)?
  * 5. Network, Security & Access Rules
    o 5.1. Can HarvestMan work across proxies?
    o 5.2. Does HarvestMan support proxy authentication?
    o 5.3. Does HarvestMan work inside an intranet?
    o 5.4. Can HarvestMan crawl a site that requires HTTP authentication?
    o 5.5. Can HarvestMan crawl a site that requires HTTPS (SSL) authentication?
    o 5.6. Can I prevent the program from accessing specific domains?
    o 5.7. Can I specify download filters to prevent download of certain files or directories on a server?
    o 5.8. Is it possible to control the depth of traversal in a domain?
  * 6. Download Control - Basic
    o 6.1. Can I set a limit on the maximum number of files that are downloaded?
    o 6.2. Can I set a limit on the number of external servers crawled?
    o 6.3. Can I set a limit on the number of outside directories that are crawled?
    o 6.4. How can I prevent download of images?
    o 6.5. How can I prevent download of stylesheets?
    o 6.6. How to disable traversal of external servers?
    o 6.7. Can I specify a project timeout?
    o 6.8. Can I specify a thread timeout for worker threads?
    o 6.9. How to tell the program to retry failed links?
  * 7. Download Control - Advanced
    o 7.1. What are fetchlevels and how can I use them?
  * 8. Application development & customization
    o 8.1. I want to customize HarvestMan for a research project. Can you help out?
    o 8.2. I want to customize HarvestMan for a commercial project. Can you help out?
  * 9. Diagrams
    o 9.1. HarvestMan Class Diagram

1. Overview

1.1. What is HarvestMan?
HarvestMan (with a capital 'H' and a capital 'M') is a webcrawler program. HarvestMan belongs to a family of programs frequently addressed as webcrawlers, webbots, web-robots, offline browsers etc.

These programs are used to crawl a distributed network of computers like the Internet and download files locally.

1.2. Why do you call it HarvestMan?
The name "HarvestMan" is derived from a kind of small spider found in different parts of the world, called "Daddy long legs" or Opiliones.

Since this program is a web-spider, the analogy was compelling to name it after some species of spider. The process of downloading data from websites is also sometimes called harvesting.

Both these similarities gave rise to the name HarvestMan.

1.3. What HarvestMan can be used for?
HarvestMan is a desktop tool for web search/data gathering. It works on the client side.

As of the most recent version, HarvestMan can be used to:

 1. Download a website or a part of it.
 2. Download certain files from a website (matching certain patterns).
 3. Search a website for keywords & download the files containing them.
 4. Scan a website for links and download them specifically using filters.

1.4. What HarvestMan cannot be used for...
HarvestMan is a small-to-medium size web-crawler mostly intended for personal use or for use by a small group. It cannot be used for massive data harvesting from the web. However, a project to create a large-scale, distributed web crawler based on HarvestMan is underway. It is called 'Distributed HarvestMan', or 'D-HarvestMan' in short. D-HarvestMan is currently at a prototype stage.

Projects like EIAO have been able to customize HarvestMan for medium-to-large scale data gathering from the Internet. The EIAO project uses HarvestMan to download as much as 100,000 files from European websites daily.

What HarvestMan is not:

 1. HarvestMan is not an Internet search engine.
 2. HarvestMan is not an indexer or taxonomy tool for web documents.
 3. HarvestMan is not a server-side program.

1.5. What do I need to run HarvestMan?
HarvestMan is written in a programming language called Python. Python is an interactive, interpreted, object-oriented programming language created by Guido van Rossum and maintained by a team of volunteers from all over the world. Python is a very versatile language which can be used for a variety of tasks ranging from scripting to web frameworks to developing highly complex applications.

HarvestMan is written completely in Python. It works with Python version 2.3 upward on all platforms where Python runs. However, HarvestMan has some performance optimizations that require the latest version of Python, which is Python 2.4, so that is the suggested version. HarvestMan will also work with Python 2.3, but with reduced performance.

You need a machine with a rather large amount of RAM to run HarvestMan. HarvestMan tends to use system memory heavily, especially when performing large data downloads or when run with more than 10 threads. It is preferable to have a machine with 512 MB RAM and a fast CPU (Intel Pentium IV or higher) to run HarvestMan efficiently.

2. Usage

2.1. How do I run HarvestMan?
HarvestMan is a command-line application. It has no GUI.

From the 1.4 version, HarvestMan can be run by calling the main HarvestMan module as an executable script on the command-line as follows:

% harvestman.py

This will work provided that you have edited your environment PATH variable to include the local HarvestMan installation directory on your machine. If you have not, you can run HarvestMan by passing the harvestman.py module as an argument to the Python interpreter, as follows:

% python harvestman.py

On Win32 systems, if you have associated the ".py" extension with the appropriate python.exe, you can run HarvestMan without invoking the interpreter explicitly.

Note that this assumes that you have a config file named config.xml in the directory from where you invoke HarvestMan. If you don't have a config file locally, you need to use the command-line options of HarvestMan to pass a different configuration file to the program.

2.2. What is the HarvestMan Configuration (config) file?

The standard way to run HarvestMan is to run the program with no arguments, allowing it to pick up its configuration parameters from an XML configuration file which is named config.xml by default.

It is also possible to pass command-line options to HarvestMan. HarvestMan supports a limited set of command-line options which allow you to run the program without using a configuration file. You can learn more about the command-line options in the HarvestMan command-line options FAQ.

The HarvestMan configuration is an XML file with the configuration options split into different elements and their hierarchies.
A typical HarvestMan configuration file looks as follows:

    <HarvestMan>
      <config version="3.0" xmlversion="1.0">
        <project>
          <url>http://www.python.org/doc/current/tut/tut.html</url>
          <name>pytut</name>
          <basedir>~/websites</basedir>
          <verbosity value="3"/>
          <timeout value="600.0"/>
        </project>
        <network>
          <proxy>
            <proxyserver></proxyserver>
            <proxyuser></proxyuser>
            <proxypasswd></proxypasswd>
            <proxyport value=""/>
          </proxy>
          <urlserver status="0">
            <urlhost>localhost</urlhost>
            <urlport value="3081"/>
          </urlserver>
        </network>
        <download>
          <types>
            <html value="1"/>
            <images value="1"/>
            <javascript value="1"/>
            <javaapplet value="1"/>
            <forms value="0"/>
            <cookies value="1"/>
          </types>
          <cache status="1">
            <datacache value="1"/>
          </cache>
          <misc>
            <retries value="1"/>
            <tidyhtml value="1"/>
          </misc>
        </download>
        <control>
          <links>
            <imagelinks value="1"/>
            <stylesheetlinks value="1"/>
          </links>
          <extent>
            <fetchlevel value="0"/>
            <extserverlinks value="0"/>
            <extpagelinks value="1"/>
            <depth value="10"/>
            <extdepth value="0"/>
            <subdomain value="0"/>
          </extent>
          <limits>
            <maxextservers value="0"/>
            <maxextdirs value="0"/>
            <maxfiles value="5000"/>
            <maxfilesize value="1048576"/>
            <connections value="5"/>
            <requests value="5"/>
            <timelimit value="-1"/>
          </limits>
          <rules>
            <robots value="1"/>
            <urlpriority></urlpriority>
            <serverpriority></serverpriority>
          </rules>
          <filters>
            <urlfilter></urlfilter>
            <serverfilter></serverfilter>
            <wordfilter></wordfilter>
            <junkfilter value="0"/>
          </filters>
        </control>
        <system>
          <workers status="1" size="10" timeout="200"/>
          <trackers value="4"/>
          <locale>american</locale>
          <fastmode value="1"/>
        </system>
        <files>
          <urllistfile></urllistfile>
          <urltreefile></urltreefile>
        </files>
        <indexer>
          <localise value="2"/>
        </indexer>
        <display>
          <browsepage value="1"/>
        </display>
      </config>
    </HarvestMan>

The current configuration file holds more than 60 configuration options. The variables that are essential to a project are project.url, project.name and project.basedir. These determine the identity of a HarvestMan crawl and normally require unique values for each HarvestMan project.

For a more detailed discussion of the config file, see the ConfigXml page.

2.3. Can HarvestMan be run as a command-line application?
Yes, it can. For details on this, refer to the Command line FAQ.

3. Architecture

3.1. HarvestMan is a multithreaded program. What is the threading architecture of HarvestMan?
HarvestMan uses a multithreaded architecture. It assigns each thread specific functions which help the program to complete the downloads at a relatively fast pace.

HarvestMan is a network-bound program. This means that most of the program's time is spent waiting for network connections, fetching network data and closing the connections. HarvestMan can be considered to be not IO-bound, since we can assume that there is ample disk space for the downloads, at least in most common cases.

Whenever a program is network-bound or IO-bound, it helps to split the task into multiple threads of control, which perform their function without affecting other threads or the main thread.

HarvestMan uses this theory to create a multithreaded system of co-operating threads, most of which gather data from the network, process the data and write the files to the disk. These threads are called tracker threads. The name is derived from the fact that a thread tracks a web-page, downloads its links and further tracks each of the pages pointed to by the links, doing this recursively for each link.

HarvestMan uses a pre-emptive threaded architecture where trackers are launched when the program starts. They wait in turns for work, which is managed by a thread-safe Queue of data. Tracker threads post and retrieve data from the queue. These threads die only at the end of the program, spinning in a loop otherwise, looking for data.

There are two different kinds of trackers, namely crawlers and fetchers. These are described in the sections below.

3.2. What are "crawler" threads?
Crawlers, or crawler-threads, are trackers which perform the specific function of parsing a web-page. They parse the data from a web-page, extract the links, and post the links to a url queue.

The crawlers get their data from a data queue.

3.3. What are "fetcher" threads?
Fetchers, or fetcher-threads, are trackers which perform the function of "fetching", i.e. downloading the files pointed to by urls. They download URLs which do not produce web-page content (HTML/XHTML), statically or dynamically, i.e. non-webpage URLs such as images, pdf files, zip files etc.

The fetchers get their data from the url queue and they post web-page data to the data queue.

3.4. How do the crawlers and fetchers co-operate?
The design of HarvestMan forces the crawlers and fetchers to be synergic. This is because the crawlers obtain their data (web-page data) from the data queue and post their results to the url queue. The fetchers in turn obtain their data (urls) from the url queue and post their results to the data queue.

The program starts off by spawning the first thread, which is a fetcher. It gets the web-page data for the starting page and posts it to the data queue. The first crawler in line gets this data, parses it and extracts the links, posting them to the url queue. The next fetcher thread waiting on the url queue gets this data, and the process repeats in a synergic manner, till the program runs out of urls to parse, when the project ends.

3.5. How many different Queues of information flow are there?
There are two queues of data flow, the url queue and the data queue.

The crawlers feed the url queue and feed-off the data queue.
The fetchers feed the data queue and feed-off the url queue.

(feed = post data to, feed-off = get data from)

3.6. What are "worker" (downloader) threads?
Apart from the tracker threads, you can specify additional threads to take charge of downloading urls.
The urls can be downloaded in these threads instead of consuming the time of the fetcher threads.

These threads are launched 'a priori', similar to the tracker threads, before the start of the crawl. By default, HarvestMan launches a set of 10 of these worker threads, which are managed by a thread pool object. The fetcher threads delegate the actual job of downloading to the workers. However, if the worker threads are disabled, the fetchers will do the downloads themselves.

These threads also die only at the end of a HarvestMan crawl.

3.7. How does a HarvestMan project finish?
(Make sure that you have read items 3.1 - 3.6 before reading this.)

As mentioned before, HarvestMan works by the co-operation of the crawler and fetcher families of tracker threads, each feeding on the data provided by the other.

A project nears its end when there are no more web-pages to crawl according to the configuration of the project. This means that the fetchers have less web-page data to fetch, which in turn dries up the data source for the crawlers. The crawlers in turn go idle, thus posting less data to the url queue, which again dries up the data source for the fetchers. The synergy works in this phase also, as it does when the project is active and all tracker threads are running.

After some time, all the tracker threads go idle, as there is no more data to feed from the queues. In the main thread of the HarvestMan program, there is a loop that spins continuously, checking for this event. Once all threads go idle, the loop detects it and exits; the project (and the program) comes to a halt.

The HarvestMan main thread enters this loop immediately after spawning all the tracker threads and waits in the loop till the project is done. It checks for the idle condition every 1 or 2 seconds, spinning in a loop. Once it detects that all threads have gone idle, it ends the threads, performs post-download operations, cleanup etc. and brings the program to an end.

4. Protocols & File Types

4.1. What are the protocols supported by HarvestMan?

HarvestMan supports the following protocols:

 1. HTTP
 2. FTP

Support for the HTTPS (SSL) protocol depends on the Python version you are running. Python 2.3 and later has HTTPS support built into Python, so HarvestMan will support the HTTPS protocol if you are running it using Python 2.3 or higher versions.

The GOPHER and FILE:// protocols should also work with HarvestMan.

4.2. What kind of files can be downloaded by HarvestMan?
HarvestMan can download **any** kind of file as long as it is served up by a web-server using HTTP/FTP/HTTPS. There are no restrictions on the type of file or the size of a single file.

HarvestMan assumes that URLs with the following extensions are web-pages, static or dynamic:

'.htm', '.html', '.shtm', '.shtml', '.php', '.php3', '.php4', '.asp', '.aspx', '.jsp', '.psp', '.pl', '.cgi', '.stx', '.cfm', '.cfml', '.cms'

A URL with no extension is also assumed to be a web-page. However, the program has a mechanism by which it looks at the headers of the HTTP response and figures out the actual file type of the URL by doing a mimetype analysis. This happens immediately after the HTTP request is answered by the server. So if the program finds that the assumed type of a URL is different from the actual type, it sets the type correctly at this point.

You can restrict download of certain files by creating specific filters for HarvestMan. These are described in the download control sections below.
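The two-step type detection described above (guess "web-page" from the URL extension, then correct the guess once the server's Content-Type header is known) can be sketched roughly as follows. This is an illustrative Python 2 sketch, matching the Python version the FAQ targets; the helper names used here are made up for the example and are not HarvestMan's actual API.

```
# Minimal sketch of extension-based guessing plus Content-Type correction.
# Names (WEBPAGE_EXTNS, assumed_webpage, corrected_webpage) are illustrative only.

import posixpath
from urlparse import urlparse    # Python 2.x, matching the rest of the code base

# Extensions the FAQ lists as "assumed to be web-pages"
WEBPAGE_EXTNS = ('.htm', '.html', '.shtm', '.shtml', '.php', '.php3', '.php4',
                 '.asp', '.aspx', '.jsp', '.psp', '.pl', '.cgi', '.stx',
                 '.cfm', '.cfml', '.cms')

def assumed_webpage(url):
    """ Initial guess: web-page if the extension is in the list above, or missing """
    path = urlparse(url)[2]
    ext = posixpath.splitext(path)[1].lower()
    return ext == '' or ext in WEBPAGE_EXTNS

def corrected_webpage(content_type):
    """ Once the server answers, the Content-Type header settles the question """
    return content_type.split(';')[0].strip().lower() in ('text/html',
                                                          'application/xhtml+xml')

# A '.php' URL is assumed to be a web-page; if the server actually returns an
# image, the guess is corrected before any parsing is attempted.
print assumed_webpage('http://www.foo.com/photo.php')      # True
print corrected_webpage('image/png; charset=binary')       # False
```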
A related question is the HTML tags supported by HarvestMan, using which it downloads files. These are listed below, followed by a small parsing sketch.

 1. Hypertext links of the form `<a href="...">`.
 2. Image links of the form `<img src="...">`.
 3. Stylesheets of the form `<link rel="stylesheet" href="...">`.
 4. Javascript source files of the form `<script src="...">`.
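As a rough illustration only (this is not HarvestMan's own parser, which lives in its pageparser module under harvestman/lib), the four tag/attribute pairs above can be collected with the standard HTMLParser module, the same module the datafilter plugin elsewhere in this repository uses:

```
# Illustrative sketch: collect the link-carrying attributes of the four tag types
# listed above with the standard library HTMLParser module (Python 2.x).
# HarvestMan's real parsers handle many more tags and corner cases.

from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):

    # tag -> attribute that carries the URL
    LINK_ATTRS = {'a': 'href', 'img': 'src', 'link': 'href', 'script': 'src'}

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTRS.get(tag)
        if not wanted:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append((tag, value))

p = LinkCollector()
p.feed('<a href="a.html">x</a><img src="b.png"/>'
       '<link rel="stylesheet" href="c.css"/><script src="d.js"></script>')
p.close()
print p.links
# [('a', 'a.html'), ('img', 'b.png'), ('link', 'c.css'), ('script', 'd.js')]
```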
diff --git a/HarvestMan-lite/harvestman/bugs/soskut_hu_index.html b/HarvestMan-lite/harvestman/bugs/soskut_hu_index.html
deleted file mode 100755
index 20fe7b4..0000000
--- a/HarvestMan-lite/harvestman/bugs/soskut_hu_index.html
+++ /dev/null
@@ -1,255 +0,0 @@
- Sóskút
- - - diff --git a/HarvestMan-lite/harvestman/bugs/test_808.py b/HarvestMan-lite/harvestman/bugs/test_808.py deleted file mode 100755 index 251830d..0000000 --- a/HarvestMan-lite/harvestman/bugs/test_808.py +++ /dev/null @@ -1,40 +0,0 @@ -# Demoing fix for #808. -# 808: Crawler should try and parse links in "select" options in HTML -# forms. -# Bug: http://trac.eiao.net/cgi-bin/trac.cgi/ticket/808 -import sys -sys.path.append('..') -from lib import pageparser -from lib import config -from lib import logger -from lib.common.common import * - - -SetAlias(config.HarvestManStateObject()) -SetAlias(logger.HarvestManLogger()) - -# First parse with sgmlop parser with option parsing disabled... -print 'Testing with sgmlop parser...' -p = pageparser.HarvestManSGMLOpParser() -p.feed(open('s_municipaux.htm').read()) -print 'Asserting link count with option tag disabled...' -assert(len(p.links)==18) - -# Now turn on option tag parsing -p.enable_feature('option') -p.feed(open('s_municipaux.htm').read()) -print 'Asserting link count with option tag enabled...' -assert(len(p.links)==31) - -print 'Testing with pure Python parser...' -p = pageparser.HarvestManSimpleParser() -p.disable_feature('option') -p.feed(open('s_municipaux.htm').read()) -print 'Asserting link count with option tag disabled...' -assert(len(p.links)==18) - -# Now turn on option tag parsing -p.enable_feature('option') -p.feed(open('s_municipaux.htm').read()) -print 'Asserting link count with option tag enabled...' -assert(len(p.links)==31) diff --git a/HarvestMan-lite/harvestman/bugs/test_812.py b/HarvestMan-lite/harvestman/bugs/test_812.py deleted file mode 100755 index 36cad66..0000000 --- a/HarvestMan-lite/harvestman/bugs/test_812.py +++ /dev/null @@ -1,49 +0,0 @@ -# Demoing fix for EIAO bug #812. -# 812: Crawler does not identify links with arguments containing "#". -# Bug: http://trac.eiao.net/cgi-bin/trac.cgi/ticket/812 - -import sys -sys.path.append('..') -from lib import pageparser -from lib import config -from lib import logger -from lib.common.common import * -from lib import urltypes - -class Url(str): - - def __init__(self, link): - self.url = link[1] - self.typ = link[0] - - def __eq__(self, item): - return item == self.url - -SetAlias(config.HarvestManStateObject()) -SetAlias(logger.HarvestManLogger()) - -cfg = objects.config -cfg.getquerylinks = True - -p = pageparser.HarvestManSGMLOpParser() -p.feed(open('soskut_hu_index.html').read()) - -urls = [] -for link in p.links: - urls.append(Url(link)) - -print urls - -test_urls = ['?module=municip#MIDDLE', - '?module=institutes#MIDDLE', - '?module=regulations#MIDDLE', - '?module=events#MIDDLE'] - -for turl in test_urls: - print 'Asserting',turl - assert(turl in urls) - -for url in urls: - if url in test_urls: - print 'Asserting type of',turl - assert(url.typ == urltypes.URL_TYPE_ANY and url.typ != urltypes.URL_TYPE_ANCHOR) diff --git a/HarvestMan-lite/harvestman/dev/__init__.py b/HarvestMan-lite/harvestman/dev/__init__.py deleted file mode 100755 index e69de29..0000000 diff --git a/HarvestMan-lite/harvestman/dev/filethread.py b/HarvestMan-lite/harvestman/dev/filethread.py deleted file mode 100755 index f6d1b80..0000000 --- a/HarvestMan-lite/harvestman/dev/filethread.py +++ /dev/null @@ -1,112 +0,0 @@ -# -- coding: utf-8 -""" filethread.py - File saver thread module. - This module is part of the HarvestMan program. - - Author: Anand B Pillai - - Copyright (C) 2007 Anand B Pillai. - -""" - -# Currently no code from this module is being used anywhere -# in the program. 
- -import threading -from common.common import * -from common.singleton import Singleton -import sys, os -import shutil -from Queue import Queue - -class FileQueue(Queue, Singleton): - """ File saver queue class """ - - def push(self, filename, directory, url, datastring): - self.put((filename, directory, url, datastring)) - -class HarvestManFileThread(threading.Thread): - """ File saver thread """ - - def __init__(self): - self.q = FileQueue.getInstance() - self._flag = False - self._cfg = objects.config - threading.Thread.__init__(self, None, None, 'Saver') - - def _write_url_filename(self, data, filename): - """ Write downloaded data to the passed file """ - - try: - extrainfo('Writing file ', filename) - f=open(filename, 'wb') - # print 'Data len=>',len(self._data) - f.write(data.getvalue()) - f.close() - except IOError,e: - debug('IO Exception' , str(e)) - return 0 - except ValueError, e: - return 0 - - return 1 - - def stop(self): - self._flag = True - - def run(self): - - while not self._flag: - item = self.q.get() - if item: - filename, directory, url, datastring = item - if self.create_local_directory(directory) == 0: - self._write_url_filename( datastring, filename ) - else: - extrainfo("Error in creating local directory for", url) - - def create_local_directory(self, directory): - """ Create the directories on the disk named 'directory' """ - - # new in 1.4.5 b1 - No need to create the - # directory for raw saves using the nocrawl - # option. - if self._cfg.rawsave: - return 0 - - try: - # Fix for EIAO bug #491 - # Sometimes, however had we try, certain links - # will be saved as files, whereas they might be - # in fact directories. In such cases, check if this - # is a file, then create a folder of the same name - # and move the file as index.html to it. 
- path = directory - while path: - if os.path.isfile(path): - # Rename file to file.tmp - fname = path - os.rename(fname, fname + '.tmp') - # Now make the directory - os.makedirs(path) - # If successful, move the renamed file as index.html to it - if os.path.isdir(path): - fname = fname + '.tmp' - shutil.move(fname, os.path.join(path, 'index.html')) - - path2 = os.path.dirname(path) - # If we hit the root, break - if path2 == path: break - path = path2 - - if not os.path.isdir(directory): - os.makedirs( directory ) - extrainfo("Created => ", directory) - return 0 - except OSError: - moreinfo("Error in creating directory", directory) - return -1 - - return 0 - - - diff --git a/HarvestMan-lite/harvestman/dev/sqlite_test.py b/HarvestMan-lite/harvestman/dev/sqlite_test.py deleted file mode 100755 index 58a9fe8..0000000 --- a/HarvestMan-lite/harvestman/dev/sqlite_test.py +++ /dev/null @@ -1,28 +0,0 @@ -import sqlite3 - -class Point(object): - - def __init__(self, x, y): - self.x, self.y = x, y - - def __conform__(self, protocol): - if protocol is sqlite3.PrepareProtocol: - return '%f;%f' % (self.x, self.y) - -con = sqlite3.connect("test") -c = con.cursor() - -p = Point(5.0, 3.5) - -c.execute("drop table points") -c.execute("create table points (point text)") -#cur.execute("select ?", (p,)) -#print cur.fetchone()[0] -c.execute("insert into points values (?)", (p,)) - -c.execute("select * from points") -print c.fetchall() - -c.close() - - diff --git a/HarvestMan-lite/harvestman/dev/sqlite_test2.py b/HarvestMan-lite/harvestman/dev/sqlite_test2.py deleted file mode 100755 index 5b143bd..0000000 --- a/HarvestMan-lite/harvestman/dev/sqlite_test2.py +++ /dev/null @@ -1,24 +0,0 @@ -import sqlite3 -import datetime, time - -def adapt_datetime(ts): - return time.mktime(ts.timetuple()) - -sqlite3.register_adapter(datetime.datetime, adapt_datetime) - -con = sqlite3.connect("test") -c = con.cursor() - -now = datetime.datetime.now() -c.execute("drop table if exists times") -c.execute("create table times (time real)") -#cur.execute("select ?", (p,)) -#print cur.fetchone()[0] -c.execute("insert into times values (?)", (now,)) - -c.execute("select * from times") -print c.fetchall() - -c.close() - - diff --git a/HarvestMan-lite/harvestman/dev/sqlite_test3.py b/HarvestMan-lite/harvestman/dev/sqlite_test3.py deleted file mode 100755 index 25f6d5c..0000000 --- a/HarvestMan-lite/harvestman/dev/sqlite_test3.py +++ /dev/null @@ -1,27 +0,0 @@ -import sqlite3 -import datetime, time - -def adapt_datetime(ts): - return time.mktime(ts.timetuple()) - -sqlite3.register_adapter(datetime.datetime, adapt_datetime) - -con = sqlite3.connect("test") -c = con.cursor() - -c.execute("drop table if exists projects") -c.execute("create table projects (id integer primary key autoincrement default 0, date real, project text)") -#cur.execute("select ?", (p,)) -#print cur.fetchone()[0] -c.execute("insert into projects (date, project) values (?, ?)", (datetime.datetime.now(), 'project1')) -time.sleep(1.0) -c.execute("insert into projects (date, project) values (?, ?)", (datetime.datetime.now(), 'project2')) -time.sleep(1.0) -c.execute("insert into projects (date, project) values (?, ?)", (datetime.datetime.now(), 'project3')) - -c.execute("select max(id) from projects") -print c.fetchone()[0] - -c.close() - - diff --git a/HarvestMan-lite/harvestman/dev/sqlite_test4.py b/HarvestMan-lite/harvestman/dev/sqlite_test4.py deleted file mode 100755 index 90b7690..0000000 --- a/HarvestMan-lite/harvestman/dev/sqlite_test4.py +++ /dev/null @@ -1,17 
+0,0 @@ -import sqlite3 -import datetime, time - -def adapt_datetime(ts): - return time.mktime(ts.timetuple()) - -sqlite3.register_adapter(datetime.datetime, adapt_datetime) - -con = sqlite3.connect("/home/anand/.harvestman/db/crawls.db") -c = con.cursor() - -c.execute("select * from project_stats") -print c.fetchall() - -c.close() - - diff --git a/HarvestMan-lite/harvestman/ext/__init__.py b/HarvestMan-lite/harvestman/ext/__init__.py deleted file mode 100755 index e69de29..0000000 diff --git a/HarvestMan-lite/harvestman/ext/datafilter.py b/HarvestMan-lite/harvestman/ext/datafilter.py deleted file mode 100755 index 3a47587..0000000 --- a/HarvestMan-lite/harvestman/ext/datafilter.py +++ /dev/null @@ -1,85 +0,0 @@ -# -- coding: utf-8 -""" Data filter plugin example based on the -simulator plugin for HarvestMan. This -plugin changes the behaviour of HarvestMan -to only simulate crawling without actually -downloading anything. In addition, it shows -how to get access to the data downloaded by the crawler, -to implement various kinds of data filters. - -Author: Anand B Pillai - -Created Feb 7 2007 Anand B Pillai -Modified Nov 2 2007 by: Nils Ulltveit-Moe - - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -from HTMLParser import HTMLParser - -class MyHTMLParser(HTMLParser): - # Example on a HTML parser, to filter img tags - - def handle_starttag(self, tag, attrs): - - # This just prints the image tag and its attributes - if tag=="img": - print tag,attrs - -def process_url(self, data): - """ Post process url callback test """ - # This shows how to get access to the - # downloaded HTML document that is being processed. - # Data is either HTML document or None - if data: - p = MyHTMLParser() - p.feed(data) - - return data - -def save_url(self, urlobj): - - # For simulation, we need to modify the behaviour - # of save_url function in HarvestManUrlConnector class. - # This is achieved by injecting this function as a plugin - # Note that the signatures of both functions have to - # be the same. - url = urlobj.get_full_url() - self.connect(urlobj, True, self._cfg.retryfailed) - - return 6 - -def apply_plugin(): - """ All plugin modules need to define this method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. - - cfg = objects.config - cfg.simulate = True - cfg.localise = 0 - - # Dummy function that does not really write the mirrored files. - hooks.register_plugin_function('connector:save_url_plugin', save_url) - - # Hook to get access to the downloaded data after process_url has been called. - hooks.register_post_callback_method('crawler:fetcher_process_url_callback', - process_url) - # Turn off caching, since no files are saved - cfg.pagecache = 0 - # Turn off header dumping, since no files are saved - cfg.urlheaders = 0 - logconsole('Simulation mode turned on. 
Crawl will be simulated and no files will be saved.') diff --git a/HarvestMan-lite/harvestman/ext/lucene.py b/HarvestMan-lite/harvestman/ext/lucene.py deleted file mode 100755 index d934afb..0000000 --- a/HarvestMan-lite/harvestman/ext/lucene.py +++ /dev/null @@ -1,132 +0,0 @@ -# -- coding: utf-8 -""" Lucene plugin to HarvestMan. This plugin modifies the -behaviour of HarvestMan to create an index of crawled -webpages by using PyLucene. - -Author: Anand B Pillai - -Created Aug 7 2007 Anand B Pillai - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -import PyLucene -import sys, os -import time - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -class PorterStemmerAnalyzer(object): - - def tokenStream(self, fieldName, reader): - - result = PyLucene.StandardTokenizer(reader) - result = PyLucene.StandardFilter(result) - result = PyLucene.LowerCaseFilter(result) - result = PyLucene.PorterStemFilter(result) - result = PyLucene.StopFilter(result, PyLucene.StopAnalyzer.ENGLISH_STOP_WORDS) - - return result - -def create_index(self, arg): - """ Post download setup callback for creating a lucene index """ - - moreinfo("Creating lucene index") - storeDir = "index" - if not os.path.exists(storeDir): - os.mkdir(storeDir) - - store = PyLucene.FSDirectory.getDirectory(storeDir, True) - - self.lucene_writer = PyLucene.IndexWriter(store, PyLucene.StandardAnalyzer(), True) - # Uncomment this line to enable a PorterStemmer analyzer - # self.lucene_writer = PyLucene.IndexWriter(store, PorterStemmerAnalyzer(), True) - self.lucene_writer.setMaxFieldLength(1048576) - - count = 0 - - urllist = [] - - for node in self._urldb.preorder(): - urlobj = node.get() - - # Only index if web-page or document - if not urlobj.is_webpage() and not urlobj.is_document(): continue - - filename = urlobj.get_full_filename() - url = urlobj.get_full_url() - - try: - urllist.index(urlobj.index) - continue - except ValueError: - urllist.append(urlobj.index) - - if not os.path.isfile(filename): continue - - data = '' - - moreinfo('Adding index for URL',url) - - try: - data = unicode(open(filename).read(), 'iso-8859-1') - except UnicodeDecodeError, e: - data = '' - - try: - doc = PyLucene.Document() - doc.add(PyLucene.Field("name", 'file://' + filename, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.UN_TOKENIZED)) - doc.add(PyLucene.Field("path", url, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.UN_TOKENIZED)) - if data and len(data) > 0: - doc.add(PyLucene.Field("contents", data, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.TOKENIZED)) - else: - extrainfo("warning: no content in %s" % filename) - - self.lucene_writer.addDocument(doc) - except PyLucene.JavaError, e: - print e - - count += 1 - - moreinfo('Created lucene index for %d documents' % count) - moreinfo('Optimizing lucene index') - self.lucene_writer.optimize() - self.lucene_writer.close() - -def apply_plugin(): - """ Apply the plugin - overrideable method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook/plugin function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. 
- - cfg = objects.config - - hooks.register_post_callback_method('datamgr:post_download_setup_callback', - create_index) - #logger.disableConsoleLogging() - # Turn off session-saver feature - cfg.savesessions = False - # Turn off interrupt handling - # cfg.ignoreinterrupts = True - # No need for localising - cfg.localise = 0 - # Turn off image downloading - cfg.images = 0 - # Turn off caching - cfg.pagecache = 0 diff --git a/HarvestMan-lite/harvestman/ext/lucene/IndexFiles.py b/HarvestMan-lite/harvestman/ext/lucene/IndexFiles.py deleted file mode 100755 index 789aa95..0000000 --- a/HarvestMan-lite/harvestman/ext/lucene/IndexFiles.py +++ /dev/null @@ -1,85 +0,0 @@ -# -- coding: utf-8 -#!/usr/bin/env python - -import sys, os, PyLucene, threading, time -from datetime import datetime - -""" -This class is loosely based on the Lucene (java implementation) demo class -org.apache.lucene.demo.IndexFiles. It will take a directory as an argument -and will index all of the files in that directory and downward recursively. -It will index on the file path, the file name and the file contents. The -resulting Lucene index will be placed in the current directory and called -'index'. -""" - -class Ticker(object): - - def __init__(self): - self.tick = True - - def run(self): - while self.tick: - sys.stdout.write('.') - sys.stdout.flush() - time.sleep(1.0) - -class IndexFiles(object): - """Usage: python IndexFiles """ - - def __init__(self, root, storeDir, analyzer): - - if not os.path.exists(storeDir): - os.mkdir(storeDir) - store = PyLucene.FSDirectory.getDirectory(storeDir, True) - writer = PyLucene.IndexWriter(store, analyzer, True) - writer.setMaxFieldLength(1048576) - self.indexDocs(root, writer) - ticker = Ticker() - print 'optimizing index', - threading.Thread(target=ticker.run).start() - writer.optimize() - writer.close() - ticker.tick = False - print 'done' - - def indexDocs(self, root, writer): - for root, dirnames, filenames in os.walk(root): - for filename in filenames: - #if not filename.endswith('.txt'): - # continue - print "adding", filename - try: - path = os.path.join(root, filename) - file = open(path) - contents = unicode(file.read(), 'iso-8859-1') - file.close() - doc = PyLucene.Document() - doc.add(PyLucene.Field("name", filename, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.UN_TOKENIZED)) - doc.add(PyLucene.Field("path", path, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.UN_TOKENIZED)) - if len(contents) > 0: - doc.add(PyLucene.Field("contents", contents, - PyLucene.Field.Store.YES, - PyLucene.Field.Index.TOKENIZED)) - else: - print "warning: no content in %s" % filename - writer.addDocument(doc) - except Exception, e: - print "Failed in indexDocs:", e - -if __name__ == '__main__': - if len(sys.argv) < 2: - print IndexFiles.__doc__ - sys.exit(1) - print 'PyLucene', PyLucene.VERSION, 'Lucene', PyLucene.LUCENE_VERSION - start = datetime.now() - try: - IndexFiles(sys.argv[1], "index", PyLucene.StandardAnalyzer()) - end = datetime.now() - print end - start - except Exception, e: - print "Failed: ", e diff --git a/HarvestMan-lite/harvestman/ext/lucene/SearchFiles.py b/HarvestMan-lite/harvestman/ext/lucene/SearchFiles.py deleted file mode 100755 index a9bfa0a..0000000 --- a/HarvestMan-lite/harvestman/ext/lucene/SearchFiles.py +++ /dev/null @@ -1,40 +0,0 @@ -# -- coding: utf-8 -#!/usr/bin/env python -from PyLucene import QueryParser, IndexSearcher, StandardAnalyzer, FSDirectory -from PyLucene import VERSION, LUCENE_VERSION - -""" -This script is loosely based on the Lucene (java 
implementation) demo class -org.apache.lucene.demo.SearchFiles. It will prompt for a search query, then it -will search the Lucene index in the current directory called 'index' for the -search query entered against the 'contents' field. It will then display the -'path' and 'name' fields for each of the hits it finds in the index. Note that -search.close() is currently commented out because it causes a stack overflow in -some cases. -""" -def run(searcher, analyzer): - - while True: - print - print "Hit enter with no input to quit." - command = raw_input("Query:") - if command == '': - return - - print - print "Searching for:", command - query = QueryParser("contents", analyzer).parse(command) - hits = searcher.search(query) - print "%s total matching documents" % hits.length() - - for i, doc in hits: - print 'path:', doc.get("path"), 'name:', doc.get("name"), 100*hits.score(i) - -if __name__ == '__main__': - STORE_DIR = "index" - print 'PyLucene', VERSION, 'Lucene', LUCENE_VERSION - directory = FSDirectory.getDirectory(STORE_DIR, False) - searcher = IndexSearcher(directory) - analyzer = StandardAnalyzer() - run(searcher, analyzer) - searcher.close() diff --git a/HarvestMan-lite/harvestman/ext/simulator.py b/HarvestMan-lite/harvestman/ext/simulator.py deleted file mode 100755 index 66bac2d..0000000 --- a/HarvestMan-lite/harvestman/ext/simulator.py +++ /dev/null @@ -1,57 +0,0 @@ -# -- coding: utf-8 -""" Simulator plugin for HarvestMan. This -plugin changes the behaviour of HarvestMan -to only simulate crawling without actually -downloading anything. - -Author: Anand B Pillai - -Created Feb 7 2007 Anand B Pillai - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -from harvestman.lib import hooks -from harvestman.lib.common.common import * -from harvestman.lib.common.macros import CONNECTOR_DATA_MODE_INMEM - -def save_url(self, urlobj): - - # For simulation, we need to modify the behaviour - # of save_url function in HarvestManUrlConnector class. - # This is achieved by injecting this function as a plugin - # Note that the signatures of both functions have to - # be the same. - - url = urlobj.get_full_url() - self.connect(urlobj, True, self._cfg.retryfailed) - - return 6 - -def apply_plugin(): - """ All plugin modules need to define this method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. - - cfg = objects.config - cfg.simulate = True - cfg.localise = 0 - hooks.register_plugin_function('connector:save_url_plugin', save_url) - # Turn off caching, since no files are saved - cfg.pagecache = 0 - # Turn off header dumping, since no files are saved - cfg.urlheaders = 0 - # For simulator, we need in-mem data mode - # since files are never saved! - cfg.datamode = CONNECTOR_DATA_MODE_INMEM - logconsole('Simulation mode turned on. Crawl will be simulated and no files will be saved.') diff --git a/HarvestMan-lite/harvestman/ext/spam.py b/HarvestMan-lite/harvestman/ext/spam.py deleted file mode 100755 index 3435f8d..0000000 --- a/HarvestMan-lite/harvestman/ext/spam.py +++ /dev/null @@ -1,34 +0,0 @@ -# -- coding: utf-8 -""" Test plugin for HarvestMan. This demonstrates -how to write a simple plugin based on callbacks. 
- -Author: Anand B Pillai - -Created Feb 7 2007 Anand B Pillai - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -def func(self): - print 'Before running projects...' - -def apply_plugin(): - """ All plugin modules need to define this method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. - - hooks.register_pre_callback_method('harvestman:run_projects_callback', func) - diff --git a/HarvestMan-lite/harvestman/ext/swish-e.py b/HarvestMan-lite/harvestman/ext/swish-e.py deleted file mode 100755 index fc8c5c6..0000000 --- a/HarvestMan-lite/harvestman/ext/swish-e.py +++ /dev/null @@ -1,115 +0,0 @@ -# -- coding: utf-8 -""" Swish-e plugin to HarvestMan. This plugin modifies the -behaviour of HarvestMan to work as an external crawler program -for the swish-e search engine {http://swish-e.org} - -The data format is according to the guidelines given -at http://swish-e.org/docs/swish-run.html#indexing. - -Author: Anand B Pillai - -Created Feb 8 2007 Anand B Pillai -Modified Feb 17 2007 Anand B Pillai Modified logic to use callbacks - instead of hooks. The logic is - in a post callback registered - at context crawler:fetcher_process_url_callback. - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -import sys, os -import time -from types import StringTypes - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -urllist = [] - -def process_url(self, data): - """ Post process url callback for swish-e """ - - if (type(data) in StringTypes) and len(data): - global urllist - urllist.append(self._urlobject.get_full_url()) - - try: - data = data.encode('ascii', 'ignore') - l = len(data) - s = '' - - # Code which works for www.python.org/doc/current/tut/tut.html - # and for swish-e.org/docs - # if len(data) != len(data.strip()): - # data = data.strip() - # l = len(data) + 1 - - add = 0 - if l != len(data.strip()): - # print l, len(data.strip()), self._urlobject.get_full_url() - data = data.strip() - l = len(data) + 1 - # print l - - if self.wp.can_index: - s ="Path-Name:%s\nContent-Length:%d\n\n%s" % (self._urlobject.get_full_url(), - l, - data) - # Swish-e seems to be very sensitive to any additional - # blank lines between content and headers. So stripping - # the data of trailing and preceding newlines is important. - # print data.strip() - try: - print str(s) - except IOError, e: - # global urllist - # open('err.out','w').write('\n'.join(urllist)) - objects.queuemgr.endloop() - - return data - except UnicodeDecodeError, e: - # open('uni.out','a').write(self._urlobject.get_full_url() + '\n') - return data - - return data - - -def apply_plugin(): - """ Apply the plugin - overrideable method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook/plugin function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. 
- - cfg = objects.config - - # Makes sense to activate the callback only if swish-integration - # is enabled. - hooks.register_post_callback_method('crawler:fetcher_process_url_callback', - process_url) - # Turn off caching, since no files are saved - cfg.pagecache = 0 - # Turn off console-logging - logger = objects.logger - #logger.disableConsoleLogging() - # Turn off session-saver feature - cfg.savesessions = False - # Turn off interrupt handling - # cfg.ignoreinterrupts = True - # No need for localising - cfg.localise = 0 - # Turn off image downloading - cfg.images = 0 - # Increase sleep time - cfg.sleeptime = 1.0 - # sys.stderr = open('swish-errors.txt','wb') - # cfg.maxtrackers = 2 - cfg.usethreads = 0 diff --git a/HarvestMan-lite/harvestman/ext/swish-e/HOWTO.swish-e b/HarvestMan-lite/harvestman/ext/swish-e/HOWTO.swish-e deleted file mode 100755 index 6735e42..0000000 --- a/HarvestMan-lite/harvestman/ext/swish-e/HOWTO.swish-e +++ /dev/null @@ -1,122 +0,0 @@ -Using HarvestMan with swish-e ------------------------------ -HarvestMan can be used as an external crawler program for swish-e -indexer {http://www.swish-e.org}. The swish-e support for -HarvestMan is built into the swish-e plugin present in the plugins -folder. - -Swish-e configuration ---------------------- -In order to use swish-e with HarvestMan, an appropriate configuration -file needs to be generated. A sample configuration file is available -in this folder as swish-config.conf. Typically this configuration -file only contains two directives - -IndexDir -SwishProgParameters - -"IndexDir" is the path to the external crawler program. If HarvestMan -is installed in your machine, this would be "harvesttman". If the -PATH where HarvestMan is present is not part of the PATH environment -variable, you need to specify the full path. - -"SwishProgParameters" is the parameters required for the external -program. Here you can specify the parameters required for HarvestMan. - - -HarvestMan configuration for swish-e ------------------------------------- -In HarvestMan, there are two ways to load plugins like swish-e. -Either the plugin can be given as a command-line parameter using the --g/--plugins option, or it can be specified in the configuration file -by editing the "plugins" element and adding an appopriate plugin -element with its "enable" attribute set to 1. For more information -read the HOWTO.plugins document in the "doc" folder. - -There are also two ways to pass URL and other options. The suggested -way is to create an appropriate configuration file and put all the -options there. If the file is the default 'config.xml' present in -the current directory or the user's .harvestman directory, there is -no need to specify this file. In such case, "SwishProgParameters" -is empty and should not be specified. In this case the swish configuration -file will look like, - -IndexDir harvestman - -However, if the configuration file name is different, it has to be -passed to HarvestMan with the -C option. In order to enable swish-e, -the "enable" attribute of the swish-e plugin element should be set to -1 in this file. In this case the swish configuration file will look like, - -IndexDir harvestman -SwishProgParameters -C - -The other way is to specify a URL and other options in the command line -and pass it to HarvestMan. This typically can be used for the simplest -crawl which do not require a lot of customization. 
For example, - -IndexDir harvestman -SwishProgParameters -g swish-e http://swish-e.org/docs/ - -The last line instructs HarvestMan to crawl http://www.swish-e.org/docs . -Swish-e will in turn index the content of files contained at ths URL. - -NOTE: If you have more than three parameters to customize it is better to -use a configuration file than specifying them on the command line. - -Running directly from source ----------------------------- -In case you prefer to run HarvestMan directly from the source tree -with swish-e without installing it, the above mentioned configuration -would not work. - -In this case there are two ways of writing the configuration. The simplest -way is to make the harvestman.py module executable and use the -following configuration. - -IndexDir /harvestman.py -SwishProgParameters - -where is the relative path to where HarvestMan source code is -present. If it is the current directory, this would be '.'. - -The second way is to run harvestman.py as an argument to Python. In -this case the following configuration need to be used. - -IndexDir python -SwishProgParameters /harvestman.py - -In this case, the main program becomes Python and path to harvestman.py -is passed as the first part of SwishProgParameters param value. - -Running swish-e ---------------- -Once the appropriate swish configuration file is written, swish-e can -be run with HarvestMan as follows - -swish-e -c -S prog - -Once crawling and indexing starts, swish-e prints an output like, - -$ swish-e -c swish-config.cong -S prog - -Indexing Data Source: "External-Program" -Indexing "harvestman" -External Program found: /usr/bin/harvestman - -If everything goes well, the indexing will terminate soon after -the crawling is completed and an index summary is printed. - - - - - - - - - - - - - - diff --git a/HarvestMan-lite/harvestman/ext/swish-e/README.txt b/HarvestMan-lite/harvestman/ext/swish-e/README.txt deleted file mode 100755 index 8b37e2d..0000000 --- a/HarvestMan-lite/harvestman/ext/swish-e/README.txt +++ /dev/null @@ -1,2 +0,0 @@ -This folder contains sample files/code which demonstrates -the usage of plugins with HarvestMan. \ No newline at end of file diff --git a/HarvestMan-lite/harvestman/ext/swish-e/swish-config.conf b/HarvestMan-lite/harvestman/ext/swish-e/swish-config.conf deleted file mode 100755 index 8cb4ba5..0000000 --- a/HarvestMan-lite/harvestman/ext/swish-e/swish-config.conf +++ /dev/null @@ -1,10 +0,0 @@ -## Sample configuration file for HarvestMan integration with swish-e. -## See http://swish-e.org/docs/swish-run.html#indexing for more information. - -# Indexing program to use -IndexDir ./harvestman.py -# Parameters to pass to the Indexing program -# Change the last parameter to your own URL or configuration file. -# SwishProgParameters -g swish-e http://swish-e.org/docs -SwishProgParameters -g swish-e http://www.python.org/doc/current/ -# SwishProgParameters -g swish-e http://www.woogroups.com diff --git a/HarvestMan-lite/harvestman/ext/userbrowse.py b/HarvestMan-lite/harvestman/ext/userbrowse.py deleted file mode 100755 index 5071b36..0000000 --- a/HarvestMan-lite/harvestman/ext/userbrowse.py +++ /dev/null @@ -1,53 +0,0 @@ -# -- coding: utf-8 -""" User browse plugin. Simulate a scenario of a user -browsing a web-page. 
- -(Requested by Roy Cheeran) - -Author: Anand B Pillai - -Created Aug 13 2007 Anand B Pillai - -Copyright (C) 2007 Anand B Pillai - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -from harvestman.lib import hooks -from harvestman.lib.common.common import * - -# User browsing plugin approximates how a webpage -# presents itself to a user. This means a few things -# -# 1. All images and stylesheets referenced by the page are fetched. -# 2. In addition, all links directly linked from the page are -# fetched and saved to disk. Nothing further is crawled. -# -# This is done by using a fetchlevel control of 2, a depth -# control of 0, and allowing images & stylesheets to skip -# constraints. - -def apply_plugin(): - """ Apply the plugin - overrideable method """ - - # This method is expected to perform the following steps. - # 1. Register the required hook/plugin function - # 2. Get the config object and set/override any required settings - # 3. Print any informational messages. - - # The first step is required, the last two are of course optional - # depending upon the required application of the plugin. - - cfg = objects.config - # Set depth to 0 - cfg.depth = 0 - # Set fetchlevel to 2 - cfg.fetchlevel = 2 - # Images & stylesheets will skip rules - cfg.skipruletypes = ['image','stylesheet'] - # One might have to set robots to 0 - # sometimes to fetch images - uncomment this - # in such a case. - # cfg.robots = 0 diff --git a/HarvestMan-lite/harvestman/lib/__init__.py b/HarvestMan-lite/harvestman/lib/__init__.py deleted file mode 100755 index e69de29..0000000 diff --git a/HarvestMan-lite/harvestman/lib/common/__init__.py b/HarvestMan-lite/harvestman/lib/common/__init__.py deleted file mode 100755 index e69de29..0000000 diff --git a/HarvestMan-lite/harvestman/lib/common/bst.py b/HarvestMan-lite/harvestman/lib/common/bst.py deleted file mode 100755 index edf7f03..0000000 --- a/HarvestMan-lite/harvestman/lib/common/bst.py +++ /dev/null @@ -1,544 +0,0 @@ -""" -bst.py - Basic binary search tree in Python with automated disk caching at -the nodes. This is not a full-fledged implementation since it does not -implement node deletion, tree balancing etc. - -Created Anand B Pillai Feb 13 2008 -Modified Anand B Pillai Make BST use bsddb caching (experimental!) - -Copyright (C) 2008, Anand B Pillai. - -""" - -import cPickle -import math -import sys -import os -import shutil -import weakref -import bsddb - -from dictcache import DictCache - -class BSTNode(dict): - """ Node class for a BST """ - - def __init__(self, key, val, left=None, right=None, tree=None): - self.key = key - self[key] = val - self['left'] = left - self['right'] = right - # Mode flag - # 0 => mem - # 1 => disk - self.mode = 0 - # Number of gets - self.cgets = 0 - # Number of loads - self.cloads = 0 - # Link back to the tree - self.tree = weakref.proxy(tree) - - def __getitem__(self, key): - - try: - return super(BSTNode, self).__getitem__(key) - except KeyError: - return None - - def set(self, value): - self[self.key] = value - if self.mode == 1: - # Already dumped - self.mode = 0 - self.dump() - - def get(self): - - if self.mode==0: - self.cgets += 1 - return self[self.key] - else: - self.cloads += 1 - self.load() - return self[self.key] - - def is_balanced(self, level=1): - - # Return if this node is balanced - # The node balance check is done per - # level. Default is 1 which means we - # check whether this node has both left - # and right children. 
If level is 2, this - # is done at one more level, i.e for the - # child nodes also... - - # Leaf node is not balanced... - if self['left']==None and self['right']==None: - return False - - while level>0: - level -= 1 - - if self['left'] !=None and self['right'] != None: - if level: - return self['left'].is_balanced(level) and \ - self['right'].is_balanced(level) - else: - return True - else: - return False - - return False - - def load(self, recursive=False): - - # Load values from disk - try: - # Don't load if mode is 0 and value is not None - if self.mode==1 and self[self.key] == None: - self[self.key] = self.tree.from_cache(self.key) - self.mode = 0 - - if recursive: - left = self['left'] - if left: left.load(True) - right = self['right'] - if right: right.load(True) - - except Exception, e: - raise - - def dump(self, recursive=False): - - try: - if self.mode==0 and self[self.key] != None: - self.tree.to_cache(self.key, self[self.key]) - self[self.key]=None - self.mode = 1 - else: - # Dont do anything - pass - - if recursive: - left = self['left'] - if left: left.dump(True) - right = self['right'] - if right: right.dump(True) - - except Exception, e: - raise - - def clear(self): - - # Clear removes the data from memory as well as from disk - try: - del self[self.key] - except KeyError: - pass - - left = self['left'] - right = self['right'] - - if left: - left.clear() - if right: - right.clear() - - super(BSTNode, self).clear() - -class BST(object): - """ BST class with automated disk caching of node values """ - - # Increase the recursion limit for large trees - sys.setrecursionlimit(20000) - - def __init__(self, key=None, val=None): - # Size of tree - self.size = 0 - # Height of tree - self.height = 0 - # 'Hardened' flag - if the data structure - # is dumped to disk fully, the flag hard - # is set to True - self.hard = False - # Autocommit mode - self.auto = False - # Autocommit mode level - self.autolevel = 0 - # Current auto left node for autocommit - self.autocurr_l = None - # Current auto right node for autocommit - self.autocurr_r = None - # For stats - # Total number of lookups - self.nlookups = 0 - # Total number of in-mem lookups - self.ngets = 0 - # Total number of disk loads - self.nloads = 0 - self.root = None - if key: - self.root = self.insert(key, val) - self.bdir = '' - self.diskcache = None - - def __del__(self): - self.clear() - - def to_cache(self, key, val): - self.diskcache[str(key)] = cPickle.dumps(val) - self.diskcache.sync() - - def from_cache(self, key): - return cPickle.loads(self.diskcache[str(key)]) - - def addNode(self, key, val): - self.size += 1 - self.height = int(math.ceil(math.log(self.size+1, 2))) - node = BSTNode(key, val, tree=self) - - if self.auto and self.autolevel and self.size>1: - # print 'Auto-dumping...', self.size - if self.size % self.autolevel==0: - self.dump(self.autocurr_l) - # Set autocurr to this node - self.autocurr_l = node - - #if self.autocurr_l and self.autocurr_l.is_balanced(self.autolevel): - # print 'Auto-dumping...', self.autocurr_l.key - # self.dump(self.autocurr_l) - # curr = self.autocurr_l - # # Set autocurr to the children of this node - # self.autocurr_l = curr.left - # self.autocurr_r = curr.right - # print 'Left=>',self.autocurr_l - # print 'Right=>',self.autocurr_r - # print 'Root=>',self.root.key - - #if self.autocurr_r == self.autocurr_l: - # return node - - #if self.autocurr_r and self.autocurr_r.is_balanced(self.autolevel): - # print 'Auto-dumping...', self.autocurr_r.key - # self.dump(self.autocurr_r) - # curr = 
autocurr_r - # # Set autocurr to the children of this node - # self.autocurr_l = curr.left - # self.autocurr_r = curr.right - - - return node - - def __insert(self, root, key, val): - - if root==None: - return self.addNode(key, val) - - else: - if key<=root.key: - # Goes to left subtree - # print 'Inserting on left subtree...', key - root['left'] = self.__insert(root['left'], key, val) - else: - # Goes to right subtree - # print 'Inserting on right subtree...', key - root['right'] = self.__insert(root['right'], key, val) - - return root - - def __lookup(self, root, key): - - if root == None: - return None - else: - if key==root.key: - # Note we are returning the value - return root.get() - else: - if key < root.key: - return self.__lookup(root['left'], key) - else: - return self.__lookup(root['right'], key) - - def __update(self, root, key, newval): - - if root == None: - return None - else: - if key==root.key: - root.set(newval) - else: - if key < root.key: - return self.__update(root['left'], key, newval) - else: - return self.__update(root['right'], key, newval) - - def insert(self, key, val): - node = self.__insert(self.root, key, val) - - if self.root == None: - self.root = node - # Set auto node - self.autocurr_l = self.autocurr_r = self.root - - # If node is added to left of current autocurrent node.. - - return node - - def lookup(self, key): - return self.__lookup(self.root, key) - - def update(self, key, newval): - self.__update(self.root, key, newval) - - def __inorder(self, root): - - if root != None: - for node in self.__inorder(root['left']): - yield node - yield root - for node in self.__inorder(root['right']): - yield node - - def inorder(self): - # Inorder traversal, yielding the nodes - - return self.__inorder(self.root) - - def __preorder(self, root): - - if root != None: - yield root - for node in self.__preorder(root['left']): - yield node - for node in self.__preorder(root['right']): - yield node - - def preorder(self): - # Inorder traversal, yielding the nodes - return self.__preorder(self.root) - - def __postorder(self, root): - - if root != None: - for node in self.__postorder(root['left']): - yield node - for node in self.__postorder(root['right']): - yield node - yield root - - def postorder(self): - # Inorder traversal, yielding the nodes - return self.__postorder(self.root) - - def minnode(self): - # Node with the minimum key value - - root = self.root - - while (root['left'] != None): - root = root['left'] - - return root - - def minkey(self): - - node = self.minnode() - return node.key - - def maxnode(self): - # Node with the maximum key value - - root = self.root - - while (root['right'] != None): - root = root['right'] - - return root - - def maxkey(self): - - node = self.maxnode() - return node.key - - def size_lhs(self): - - # Traverse pre-order and increment counts - if self.root == None: - return 0 - - root_left = self.root['left'] - count = 0 - - for node in self.__preorder(root_left): - count += 1 - - return count - - - def size_rhs(self): - - if self.root == None: - return 0 - - # Traverse pre-order and increment counts - root_right = self.root['right'] - count = 0 - - for node in self.__preorder(root_right): - count += 1 - - return count - - def size(self): - return self.count - - def stats(self): - - d = {'gets': 0, 'loads': 0} - self.__stats(self.root, d) - return d - - def __stats(self, root, d): - - if root != None: - d['gets'] += root.cgets - d['loads'] += root.cloads - self.__stats(root['left'], d) - self.__stats(root['right'], d) - - def 
dump(self, startnode=None): - - if startnode==None: - startnode = self.root - - if startnode==None: - return None - else: - startnode.dump(True) - - self.hard = True - - def load(self): - if self.root==None: - return None - - if self.hard: - self.root.load(True) - self.hard = False - - def reset(self): - self.size = 0 - self.height = 0 - self.hard = False - # Autocommit mode - self.auto = False - self.autolevel = 0 - self.autocurr_l = None - self.autocurr_r = None - self.nlookups = 0 - self.ngets = 0 - self.nloads = 0 - self.root = None - - def clear(self): - - if self.root: - self.root.clear() - - self.reset() - if self.diskcache: - self.diskcache.clear() - - # Remvoe the directory - if self.bdir and os.path.isdir(self.bdir): - try: - shutil.rmtree(self.bdir) - except Exception, e: - print e - - def set_auto(self, level): - # Enable auto commit and set level - # If auto commit is set to true, the tree - # is flushed to disk after the existing - # autocommit node is balanced at the - # level 'level'. The starting autocommit - # node is root by default. - self.auto = True - self.autolevel = level - # Directory for file representation - self.bdir = '.bidx' + str(hash(self)) - if not os.path.isdir(self.bdir): - try: - os.makedirs(self.bdir) - except Exception, e: - raise - - self.diskcache = bsddb.btopen('cache.db','n') # DictCache(10, self.bdir) - # self.diskcache.freq = self.autolevel - -if __name__ == "__main__": - b = BST() - b.set_auto(3) - print b.root - b.insert(4,[4]) - b.insert(3,[2]) - b.insert(2,[6]) - b.insert(1, [3]) - b.insert(5,[5]) - b.insert(6,[7]) - b.insert(0,[8]) - print b.size - print b.height - print b.lookup(4) - b.dump() - # Now try to lookup item 3 - print b.lookup(3) - print b.lookup(3) - print b.lookup(3) - # Load all - b.load() - print b.size, b.height - - # Do inorder - print 'Inorder...' - for node in b.inorder(): - print node.key,'=>',node[node.key] - # Do preorder - print 'Preorder...' - for node in b.preorder(): - print node.key,'=>',node[node.key] - # Do postorder - print 'Postorder...' - for node in b.postorder(): - print node.key,'=>',node[node.key] - - print 'LHS=>',b.size_lhs() - print 'RHS=>',b.size_rhs() - - # b.clear() - print b.stats() - root = b.root - print root.is_balanced() - print root.is_balanced(2) - - del b - - b= BST() - b.insert(10,[4]) - b.insert(5,[2]) - b.insert(2,[6]) - b.insert(7, [3]) - b.insert(14,[5]) - b.insert(12,[7]) - b.insert(15,[8]) - - root = b.root - print root.is_balanced(1) - print root.is_balanced(2) - print root.is_balanced(3) - - print 'LHS=>',b.size_lhs() - print 'RHS=>',b.size_rhs() - diff --git a/HarvestMan-lite/harvestman/lib/common/bst_orig.py b/HarvestMan-lite/harvestman/lib/common/bst_orig.py deleted file mode 100755 index be0dc1c..0000000 --- a/HarvestMan-lite/harvestman/lib/common/bst_orig.py +++ /dev/null @@ -1,489 +0,0 @@ -""" -bst.py - Basic binary search tree in Python with automated disk caching at -the nodes. This is not a full-fledged implementation since it does not -implement node deletion, tree balancing etc. - -Created Anand B Pillai Feb 13 2008 - -Copyright (C) 2008, Anand B Pillai. 
- -""" - - - -import cPickle -import math -import os -import shutil - -class BSTNode(dict): - """ Node class for a BST """ - - def __init__(self, key, val, left=None, right=None): - self.key = key - self[key] = val - self[0] = left - self[1] = right - # Mode flag - # 0 => mem - # 1 => disk - self.mode = 0 - # Cached idx filename - self.fname = '' - # Number of gets - self.cgets = 0 - # Number of loads - self.cloads = 0 - - def __getitem__(self, key): - - try: - return super(BSTNode, self).__getitem__(key) - except KeyError: - return None - - def set(self, value): - self.val = value - - def get(self): - - if self.mode==0: - self.cgets += 1 - return self[self.key] - else: - self.cloads += 1 - self.load() - return self[self.key] - - def is_balanced(self, level=1): - - # Return if this node is balanced - # The node balance check is done per - # level. Default is 1 which means we - # check whether this node has both left - # and right children. If level is 2, this - # is done at one more level, i.e for the - # child nodes also... - - # Leaf node is not balanced... - if self[0]==None and self[1]==None: - return False - - while level>0: - level -= 1 - - if self[0] !=None and self[1] != None: - if level: - return self[0].is_balanced(level) and \ - self[1].is_balanced(level) - else: - return True - else: - return False - - return False - - def load(self, recursive=False): - - # Load values from disk - try: - # Don't load if mode is 0 and value is not None - if self.mode==1 and self[self.key] == None: - self[self.key] = cPickle.load(open(self.fname, 'rb')) - self.mode = 0 - - if recursive: - left = self[0] - if left: left.load(True) - right = self[1] - if right: right.load(True) - - except cPickle.UnpicklingError, e: - raise - except Exception, e: - raise - - def dump(self, bdir, recursive=False): - - try: - if self.mode==0: - self.fname = os.path.join(bdir, str(self.key)) - cPickle.dump(self[self.key], open(self.fname, 'wb')) - # If dumping was done, set val to None to - # reclaim memory... 
- del self[self.key] - self.mode = 1 - else: - # Dont do anything - pass - - if recursive: - left = self[0] - if left: left.dump(bdir, True) - right = self[1] - if right: right.dump(bdir, True) - - except cPickle.PicklingError, e: - raise - except Exception, e: - raise - - def clear(self): - - # Clear removes the data from memory as well as from disk - self.val = None - if self.fname and os.path.isfile(self.fname): - try: - os.remove(self.fname) - except Exception, e: - print e - - left = self[0] - right = self[1] - - if left: - left.clear() - if right: - right.clear() - - super(BSTNode, self).clear() - - -class BST(object): - """ BST class with automated disk caching of node values """ - - def __init__(self, key=None, val=None): - # Size of tree - self.size = 0 - # Height of tree - self.height = 0 - # 'Hardened' flag - if the data structure - # is dumped to disk fully, the flag hard - # is set to True - self.hard = False - # Autocommit mode - self.auto = False - # Autocommit mode level - self.autolevel = 0 - # Current auto left node for autocommit - self.autocurr_l = None - # Current auto right node for autocommit - self.autocurr_r = None - # For stats - # Total number of lookups - self.nlookups = 0 - # Total number of in-mem lookups - self.ngets = 0 - # Total number of disk loads - self.nloads = 0 - # Directory for file representation - self.bdir = '.bidx' + str(hash(self)) - if not os.path.isdir(self.bdir): - try: - os.makedirs(self.bdir) - except Exception, e: - raise - - self.root = None - if key: - self.root = self.insert(key, val) - - def addNode(self, key, val): - self.size += 1 - self.height = int(math.ceil(math.log(self.size+1, 2))) - node = BSTNode(key, val) - - if self.auto and self.autolevel and self.size>1: - # Check if the node has become balanced at the - # requested level... 
- - if self.auto and self.autolevel: - # print 'Auto-dumping...', self.size - if self.size % self.autolevel==0: - self.dump(self.autocurr_l) - # Set autocurr to this node - self.autocurr_l = node - - #if self.autocurr_l and self.autocurr_l.is_balanced(self.autolevel): - # print 'Auto-dumping...', self.autocurr_l.key - # self.dump(self.autocurr_l) - # curr = self.autocurr_l - # # Set autocurr to the children of this node - # self.autocurr_l = curr.left - # self.autocurr_r = curr.right - # print 'Left=>',self.autocurr_l - # print 'Right=>',self.autocurr_r - # print 'Root=>',self.root.key - - #if self.autocurr_r == self.autocurr_l: - # return node - - #if self.autocurr_r and self.autocurr_r.is_balanced(self.autolevel): - # print 'Auto-dumping...', self.autocurr_r.key - # self.dump(self.autocurr_r) - # curr = autocurr_r - # # Set autocurr to the children of this node - # self.autocurr_l = curr.left - # self.autocurr_r = curr.right - - - return node - - def __insert(self, root, key, val): - - if root==None: - return self.addNode(key, val) - - else: - if key<=root.key: - # Goes to left subtree - # print 'Inserting on left subtree...', key - root[0] = self.__insert(root[0], key, val) - else: - # Goes to right subtree - # print 'Inserting on right subtree...', key - root[1] = self.__insert(root[1], key, val) - - return root - - def __lookup(self, root, key): - - if root == None: - return None - else: - if key==root.key: - # Note we are returning the value - return root.get() - else: - if key < root.key: - return self.__lookup(root[0], key) - else: - return self.__lookup(root[1], key) - - def insert(self, key, val): - node = self.__insert(self.root, key, val) - - if self.root == None: - self.root = node - # Set auto node - self.autocurr_l = self.autocurr_r = self.root - - # If node is added to left of current autocurrent node.. 
- - return node - - def lookup(self, key): - return self.__lookup(self.root, key) - - def __inorder(self, root): - - if root != None: - for node in self.__inorder(root[0]): - yield node - yield root - for node in self.__inorder(root[1]): - yield node - - def inorder(self): - # Inorder traversal, yielding the nodes - - return self.__inorder(self.root) - - def __preorder(self, root): - - if root != None: - yield root - for node in self.__preorder(root[0]): - yield node - for node in self.__preorder(root[1]): - yield node - - def preorder(self): - # Inorder traversal, yielding the nodes - return self.__preorder(self.root) - - def __postorder(self, root): - - if root != None: - for node in self.__postorder(root[0]): - yield node - for node in self.__postorder(root[1]): - yield node - yield root - - def postorder(self): - # Inorder traversal, yielding the nodes - return self.__postorder(self.root) - - def minnode(self): - # Node with the minimum key value - - root = self.root - - while (root[0] != None): - root = root[0] - - return root - - def minkey(self): - - node = self.minnode() - return node.key - - def maxnode(self): - # Node with the maximum key value - - root = self.root - - while (root[1] != None): - root = root[1] - - return root - - def maxkey(self): - - node = self.maxnode() - return node.key - - def size_lhs(self): - - # Return the node size on the LHS (excluding root) - root = self.root - count = 0 - - while root[0] != None: - root = root[0] - count += 1 - - return count - - def size_rhs(self): - - # Return the node size on the LHS (excluding root) - root = self.root - count = 0 - - while root[1] != None: - root = root[1] - count += 1 - - return count - - def size(self): - return self.count - - def stats(self): - - d = {'gets': 0, 'loads': 0} - self.__stats(self.root, d) - return d - - def __stats(self, root, d): - - if root != None: - d['gets'] += root.cgets - d['loads'] += root.cloads - self.__stats(root[0], d) - self.__stats(root[1], d) - - def dump(self, startnode=None): - - if startnode==None: - startnode = self.root - - if startnode==None: - return None - else: - startnode.dump(self.bdir, True) - - self.hard = True - - def load(self): - if self.root==None: - return None - - if self.hard: - self.root.load(True) - self.hard = False - - def clear(self): - - if self.root: - self.root.clear() - # Remvoe the directory - if self.bdir and os.path.isdir(self.bdir): - try: - shutil.rmtree(self.bdir) - except Exception, e: - print e - - def set_auto(self, level): - # Enable auto commit and set level - # If auto commit is set to true, the tree - # is flushed to disk after the existing - # autocommit node is balanced at the - # level 'level'. The starting autocommit - # node is root by default. - self.auto = True - self.autolevel = level - - -if __name__ == "__main__": - b = BST() - b.set_auto(3) - print b.root - b.insert(4,[4]) - b.insert(3,[2]) - b.insert(2,[6]) - b.insert(1, [3]) - b.insert(5,[5]) - b.insert(6,[7]) - b.insert(0,[8]) - print b.size - print b.height - print b.lookup(4) - #b.dump() - # Now try to lookup item 3 - print b.lookup(3) - print b.lookup(3) - print b.lookup(3) - # Load all - b.load() - print b.size, b.height - - # Do inorder - print 'Inorder...' - for node in b.inorder(): - print node.key,'=>',node[node.key] - # Do preorder - print 'Preorder...' - for node in b.preorder(): - print node.key,'=>',node[node.key] - # Do postorder - print 'Postorder...' 
- for node in b.postorder(): - print node.key,'=>',node[node.key] - - print b.size_lhs() - print b.size_rhs() - - # b.clear() - print b.stats() - root = b.root - print root.is_balanced() - print root.is_balanced(2) - del b - - b= BST() - b.insert(10,[4]) - b.insert(5,[2]) - b.insert(2,[6]) - b.insert(7, [3]) - b.insert(14,[5]) - b.insert(12,[7]) - b.insert(15,[8]) - - root = b.root - print root.is_balanced(1) - print root.is_balanced(2) - print root.is_balanced(3) diff --git a/HarvestMan-lite/harvestman/lib/common/common.py b/HarvestMan-lite/harvestman/lib/common/common.py deleted file mode 100755 index 14d2e5c..0000000 --- a/HarvestMan-lite/harvestman/lib/common/common.py +++ /dev/null @@ -1,603 +0,0 @@ -# -- coding: utf-8 -""" common.py - Global functions for HarvestMan Program. - This file is part of the HarvestMan software. - For licensing information, see file LICENSE.TXT. - - Author: Anand B Pillai - - Created: Jun 10 2003 - - Aug 17 2006 Anand Modifications for the new logging - module. - - Feb 7 2007 Anand Some changes. Added logconsole - function. Split Initialize() to - InitConfig() and InitLogger(). - Feb 26 2007 Anand Replaced urlmappings dictionary - with a WeakValueDictionary. - - Copyright (C) 2004 - Anand B Pillai. - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -import weakref -import os, sys -import socket -import binascii -import copy -import threading -import shelve -import cStringIO -import traceback -import threading -import collections -import random -import cStringIO -import tokenize - -from types import * -from singleton import Singleton - -class Alias(Singleton): - def __getattr__(self, name): - try: - return super(Alias, self).__getattr__(name) - except AttributeError: - return None - pass - -class AliasError(Exception): - pass - -class GlobalData(Singleton): - def __getattr__(self, name): - try: - return super(Alias, self).__getattr__(name) - except AttributeError: - return None - -# Namespace for global unique objects - -# This varible holds each global object in HarvestMan -# If any module redefines an 'objects' variable locally, it -# is doing at its own peril! -objects = Alias() - -# Namespace for global data -globaldata = GlobalData() -globaldata.userdebug = [] - - -class SleepEvent(object): - """ A class representing a timeout event. This can be - used to passively wait for a given time-period instead of - using time.sleep(...) """ - - def __init__(self, sleeptime): - self._sleeptime = sleeptime - self.evt = threading.Event() - self.evt.set() - - def sleep(self): - self.evt.clear() - self.evt.wait(self._sleeptime) - self.evt.set() - -class RandomSleepEvent(SleepEvent): - """ A class representing a timeout event. This can be - used to passively wait for a given time-period instead of - using time.sleep(...) 
""" - - def sleep(self): - self.evt.clear() - self.evt.wait(random.random()*self._sleeptime) - self.evt.set() - -class DummyStderr(object): - """ A dummy class to imitate stderr """ - - def write(self, msg): - pass - -class CaselessDict(dict): - - def __init__(self, mapping=None): - if mapping: - if type(mapping) is dict: - for k,v in d.items(): - self.__setitem__(k, v) - elif type(mapping) in (list, tuple): - d = dict(mapping) - for k,v in d.items(): - self.__setitem__(k, v) - - # super(CaselessDict, self).__init__(d) - - def __setitem__(self, name, value): - - if type(name) in StringTypes: - super(CaselessDict, self).__setitem__(name.lower(), value) - else: - super(CaselessDict, self).__setitem__(name, value) - - def __getitem__(self, name): - if type(name) in StringTypes: - return super(CaselessDict, self).__getitem__(name.lower()) - else: - return super(CaselessDict, self).__getitem__(name) - - def __copy__(self): - pass - - -class Ldeque(collections.deque): - """ Length-limited deque """ - - def __init__(self, count=10): - self.max = count - super(Ldeque, self).__init__() - - def append(self, item): - super(Ldeque, self).append(item) - if len(self)>self.max: - # if size exceeds, pop from left - self.popleft() - - def appendleft(self, item): - super(Ldeque, self).appendleft(item) - if len(self)>self.max: - # if size exceeds, pop from right - self.pop() - - def index(self, item): - """ Return the index of an item from the deque """ - - return list(self).index(item) - - def remove(self, item): - """ Remove an item from the deque """ - - idx = self.index(item) - self.__delitem__(idx) - -def SysExceptHook(typ, val, tracebak): - """ Dummy function to replace sys.excepthook """ - pass - - -def SetAlias(obj): - """ Set unique alias for the object """ - - # Alias is another name for the object, it should be unique - # The object's class should have a field name 'alias' - if getattr(obj, 'alias') == None: - raise AliasError, "object does not define 'alias' attribute!" - - setattr(objects, obj.alias, obj) - -def SetLogFile(): - - logfile = objects.config.logfile - if logfile: - objects.logger.setLogSeverity(objects.config.verbosity) - # If simulation is turned off, add file-handle - if not objects.config.simulate: - objects.logger.addLogHandler('FileHandler',logfile) - -def SetUserDebug(message): - """ Used to store error messages related - to user settings in the config file/project file. - These will be printed at the end of the program """ - - if message: - try: - globaldata.userdebug.index(message) - except: - globaldata.userdebug.append(message) - -def SetLogSeverity(): - objects.logger.setLogSeverity(objects.config.verbosity) - -def wasOrWere(val): - """ What it says """ - - if val > 1: return 'were' - else: return 'was' - -def plural((s, val)): - """ What it says """ - - if val>1: - if s[len(s)-1] == 'y': - return s[:len(s)-1]+'ies' - else: return s + 's' - else: - return s - -# file type identification functions -# this is the precursor of a more generic file identificator -# based on the '/etc/magic' file on unices. 
- -signatures = { "gif" : [0, ("GIF87a", "GIF89a")], - "jpeg" :[6, ("JFIF",)], - "bmp" : [0, ("BM6",)] - } -aliases = { "gif" : (), # common extension aliases - "jpeg" : ("jpg", "jpe", "jfif"), - "bmp" : ("dib",) } - -def bin_crypt(data): - """ Encryption using binascii and obfuscation """ - - if data=='': - return '' - - try: - return binascii.hexlify(obfuscate(data)) - except TypeError, e: - debug('Error in encrypting data: <',data,'>', e) - return data - except ValueError, e: - debug('Error in encrypting data: <',data,'>', e) - return data - -def bin_decrypt(data): - """ Decrypttion using binascii and deobfuscation """ - - if data=='': - return '' - - try: - return unobfuscate(binascii.unhexlify(data)) - except TypeError, e: - logconsole('Error in decrypting data: <',data,'>', e) - return data - except ValueError, e: - logconsole('Error in decrypting data: <',data,'>', e) - return data - - -def obfuscate(data): - """ Obfuscate a string using repeated xor """ - - out = "" - import operator - - e0=chr(operator.xor(ord(data[0]), ord(data[1]))) - out = "".join((out, e0)) - - x=1 - eprev=e0 - for x in range(1, len(data)): - ax=ord(data[x]) - ex=chr(operator.xor(ax, ord(eprev))) - out = "".join((out,ex)) - eprev = ex - - return out - -def unobfuscate(data): - """ Unobfuscate a xor obfuscated string """ - - out = "" - x=len(data) - 1 - - import operator - - while x>1: - apos=data[x] - aprevpos=data[x-1] - epos=chr(operator.xor(ord(apos), ord(aprevpos))) - out = "".join((out, epos)) - x -= 1 - - out=str(reduce(lambda x, y: y + x, out)) - e2, a2 = data[1], data[0] - a1=chr(operator.xor(ord(a2), ord(e2))) - a1 = "".join((a1, out)) - out = a1 - e1,a1=out[0], data[0] - a0=chr(operator.xor(ord(a1), ord(e1))) - a0 = "".join((a0, out)) - out = a0 - - return out - -def send_url(data, host, port): - - cfg = objects.config - if cfg.urlserver_protocol == 'tcp': - return send_url_tcp(data, host, port) - elif cfg.urlserver_protocol == 'udp': - return send_url_udp(data, host, port) - -def send_url_tcp(data, host, port): - """ Send url to url server """ - - # Return's server response if connection - # succeeded and null string if failed. - try: - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - sock.connect((host,port)) - sock.sendall(data) - response = sock.recv(8192) - sock.close() - return response - except socket.error, e: - # print 'url server error:',e - pass - - return '' - -def send_url_udp(data, host, port): - """ Send url to url server """ - - # Return's server response if connection - # succeeded and null string if failed. - try: - sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) - sock.sendto(data,0,(host, port)) - response, addr = sock.recvfrom(8192, 0) - sock.close() - return response - except socket.error: - pass - - return '' - -def ping_urlserver(host, port): - - cfg = objects.config - - if cfg.urlserver_protocol == 'tcp': - return ping_urlserver_tcp(host, port) - elif cfg.urlserver_protocol == 'udp': - return ping_urlserver_udp(host, port) - -def ping_urlserver_tcp(host, port): - """ Ping url server to see if it is alive """ - - # Returns server's response if server is - # alive & null string if server is not alive. 
- try: - debug('Pinging server at (%s:%d)' % (host, port)) - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - sock.connect((host,port)) - # Send a small packet - sock.sendall("ping") - response = sock.recv(8192) - if response: - debug('Url server is alive') - sock.close() - return response - except socket.error: - debug('Could not connect to (%s:%d)' % (host, port)) - return '' - -def ping_urlserver_udp(host, port): - """ Ping url server to see if it is alive """ - - # Returns server's response if server is - # alive & null string if server is not alive. - try: - debug('Pinging server at (%s:%d)' % (host, port)) - sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) - # Send a small packet - sock.sendto("ping", 0, (host,port)) - response, addr = sock.recvfrom(8192,0) - if response: - debug('Url server is alive') - sock.close() - return response - except socket.error: - debug('Could not connect to (%s:%d)' % (host, port)) - return '' - -def GetTempDir(): - """ Return the temporary directory """ - - # Currently used by hget - tmpdir = max(map(lambda x: os.environ.get(x, ''), ['TEMP','TMP','TEMPDIR','TMPDIR'])) - - if tmpdir=='': - # No temp dir env variable - if os.name == 'posix': - if os.path.isdir('/tmp'): - return '/tmp' - elif os.path.isdir('/usr/tmp'): - return '/usr/tmp' - elif os.name == 'nt': - profiledir = os.environ.get('USERPROFILE','') - if profiledir: - return os.path.join(profiledir,'Local Settings','Temp') - else: - return os.path.abspath(tmpdir) - -def GetMyTempDir(): - """ Return temporary directory for HarvestMan. Also creates - it if the directory is not there """ - - # This is tempdir/HarvestMan - tmpdir = os.path.join(GetTempDir(), 'harvestman') - if not os.path.isdir(tmpdir): - try: - os.makedirs(tmpdir) - except OSError, e: - return '' - - return tmpdir - -def debug(arg, *args): - """ Log information, will log if verbosity is equal to DEBUG level """ - - objects.logger.debug(arg, *args) - -def info(arg, *args): - """ Log information, will log if verbosity is <= INFO level """ - - objects.logger.info(arg, *args) - -def extrainfo(arg, *args): - """ Log information, will log if verbosity is <= EXTRAINFO level """ - - objects.logger.extrainfo(arg, *args) - -def warning(arg, *args): - """ Log information, will log if verbosity is <= WARNING level """ - - objects.logger.warning(arg, *args) - -def error(arg, *args): - """ Log information, will log if verbosity is <= ERROR level """ - - objects.logger.error(arg, *args) - -def critical(arg, *args): - """ Log information, will log if verbosity is <= CRITICAL level """ - - objects.logger.critical(arg, *args) - -def logconsole(arg, *args): - """ Log directly to sys.stdout using print """ - - # Setting verbosity to 5 will print maximum information - # plus maximum debugging information. - objects.logger.logconsole(arg, *args) - -def logtraceback(console=False): - """ Log the most recent exception traceback. By default - the trace goes only to the log file """ - - s = cStringIO.StringIO() - traceback.print_tb(sys.exc_info()[-1], None, s) - if not console: - objects.logger.disableConsoleLogging() - # Log to logger - objects.logger.debug(s.getvalue()) - # Enable console logging again - objects.logger.enableConsoleLogging() - -def hexit(arg): - """ Exit wrapper for HarvestMan """ - - print_traceback() - sys.exit(arg) - -def print_traceback(): - print 'Printing error traceback for debugging...' 
- traceback.print_tb(sys.exc_info()[-1], None, sys.stdout) - -# Effbot's simple_eval function which is a safe replacement -# for Python's eval for tuples... - -def atom(next, token): - if token[1] == "(": - out = [] - token = next() - while token[1] != ")": - out.append(atom(next, token)) - token = next() - if token[1] == ",": - token = next() - return tuple(out) - elif token[0] is tokenize.STRING: - return token[1][1:-1].decode("string-escape") - elif token[0] is tokenize.NUMBER: - try: - return int(token[1], 0) - except ValueError: - return float(token[1]) - raise SyntaxError("malformed expression (%s)" % token[1]) - -def simple_eval(source): - src = cStringIO.StringIO(source).readline - src = tokenize.generate_tokens(src) - res = atom(src.next, src.next()) - if src.next()[0] is not tokenize.ENDMARKER: - raise SyntaxError("bogus data after expression") - return res - -def set_aliases(path=None): - - if path != None: - sys.path.append(path) - - import config - SetAlias(config.HarvestManStateObject()) - - import datamgr - import rules - import connector - import urlqueue - import logger - import event - - SetAlias(logger.HarvestManLogger()) - - # Data manager object - dmgr = datamgr.HarvestManDataManager() - dmgr.initialize() - SetAlias(dmgr) - - # Rules checker object - ruleschecker = rules.HarvestManRulesChecker() - SetAlias(ruleschecker) - - # Connector manager object - connmgr = connector.HarvestManNetworkConnector() - SetAlias(connmgr) - - # Connector factory - conn_factory = connector.HarvestManUrlConnectorFactory(objects.config.connections) - SetAlias(conn_factory) - - queuemgr = urlqueue.HarvestManCrawlerQueue() - SetAlias(queuemgr) - - SetAlias(event.HarvestManEvent()) - -def test_sgmlop(): - """ Test whether sgmlop is available and working """ - - html="""\ - < - title>Test sgmlop - -

This is a pargraph

- -
Feb 13 2008 - -Copyright (C) 2008, Anand B Pillai. - -""" - -import os -import cPickle -import time -from threading import Semaphore - -PID = os.getpid() - -class DictCache(object): - """ A dictionary like object with pickled disk caching - which allows to store large amount of data with minimal - memory costs """ - - def __init__(self, frequency, tmpdir=''): - # Frequency at which commits are done to disk - self.freq = frequency - # Total number of commit cycles - self.cycles = 0 - self.curr = 0 - # Disk cache... - self.cache = {} - # Internal temporary cache - self.d = {} - self.dmutex = Semaphore(1) - # Last loaded cache dictionary from disk - self.dcache = {} - # disk cache hits - self.dhits = 0 - # in-mem cache hits - self.mhits = 0 - # temp dict hits - self.thits = 0 - self.tmpdir = tmpdir - if self.tmpdir: - self.froot = os.path.join(self.tmpdir, '.' + str(PID) + '_' + str(abs(hash(self)))) - else: - self.froot = '.' + str(PID) + '_' + str(abs(hash(self))) - self.t = 0 - - def __setitem__(self, key, value): - - try: - self.dmutex.acquire() - try: - self.d[key] = value - self.curr += 1 - if self.curr==self.freq: - self.cycles += 1 - # Dump the cache dictionary to disk... - fname = ''.join((self.froot,'#',str(self.cycles))) - # print self.d - cPickle.dump(self.d, open(fname, 'wb')) - # We index the cache keys and associate the - # cycle number to them, since the filename - # is further associated to the cycle number, - # finding the cache file associated to a - # dictionary key is a simple dictionary look-up - # operation, costing only O(1)... - for k in self.d.iterkeys(): - self.cache[k] = self.cycles - self.d.clear() - self.curr = 0 - except Exception, e: - import traceback - print 'Exception:',e, traceback.extract_stack() - traceback.print_stack() - finally: - self.dmutex.release() - - def __len__(self): - # Return the 'virtual' length of the - # dictionary - - # Length is the temporary cache length - # plus size of disk caches. This assumes - # that all the committed caches are still - # present... - return len(self.d) + self.cycles*self.freq - - def __getitem__(self, key): - try: - item = self.d[key] - self.thits += 1 - return item - except KeyError: - try: - item = self.dcache[key] - self.mhits += 1 - return item - except KeyError: - t1 = time.time() - # Load cache from disk... - # Cache filename lookup is an O(1) operation... 
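Before the O(1) filename lookup mentioned above continues below, here is a minimal usage sketch of DictCache as a whole; the import path and values are assumed for illustration only.
```python
# Usage sketch of the DictCache class above (import path assumed).
from harvestman.lib.common.dictcache import DictCache

cache = DictCache(100, tmpdir='/tmp')   # commit a batch to disk every 100 items
for i in range(1000):
    cache[i] = 'value-%d' % i           # every 100th set pickles the batch

print len(cache)          # "virtual" length: in-memory plus committed items
print cache[5]            # transparently reloaded from the pickled batch
print cache.get_stats()   # disk/memory/temp hit counters
cache.clear()             # removes the on-disk cache files
```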
- try: - fname = ''.join((self.froot,'#',str(self.cache[key]))) - except KeyError: - return None - try: - # Always caches the last loaded entry - self.dcache = cPickle.load(open(fname,'rb')) - self.dhits += 1 - # print time.time() - t1 - self.t += time.time() - t1 - - return self.dcache[key] - except (OSError, IOError, EOFError,KeyError), e: - return None - - def clear(self): - - try: - self.dmutex.acquire() - self.d.clear() - self.dcache.clear() - - # Remove cache filenames - for k in self.cache.itervalues(): - fname = ''.join((self.froot,'#',str(k))) - if os.path.isfile(fname): - # print 'Removing file',fname - os.remove(fname) - - self.cache.clear() - # Reset counters - self.curr = 0 - self.cycles = 0 - self.clear_counters() - finally: - self.dmutex.release() - - def clear_counters(self): - self.dhits = 0 - self.thits = 0 - self.mhits = 0 - self.t = 0 - - def get_stats(self): - """ Return stats as a dictionary """ - - if len(self): - average = float(self.t)/float(len(self)) - else: - average = 0.0 - - return { 'disk_hits' : self.dhits, - 'mem_hits' : self.mhits, - 'temp_hits' : self.thits, - 'time': self.t, - 'average' : average } - - def __del__(self): - self.clear() diff --git a/HarvestMan-lite/harvestman/lib/common/feedparser.py b/HarvestMan-lite/harvestman/lib/common/feedparser.py deleted file mode 100755 index bb802df..0000000 --- a/HarvestMan-lite/harvestman/lib/common/feedparser.py +++ /dev/null @@ -1,2858 +0,0 @@ -#!/usr/bin/env python -"""Universal feed parser - -Handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds - -Visit http://feedparser.org/ for the latest version -Visit http://feedparser.org/docs/ for the latest documentation - -Required: Python 2.1 or later -Recommended: Python 2.3 or later -Recommended: CJKCodecs and iconv_codec -""" - -__version__ = "4.1"# + "$Revision: 1.92 $"[11:15] + "-cvs" -__license__ = """Copyright (c) 2002-2006, Mark Pilgrim, All rights reserved. - -Redistribution and use in source and binary forms, with or without modification, -are permitted provided that the following conditions are met: - -* Redistributions of source code must retain the above copyright notice, - this list of conditions and the following disclaimer. -* Redistributions in binary form must reproduce the above copyright notice, - this list of conditions and the following disclaimer in the documentation - and/or other materials provided with the distribution. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS' -AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE -LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE.""" -__author__ = "Mark Pilgrim " -__contributors__ = ["Jason Diamond ", - "John Beimler ", - "Fazal Majid ", - "Aaron Swartz ", - "Kevin Marks "] -_debug = 0 - -# HTTP "User-Agent" header to send to servers when downloading feeds. -# If you are embedding feedparser in a larger application, you should -# change this to your application name and URL. 
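When the module is embedded, the User-Agent override described in the comment above is typically done in one of two ways; the agent string and feed URL below are placeholders.
```python
# Illustrative only (placeholder agent string): two ways to send a custom
# User-Agent when embedding this feedparser module.
import feedparser

# Override the module-level default for all subsequent calls...
feedparser.USER_AGENT = "MyCrawler/1.0 +http://example.com/"

# ...or pass an agent string for a single parse() call.
d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml",
                     agent="MyCrawler/1.0 +http://example.com/")
print d.feed.get('title')
```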
-USER_AGENT = "UniversalFeedParser/%s +http://feedparser.org/" % __version__ - -# HTTP "Accept" header to send to servers when downloading feeds. If you don't -# want to send an Accept header, set this to None. -ACCEPT_HEADER = "application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1" - -# List of preferred XML parsers, by SAX driver name. These will be tried first, -# but if they're not installed, Python will keep searching through its own list -# of pre-installed parsers until it finds one that supports everything we need. -PREFERRED_XML_PARSERS = ["drv_libxml2"] - -# If you want feedparser to automatically run HTML markup through HTML Tidy, set -# this to 1. Requires mxTidy -# or utidylib . -TIDY_MARKUP = 0 - -# List of Python interfaces for HTML Tidy, in order of preference. Only useful -# if TIDY_MARKUP = 1 -PREFERRED_TIDY_INTERFACES = ["uTidy", "mxTidy"] - -# ---------- required modules (should come with any Python distribution) ---------- -import sgmllib, re, sys, copy, urlparse, time, rfc822, types, cgi, urllib, urllib2 -try: - from cStringIO import StringIO as _StringIO -except: - from StringIO import StringIO as _StringIO - -# ---------- optional modules (feedparser will work without these, but with reduced functionality) ---------- - -# gzip is included with most Python distributions, but may not be available if you compiled your own -try: - import gzip -except: - gzip = None -try: - import zlib -except: - zlib = None - -# If a real XML parser is available, feedparser will attempt to use it. feedparser has -# been tested with the built-in SAX parser, PyXML, and libxml2. On platforms where the -# Python distribution does not come with an XML parser (such as Mac OS X 10.2 and some -# versions of FreeBSD), feedparser will quietly fall back on regex-based parsing. -try: - import xml.sax - xml.sax.make_parser(PREFERRED_XML_PARSERS) # test for valid parsers - from xml.sax.saxutils import escape as _xmlescape - _XML_AVAILABLE = 1 -except: - _XML_AVAILABLE = 0 - def _xmlescape(data): - data = data.replace('&', '&') - data = data.replace('>', '>') - data = data.replace('<', '<') - return data - -# base64 support for Atom feeds that contain embedded binary data -try: - import base64, binascii -except: - base64 = binascii = None - -# cjkcodecs and iconv_codec provide support for more character encodings. 
-# Both are available from http://cjkpython.i18n.org/ -try: - import cjkcodecs.aliases -except: - pass -try: - import iconv_codec -except: - pass - -# chardet library auto-detects character encodings -# Download from http://chardet.feedparser.org/ -try: - import chardet - if _debug: - import chardet.constants - chardet.constants._debug = 1 -except: - chardet = None - -# ---------- don't touch these ---------- -class ThingsNobodyCaresAboutButMe(Exception): pass -class CharacterEncodingOverride(ThingsNobodyCaresAboutButMe): pass -class CharacterEncodingUnknown(ThingsNobodyCaresAboutButMe): pass -class NonXMLContentType(ThingsNobodyCaresAboutButMe): pass -class UndeclaredNamespace(Exception): pass - -sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*') -sgmllib.special = re.compile('' % (tag, ''.join([' %s="%s"' % t for t in attrs])), escape=0) - - # match namespaces - if tag.find(':') <> -1: - prefix, suffix = tag.split(':', 1) - else: - prefix, suffix = '', tag - prefix = self.namespacemap.get(prefix, prefix) - if prefix: - prefix = prefix + '_' - - # special hack for better tracking of empty textinput/image elements in illformed feeds - if (not prefix) and tag not in ('title', 'link', 'description', 'name'): - self.intextinput = 0 - if (not prefix) and tag not in ('title', 'link', 'description', 'url', 'href', 'width', 'height'): - self.inimage = 0 - - # call special handler (if defined) or default handler - methodname = '_start_' + prefix + suffix - try: - method = getattr(self, methodname) - return method(attrsD) - except AttributeError: - return self.push(prefix + suffix, 1) - - def unknown_endtag(self, tag): - if _debug: sys.stderr.write('end %s\n' % tag) - # match namespaces - if tag.find(':') <> -1: - prefix, suffix = tag.split(':', 1) - else: - prefix, suffix = '', tag - prefix = self.namespacemap.get(prefix, prefix) - if prefix: - prefix = prefix + '_' - - # call special handler (if defined) or default handler - methodname = '_end_' + prefix + suffix - try: - method = getattr(self, methodname) - method() - except AttributeError: - self.pop(prefix + suffix) - - # track inline content - if self.incontent and self.contentparams.has_key('type') and not self.contentparams.get('type', 'xml').endswith('xml'): - # element declared itself as escaped markup, but it isn't really - self.contentparams['type'] = 'application/xhtml+xml' - if self.incontent and self.contentparams.get('type') == 'application/xhtml+xml': - tag = tag.split(':')[-1] - self.handle_data('' % tag, escape=0) - - # track xml:base and xml:lang going out of scope - if self.basestack: - self.basestack.pop() - if self.basestack and self.basestack[-1]: - self.baseuri = self.basestack[-1] - if self.langstack: - self.langstack.pop() - if self.langstack: # and (self.langstack[-1] is not None): - self.lang = self.langstack[-1] - - def handle_charref(self, ref): - # called for each character reference, e.g. for ' ', ref will be '160' - if not self.elementstack: return - ref = ref.lower() - if ref in ('34', '38', '39', '60', '62', 'x22', 'x26', 'x27', 'x3c', 'x3e'): - text = '&#%s;' % ref - else: - if ref[0] == 'x': - c = int(ref[1:], 16) - else: - c = int(ref) - text = unichr(c).encode('utf-8') - self.elementstack[-1][2].append(text) - - def handle_entityref(self, ref): - # called for each entity reference, e.g. 
The remainder of this diff covers the copy of `feedparser.py` (the Universal Feed Parser module) carried in the source tree. The code in question comprises:

  1. the rest of the `_FeedParserMixin` handlers: entity and character-reference handling, CDATA parsing, namespace tracking, the `push`/`pop` element stack (base64 decoding, relative-URI resolution, entity decoding and HTML sanitisation of element content), and the per-element `_start_*`/`_end_*` methods for RSS, Atom, CDF, Dublin Core and iTunes elements (channel/feed, image, textinput, author, contributor, dates, categories and tags, cloud, link, guid, title, description, summary, enclosures, source, content);
  1. `_StrictFeedParser`, the SAX `ContentHandler` used when an XML parser is available;
  1. `_BaseHTMLProcessor` (built on `sgmllib.SGMLParser`), `_LooseFeedParser`, and `_RelativeURIResolver`/`_resolveRelativeURIs` for resolving relative URIs inside embedded markup;
  1. `_HTMLSanitizer` and `_sanitizeHTML`, which strip disallowed elements and attributes and can optionally post-process the markup with uTidy or mxTidy;
  1. `_FeedURLHandler` and `_open_resource`, which accept a URL, filename, stream or string and handle ETag/If-Modified-Since conditional requests, gzip/deflate content encoding, inline basic authentication, and the User-Agent/Referer headers;
  1. the pluggable date parsers registered through `registerDateHandler`: ISO 8601, the Korean OnBlog and Nate formats, MS SQL, Greek and Hungarian 8-bit formats, W3DTF, and RFC 822;
  1. `_getCharacterEncoding`, `_toUTF8` and `_stripDoctype`, which sniff the document's character encoding (per RFC 3023 and section F of the XML specification) and normalise the data to UTF-8;
  1. the top-level `parse()` function, which returns a `FeedParserDict` carrying `feed`, `entries`, `bozo`, `etag`, `modified`, `href`, `status` and `headers`;
  1. the module's embedded revision history.
tags in -# encoded HTML (skadz); fixed unicode handling in normalize_attrs (ChrisL); -# fixed relative URI processing for guid (skadz); added ICBM support; added -# base64 support -#2.7.5 - 1/15/2004 - MAP - added workaround for malformed DOCTYPE (seen on many -# blogspot.com sites); added _debug variable -#2.7.6 - 1/16/2004 - MAP - fixed bug with StringIO importing -#3.0b3 - 1/23/2004 - MAP - parse entire feed with real XML parser (if available); -# added several new supported namespaces; fixed bug tracking naked markup in -# description; added support for enclosure; added support for source; re-added -# support for cloud which got dropped somehow; added support for expirationDate -#3.0b4 - 1/26/2004 - MAP - fixed xml:lang inheritance; fixed multiple bugs tracking -# xml:base URI, one for documents that don't define one explicitly and one for -# documents that define an outer and an inner xml:base that goes out of scope -# before the end of the document -#3.0b5 - 1/26/2004 - MAP - fixed bug parsing multiple links at feed level -#3.0b6 - 1/27/2004 - MAP - added feed type and version detection, result['version'] -# will be one of SUPPORTED_VERSIONS.keys() or empty string if unrecognized; -# added support for creativeCommons:license and cc:license; added support for -# full Atom content model in title, tagline, info, copyright, summary; fixed bug -# with gzip encoding (not always telling server we support it when we do) -#3.0b7 - 1/28/2004 - MAP - support Atom-style author element in author_detail -# (dictionary of 'name', 'url', 'email'); map author to author_detail if author -# contains name + email address -#3.0b8 - 1/28/2004 - MAP - added support for contributor -#3.0b9 - 1/29/2004 - MAP - fixed check for presence of dict function; added -# support for summary -#3.0b10 - 1/31/2004 - MAP - incorporated ISO-8601 date parsing routines from -# xml.util.iso8601 -#3.0b11 - 2/2/2004 - MAP - added 'rights' to list of elements that can contain -# dangerous markup; fiddled with decodeEntities (not right); liberalized -# date parsing even further -#3.0b12 - 2/6/2004 - MAP - fiddled with decodeEntities (still not right); -# added support to Atom 0.2 subtitle; added support for Atom content model -# in copyright; better sanitizing of dangerous HTML elements with end tags -# (script, frameset) -#3.0b13 - 2/8/2004 - MAP - better handling of empty HTML tags (br, hr, img, -# etc.) in embedded markup, in either HTML or XHTML form (
<br>, <br/>, <br />
) -#3.0b14 - 2/8/2004 - MAP - fixed CDATA handling in non-wellformed feeds under -# Python 2.1 -#3.0b15 - 2/11/2004 - MAP - fixed bug resolving relative links in wfw:commentRSS; -# fixed bug capturing author and contributor URL; fixed bug resolving relative -# links in author and contributor URL; fixed bug resolvin relative links in -# generator URL; added support for recognizing RSS 1.0; passed Simon Fell's -# namespace tests, and included them permanently in the test suite with his -# permission; fixed namespace handling under Python 2.1 -#3.0b16 - 2/12/2004 - MAP - fixed support for RSS 0.90 (broken in b15) -#3.0b17 - 2/13/2004 - MAP - determine character encoding as per RFC 3023 -#3.0b18 - 2/17/2004 - MAP - always map description to summary_detail (Andrei); -# use libxml2 (if available) -#3.0b19 - 3/15/2004 - MAP - fixed bug exploding author information when author -# name was in parentheses; removed ultra-problematic mxTidy support; patch to -# workaround crash in PyXML/expat when encountering invalid entities -# (MarkMoraes); support for textinput/textInput -#3.0b20 - 4/7/2004 - MAP - added CDF support -#3.0b21 - 4/14/2004 - MAP - added Hot RSS support -#3.0b22 - 4/19/2004 - MAP - changed 'channel' to 'feed', 'item' to 'entries' in -# results dict; changed results dict to allow getting values with results.key -# as well as results[key]; work around embedded illformed HTML with half -# a DOCTYPE; work around malformed Content-Type header; if character encoding -# is wrong, try several common ones before falling back to regexes (if this -# works, bozo_exception is set to CharacterEncodingOverride); fixed character -# encoding issues in BaseHTMLProcessor by tracking encoding and converting -# from Unicode to raw strings before feeding data to sgmllib.SGMLParser; -# convert each value in results to Unicode (if possible), even if using -# regex-based parsing -#3.0b23 - 4/21/2004 - MAP - fixed UnicodeDecodeError for feeds that contain -# high-bit characters in attributes in embedded HTML in description (thanks -# Thijs van de Vossen); moved guid, date, and date_parsed to mapped keys in -# FeedParserDict; tweaked FeedParserDict.has_key to return True if asking -# about a mapped key -#3.0fc1 - 4/23/2004 - MAP - made results.entries[0].links[0] and -# results.entries[0].enclosures[0] into FeedParserDict; fixed typo that could -# cause the same encoding to be tried twice (even if it failed the first time); -# fixed DOCTYPE stripping when DOCTYPE contained entity declarations; -# better textinput and image tracking in illformed RSS 1.0 feeds -#3.0fc2 - 5/10/2004 - MAP - added and passed Sam's amp tests; added and passed -# my blink tag tests -#3.0fc3 - 6/18/2004 - MAP - fixed bug in _changeEncodingDeclaration that -# failed to parse utf-16 encoded feeds; made source into a FeedParserDict; -# duplicate admin:generatorAgent/@rdf:resource in generator_detail.url; -# added support for image; refactored parse() fallback logic to try other -# encodings if SAX parsing fails (previously it would only try other encodings -# if re-encoding failed); remove unichr madness in normalize_attrs now that -# we're properly tracking encoding in and out of BaseHTMLProcessor; set -# feed.language from root-level xml:lang; set entry.id from rdf:about; -# send Accept header -#3.0 - 6/21/2004 - MAP - don't try iso-8859-1 (can't distinguish between -# iso-8859-1 and windows-1252 anyway, and most incorrectly marked feeds are -# windows-1252); fixed regression that could cause the same encoding to be -# tried twice (even 
if it failed the first time) -#3.0.1 - 6/22/2004 - MAP - default to us-ascii for all text/* content types; -# recover from malformed content-type header parameter with no equals sign -# ('text/xml; charset:iso-8859-1') -#3.1 - 6/28/2004 - MAP - added and passed tests for converting HTML entities -# to Unicode equivalents in illformed feeds (aaronsw); added and -# passed tests for converting character entities to Unicode equivalents -# in illformed feeds (aaronsw); test for valid parsers when setting -# XML_AVAILABLE; make version and encoding available when server returns -# a 304; add handlers parameter to pass arbitrary urllib2 handlers (like -# digest auth or proxy support); add code to parse username/password -# out of url and send as basic authentication; expose downloading-related -# exceptions in bozo_exception (aaronsw); added __contains__ method to -# FeedParserDict (aaronsw); added publisher_detail (aaronsw) -#3.2 - 7/3/2004 - MAP - use cjkcodecs and iconv_codec if available; always -# convert feed to UTF-8 before passing to XML parser; completely revamped -# logic for determining character encoding and attempting XML parsing -# (much faster); increased default timeout to 20 seconds; test for presence -# of Location header on redirects; added tests for many alternate character -# encodings; support various EBCDIC encodings; support UTF-16BE and -# UTF16-LE with or without a BOM; support UTF-8 with a BOM; support -# UTF-32BE and UTF-32LE with or without a BOM; fixed crashing bug if no -# XML parsers are available; added support for 'Content-encoding: deflate'; -# send blank 'Accept-encoding: ' header if neither gzip nor zlib modules -# are available -#3.3 - 7/15/2004 - MAP - optimize EBCDIC to ASCII conversion; fix obscure -# problem tracking xml:base and xml:lang if element declares it, child -# doesn't, first grandchild redeclares it, and second grandchild doesn't; -# refactored date parsing; defined public registerDateHandler so callers -# can add support for additional date formats at runtime; added support -# for OnBlog, Nate, MSSQL, Greek, and Hungarian dates (ytrewq1); added -# zopeCompatibilityHack() which turns FeedParserDict into a regular -# dictionary, required for Zope compatibility, and also makes command- -# line debugging easier because pprint module formats real dictionaries -# better than dictionary-like objects; added NonXMLContentType exception, -# which is stored in bozo_exception when a feed is served with a non-XML -# media type such as 'text/plain'; respect Content-Language as default -# language if not xml:lang is present; cloud dict is now FeedParserDict; -# generator dict is now FeedParserDict; better tracking of xml:lang, -# including support for xml:lang='' to unset the current language; -# recognize RSS 1.0 feeds even when RSS 1.0 namespace is not the default -# namespace; don't overwrite final status on redirects (scenarios: -# redirecting to a URL that returns 304, redirecting to a URL that -# redirects to another URL with a different type of redirect); add -# support for HTTP 303 redirects -#4.0 - MAP - support for relative URIs in xml:base attribute; fixed -# encoding issue with mxTidy (phopkins); preliminary support for RFC 3229; -# support for Atom 1.0; support for iTunes extensions; new 'tags' for -# categories/keywords/etc. 
as array of dict -# {'term': term, 'scheme': scheme, 'label': label} to match Atom 1.0 -# terminology; parse RFC 822-style dates with no time; lots of other -# bug fixes -#4.1 - MAP - removed socket timeout; added support for chardet library diff --git a/HarvestMan-lite/harvestman/lib/common/keepalive.py b/HarvestMan-lite/harvestman/lib/common/keepalive.py deleted file mode 100755 index 675febb..0000000 --- a/HarvestMan-lite/harvestman/lib/common/keepalive.py +++ /dev/null @@ -1,650 +0,0 @@ -# -- coding: utf-8 -# keepalive.py - Module which supports HTTP/HTTPS keep-alive connections -# on the same host using a thread-safe connection pool. -# -# Created Anand B Pillai Sep 10 2007 Code borrowed from urlgrabber -# project. -# -# Original copyright follows: -#--------------Original Copyright----------------------------------- -# This library is free software; you can redistribute it and/or -# modify it under the terms of the GNU Lesser General Public -# License as published by the Free Software Foundation; either -# version 2.1 of the License, or (at your option) any later version. -# -# This library is distributed in the hope that it will be useful, -# but WITHOUT ANY WARRANTY; without even the implied warranty of -# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -# Lesser General Public License for more details. -# -# You should have received a copy of the GNU Lesser General Public -# License along with this library; if not, write to the -# Free Software Foundation, Inc., -# 59 Temple Place, Suite 330, -# Boston, MA 02111-1307 USA - -# This file is part of urlgrabber, a high-level cross-protocol url-grabber -# Copyright 2002-2004 Michael D. Stenner, Ryan Tomayko -#------------Original Copyright--------------------------------------- -# -# - -__author__ = "Anand B Pillai" -__maintainer__ = "Anand B Pillai" -__version__ = "2.0 b1" - -"""An HTTP handler for urllib2 that supports HTTP 1.1 and keepalive. - ->>> import urllib2 ->>> from keepalive import HTTPHandler ->>> keepalive_handler = HTTPHandler() ->>> opener = urllib2.build_opener(keepalive_handler) ->>> urllib2.install_opener(opener) ->>> ->>> fo = urllib2.urlopen('http://www.python.org') - -If a connection to a given host is requested, and all of the existing -connections are still in use, another connection will be opened. If -the handler tries to use an existing connection but it fails in some -way, it will be closed and removed from the pool. - -To remove the handler, simply re-run build_opener with no arguments, and -install that opener. - -You can explicitly close connections by using the close_connection() -method of the returned file-like object (described below) or you can -use the handler methods: - - close_connection(host) - close_all() - open_connections() - -NOTE: using the close_connection and close_all methods of the handler -should be done with care when using multiple threads. 
- * there is nothing that prevents another thread from creating new - connections immediately after connections are closed - * no checks are done to prevent in-use connections from being closed - ->>> keepalive_handler.close_all() - -EXTRA ATTRIBUTES AND METHODS - - Upon a status of 200, the object returned has a few additional - attributes and methods, which should not be used if you want to - remain consistent with the normal urllib2-returned objects: - - close_connection() - close the connection to the host - readlines() - you know, readlines() - status - the return status (ie 404) - reason - english translation of status (ie 'File not found') - - If you want the best of both worlds, use this inside an - AttributeError-catching try: - - >>> try: status = fo.status - >>> except AttributeError: status = None - - Unfortunately, these are ONLY there if status == 200, so it's not - easy to distinguish between non-200 responses. The reason is that - urllib2 tries to do clever things with error codes 301, 302, 401, - and 407, and it wraps the object upon return. - - For python versions earlier than 2.4, you can avoid this fancy error - handling by setting the module-level global HANDLE_ERRORS to zero. - You see, prior to 2.4, it's the HTTP Handler's job to determine what - to handle specially, and what to just pass up. HANDLE_ERRORS == 0 - means "pass everything up". In python 2.4, however, this job no - longer belongs to the HTTP Handler and is now done by a NEW handler, - HTTPErrorProcessor. Here's the bottom line: - - python version < 2.4 - HANDLE_ERRORS == 1 (default) pass up 200, treat the rest as - errors - HANDLE_ERRORS == 0 pass everything up, error processing is - left to the calling code - python version >= 2.4 - HANDLE_ERRORS == 1 pass up 200, treat the rest as errors - HANDLE_ERRORS == 0 (default) pass everything up, let the - other handlers (specifically, - HTTPErrorProcessor) decide what to do - - In practice, setting the variable either way makes little difference - in python 2.4, so for the most consistent behavior across versions, - you probably just want to use the defaults, which will give you - exceptions on errors. 
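
    A minimal sketch (added for illustration, assuming the module is
    importable as keepalive, as in the examples above) of opting out of the
    fancy error handling and checking the status yourself:

    >>> import urllib2, keepalive
    >>> keepalive.HANDLE_ERRORS = 0   # pass non-200 responses up to the
    ...                               # other handlers / calling code
    >>> opener = urllib2.build_opener(keepalive.HTTPHandler())
    >>> urllib2.install_opener(opener)
    >>> fo = urllib2.urlopen('http://www.python.org')
    >>> try: status = fo.status
    ... except AttributeError: status = None
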
- -""" - -# $Id: keepalive.py,v 1.2 2007/10/08 20:52:00 pythonhacker Exp $ - -import urllib2 -import httplib -import socket -import thread - -class FakeLogger: - def debug(self, msg, *args): print msg % args - info = warning = error = debug - -DEBUG = None - -# import sslfactory - -import sys -if sys.version_info < (2, 4): HANDLE_ERRORS = 1 -else: HANDLE_ERRORS = 0 - -class ConnectionManager: - """ - The connection manager must be able to: - * keep track of all existing - """ - def __init__(self): - self._lock = thread.allocate_lock() - self._hostmap = {} # map hosts to a list of connections - self._connmap = {} # map connections to host - self._readymap = {} # map connection to ready state - - def add(self, host, connection, ready): - self._lock.acquire() - try: - if not self._hostmap.has_key(host): self._hostmap[host] = [] - self._hostmap[host].append(connection) - self._connmap[connection] = host - self._readymap[connection] = ready - finally: - self._lock.release() - - def remove(self, connection): - self._lock.acquire() - try: - try: - host = self._connmap[connection] - except KeyError: - pass - else: - del self._connmap[connection] - del self._readymap[connection] - self._hostmap[host].remove(connection) - if not self._hostmap[host]: del self._hostmap[host] - finally: - self._lock.release() - - def set_ready(self, connection, ready): - try: self._readymap[connection] = ready - except KeyError: pass - - def get_ready_conn(self, host): - conn = None - self._lock.acquire() - try: - if self._hostmap.has_key(host): - for c in self._hostmap[host]: - if self._readymap[c]: - self._readymap[c] = 0 - conn = c - break - finally: - self._lock.release() - return conn - - def get_all(self, host=None): - if host: - return list(self._hostmap.get(host, [])) - else: - return dict(self._hostmap) - -class KeepAliveHandler: - def __init__(self): - self._cm = ConnectionManager() - - #### Connection Management - def open_connections(self): - """return a list of connected hosts and the number of connections - to each. [('foo.com:80', 2), ('bar.org', 1)]""" - return [(host, len(li)) for (host, li) in self._cm.get_all().items()] - - def close_connection(self, host): - """close connection(s) to - host is the host:port spec, as in 'www.cnn.com:8080' as passed in. - no error occurs if there is no connection to that host.""" - for h in self._cm.get_all(host): - self._cm.remove(h) - h.close() - - def close_all(self): - """close all open connections""" - for host, conns in self._cm.get_all().items(): - for h in conns: - self._cm.remove(h) - h.close() - - def _request_closed(self, request, host, connection): - """tells us that this request is now closed and the the - connection is ready for another request""" - self._cm.set_ready(connection, 1) - - def _remove_connection(self, host, connection, close=0): - if close: connection.close() - self._cm.remove(connection) - - #### Transaction Execution - def do_open(self, req): - host = req.get_host() - if not host: - raise urllib2.URLError('no host given') - - try: - h = self._cm.get_ready_conn(host) - while h: - r = self._reuse_connection(h, req, host) - - # if this response is non-None, then it worked and we're - # done. Break out, skipping the else block. - if r: break - - # connection is bad - possibly closed by server - # discard it and ask for the next free connection - h.close() - self._cm.remove(h) - h = self._cm.get_ready_conn(host) - else: - # no (working) free connections were found. Create a new one. 
- h = self._get_connection(host) - if DEBUG: DEBUG.info("creating new connection to %s (%d)" % (host, id(h))) - self._cm.add(host, h, 0) - self._start_transaction(h, req) - r = h.getresponse() - except (socket.error, httplib.HTTPException), err: - raise urllib2.URLError(err) - - # if not a persistent connection, don't try to reuse it - if r.will_close: self._cm.remove(h) - - if DEBUG: DEBUG.info("STATUS: %s, %s" % (r.status, r.reason)) - r._handler = self - r._host = host - r._url = req.get_full_url() - r._connection = h - r.code = r.status - r.headers = r.msg - r.msg = r.reason - - if r.status == 200 or not HANDLE_ERRORS: - return r - else: - return self.parent.error('http', req, r, - r.status, r.msg, r.headers) - - def _reuse_connection(self, h, req, host): - """start the transaction with a re-used connection - return a response object (r) upon success or None on failure. - This DOES not close or remove bad connections in cases where - it returns. However, if an unexpected exception occurs, it - will close and remove the connection before re-raising. - """ - try: - self._start_transaction(h, req) - r = h.getresponse() - # note: just because we got something back doesn't mean it - # worked. We'll check the version below, too. - except (socket.error, httplib.HTTPException): - r = None - except: - # adding this block just in case we've missed - # something we will still raise the exception, but - # lets try and close the connection and remove it - # first. We previously got into a nasty loop - # where an exception was uncaught, and so the - # connection stayed open. On the next try, the - # same exception was raised, etc. The tradeoff is - # that it's now possible this call will raise - # a DIFFERENT exception - if DEBUG: DEBUG.error("unexpected exception - closing " + \ - "connection to %s (%d)" % host, id(h)) - self._cm.remove(h) - h.close() - raise - - if r is None or r.version == 9: - # httplib falls back to assuming HTTP 0.9 if it gets a - # bad header back. This is most likely to happen if - # the socket has been closed by the server since we - # last used the connection. 
- if DEBUG: DEBUG.info("failed to re-use connection to %s (%d)" % (host, id(h))) - r = None - else: - if DEBUG: DEBUG.info("re-using connection to %s (%d)" % (host, id(h))) - - return r - - def _start_transaction(self, h, req): - try: - if req.has_data(): - data = req.get_data() - h.putrequest('POST', req.get_selector()) - if not req.headers.has_key('Content-type'): - h.putheader('Content-type', - 'application/x-www-form-urlencoded') - if not req.headers.has_key('Content-length'): - h.putheader('Content-length', '%d' % len(data)) - else: - h.putrequest('GET', req.get_selector()) - except (socket.error, httplib.HTTPException), err: - raise urllib2.URLError(err) - - for args in self.parent.addheaders: - h.putheader(*args) - for k, v in req.headers.items(): - h.putheader(k, v) - h.endheaders() - if req.has_data(): - h.send(data) - - def _get_connection(self, host): - return NotImplementedError - -class HTTPHandler(KeepAliveHandler, urllib2.HTTPHandler): - def __init__(self): - KeepAliveHandler.__init__(self) - - def http_open(self, req): - return self.do_open(req) - - def _get_connection(self, host): - return HTTPConnection(host) - -class HTTPSHandler(KeepAliveHandler, urllib2.HTTPSHandler): - def __init__(self, ssl_factory=None): - KeepAliveHandler.__init__(self) - #if not ssl_factory: - # ssl_factory = sslfactory.get_factory() - #self._ssl_factory = ssl_factory - - def https_open(self, req): - return self.do_open(req) - - def _get_connection(self, host): - # return self._ssl_factory.create_https_connection(host) - return HTTPSConnection(host) - -class HTTPResponse(httplib.HTTPResponse): - # we need to subclass HTTPResponse in order to - # 1) add readline() and readlines() methods - # 2) add close_connection() methods - # 3) add info() and geturl() methods - - # in order to add readline(), read must be modified to deal with a - # buffer. example: readline must read a buffer and then spit back - # one line at a time. The only real alternative is to read one - # BYTE at a time (ick). Once something has been read, it can't be - # put back (ok, maybe it can, but that's even uglier than this), - # so if you THEN do a normal read, you must first take stuff from - # the buffer. - - # the read method wraps the original to accomodate buffering, - # although read() never adds to the buffer. - # Both readline and readlines have been stolen with almost no - # modification from socket.py - - - def __init__(self, sock, debuglevel=0, strict=0, method=None): - if method: # the httplib in python 2.3 uses the method arg - httplib.HTTPResponse.__init__(self, sock, debuglevel, method) - else: # 2.2 doesn't - httplib.HTTPResponse.__init__(self, sock, debuglevel) - self.fileno = sock.fileno - self.code = None - self._rbuf = '' - self._rbufsize = 8096 - self._handler = None # inserted by the handler later - self._host = None # (same) - self._url = None # (same) - self._connection = None # (same) - - _raw_read = httplib.HTTPResponse.read - - def close(self): - if self.fp: - self.fp.close() - self.fp = None - if self._handler: - self._handler._request_closed(self, self._host, - self._connection) - - def close_connection(self): - self._handler._remove_connection(self._host, self._connection, close=1) - self.close() - - def info(self): - return self.headers - - def geturl(self): - return self._url - - def read(self, amt=None): - # the _rbuf test is only in this first if for speed. 
It's not - # logically necessary - if self._rbuf and not amt is None: - L = len(self._rbuf) - if amt > L: - amt -= L - else: - s = self._rbuf[:amt] - self._rbuf = self._rbuf[amt:] - return s - - s = self._rbuf + self._raw_read(amt) - self._rbuf = '' - return s - - def readline(self, limit=-1): - data = "" - i = self._rbuf.find('\n') - while i < 0 and not (0 < limit <= len(self._rbuf)): - new = self._raw_read(self._rbufsize) - if not new: break - i = new.find('\n') - if i >= 0: i = i + len(self._rbuf) - self._rbuf = self._rbuf + new - if i < 0: i = len(self._rbuf) - else: i = i+1 - if 0 <= limit < len(self._rbuf): i = limit - data, self._rbuf = self._rbuf[:i], self._rbuf[i:] - return data - - def readlines(self, sizehint = 0): - total = 0 - list = [] - while 1: - line = self.readline() - if not line: break - list.append(line) - total += len(line) - if sizehint and total >= sizehint: - break - return list - - -class HTTPConnection(httplib.HTTPConnection): - # use the modified response class - response_class = HTTPResponse - -class HTTPSConnection(httplib.HTTPSConnection): - response_class = HTTPResponse - - def connect(self): - import _socket - - # For fixing #503 - sock = _socket.socket(socket.AF_INET, socket.SOCK_STREAM) - sock.connect((self.host, self.port)) - # Change this to certicate paths where you have your SSL client certificates - # to be able to download URLs producing SSL errors. - ssl = socket.ssl(sock, None, None) - - self.sock = httplib.FakeSocket(sock, ssl) - - - -######################################################################### -##### TEST FUNCTIONS -######################################################################### - -def error_handler(url): - global HANDLE_ERRORS - orig = HANDLE_ERRORS - keepalive_handler = HTTPHandler() - opener = urllib2.build_opener(keepalive_handler) - urllib2.install_opener(opener) - pos = {0: 'off', 1: 'on'} - for i in (0, 1): - print " fancy error handling %s (HANDLE_ERRORS = %i)" % (pos[i], i) - HANDLE_ERRORS = i - try: - fo = urllib2.urlopen(url) - foo = fo.read() - fo.close() - try: status, reason = fo.status, fo.reason - except AttributeError: status, reason = None, None - except IOError, e: - print " EXCEPTION: %s" % e - raise - else: - print " status = %s, reason = %s" % (status, reason) - HANDLE_ERRORS = orig - hosts = keepalive_handler.open_connections() - print "open connections:", hosts - keepalive_handler.close_all() - -def continuity(url): - import md5 - format = '%25s: %s' - - # first fetch the file with the normal http handler - opener = urllib2.build_opener() - urllib2.install_opener(opener) - fo = urllib2.urlopen(url) - foo = fo.read() - fo.close() - m = md5.new(foo) - print format % ('normal urllib', m.hexdigest()) - - # now install the keepalive handler and try again - opener = urllib2.build_opener(HTTPHandler()) - urllib2.install_opener(opener) - - fo = urllib2.urlopen(url) - foo = fo.read() - fo.close() - m = md5.new(foo) - print format % ('keepalive read', m.hexdigest()) - - fo = urllib2.urlopen(url) - foo = '' - while 1: - f = fo.readline() - if f: foo = foo + f - else: break - fo.close() - m = md5.new(foo) - print format % ('keepalive readline', m.hexdigest()) - -def comp(N, url): - print ' making %i connections to:\n %s' % (N, url) - - sys.stdout.write(' first using the normal urllib handlers') - # first use normal opener - opener = urllib2.build_opener() - urllib2.install_opener(opener) - t1 = fetch(N, url) - print ' TIME: %.3f s' % t1 - - sys.stdout.write(' now using the keepalive handler ') - # now install 
the keepalive handler and try again - opener = urllib2.build_opener(HTTPHandler()) - urllib2.install_opener(opener) - t2 = fetch(N, url) - print ' TIME: %.3f s' % t2 - print ' improvement factor: %.2f' % (t1/t2, ) - -def fetch(N, url, delay=0): - import time - lens = [] - starttime = time.time() - for i in range(N): - if delay and i > 0: time.sleep(delay) - fo = urllib2.urlopen(url) - foo = fo.read() - fo.close() - lens.append(len(foo)) - diff = time.time() - starttime - - j = 0 - for i in lens[1:]: - j = j + 1 - if not i == lens[0]: - print "WARNING: inconsistent length on read %i: %i" % (j, i) - - return diff - -def test_timeout(url): - global DEBUG - dbbackup = DEBUG - class FakeLogger: - def debug(self, msg, *args): print msg % args - info = warning = error = debug - DEBUG = FakeLogger() - print " fetching the file to establish a connection" - fo = urllib2.urlopen(url) - data1 = fo.read() - fo.close() - - i = 20 - print " waiting %i seconds for the server to close the connection" % i - while i > 0: - sys.stdout.write('\r %2i' % i) - sys.stdout.flush() - time.sleep(1) - i -= 1 - sys.stderr.write('\r') - - print " fetching the file a second time" - fo = urllib2.urlopen(url) - data2 = fo.read() - fo.close() - - if data1 == data2: - print ' data are identical' - else: - print ' ERROR: DATA DIFFER' - - DEBUG = dbbackup - - -def test(url, N=10): - print "checking error hander (do this on a non-200)" - try: error_handler(url) - except IOError, e: - print "exiting - exception will prevent further tests" - sys.exit() - print - print "performing continuity test (making sure stuff isn't corrupted)" - continuity(url) - print - print "performing speed comparison" - comp(N, url) - print - print "performing dropped-connection check" - test_timeout(url) - -if __name__ == '__main__': - import time - import sys - try: - N = int(sys.argv[1]) - url = sys.argv[2] - except: - print "%s " % sys.argv[0] - else: - test(url, N) diff --git a/HarvestMan-lite/harvestman/lib/common/lrucache.py b/HarvestMan-lite/harvestman/lib/common/lrucache.py deleted file mode 100755 index bd8abc4..0000000 --- a/HarvestMan-lite/harvestman/lib/common/lrucache.py +++ /dev/null @@ -1,370 +0,0 @@ -# -- coding: utf-8 -""" -lrucache.py - Length-limited O(1) LRU cache implementation - -Author: Anand B Pillai - -Created Anand B Pillai Jun 26 2007 from ASPN Python Cookbook recipe #252524. - -{http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252524} - -Original code courtesy Josiah Carlson. - -Copyright (C) 2007, Anand B Pillai. -""" -import copy -import cPickle, os, sys -import time -import cStringIO -from threading import Semaphore -from dictcache import DictCache - -class Node(object): - # __slots__ = ['prev', 'next', 'me'] - - def __init__(self, prev, me): - self.prev = prev - self.me = me - self.next = None - - def __copy__(self): - n = Node(self.prev, self.me) - n.next = self.next - - return n - - #def __getstate__(self): - # return self - -class LRU(object): - """ - Implementation of a length-limited O(1) LRU queue. - Built for and used by PyPE: - http://pype.sourceforge.net - Copyright 2003 Josiah Carlson. 
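
    A small usage sketch (added for illustration; not part of the original
    docstring). Every get or set promotes the key to "most recently used",
    and once more than 'count' keys are stored the least recently used key
    is silently dropped:

        l = LRU(2)
        l['a'] = 1; l['b'] = 2
        l['a']                  # touching 'a' makes 'b' the oldest entry
        l['c'] = 3              # 'b' (not 'a') is evicted here
        'b' in l                # -> False
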
- """ - def __init__(self, count, pairs=[]): - self.count = max(count, 1) - self.d = {} - self.first = None - self.last = None - for key, value in pairs: - self[key] = value - - def __copy__(self): - lrucopy = LRU(self.count) - lrucopy.first = copy.copy(self.first) - lrucopy.last = copy.copy(self.last) - lrucopy.d = self.d.copy() - for key,value in self.iteritems(): - lrucopy[key] = value - - return lrucopy - - def __contains__(self, obj): - return obj in self.d - - def __getitem__(self, obj): - - a = self.d[obj].me - self[a[0]] = a[1] - return a[1] - - def __setitem__(self, obj, val): - if obj in self.d: - del self[obj] - nobj = Node(self.last, (obj, val)) - if self.first is None: - self.first = nobj - if self.last: - self.last.next = nobj - self.last = nobj - self.d[obj] = nobj - if len(self.d) > self.count: - if self.first == self.last: - self.first = None - self.last = None - return - a = self.first - if a: - if a.next: - a.next.prev = None - self.first = a.next - a.next = None - try: - del self.d[a.me[0]] - except KeyError: - pass - del a - - def __delitem__(self, obj): - nobj = self.d[obj] - if nobj.prev: - nobj.prev.next = nobj.next - else: - self.first = nobj.next - if nobj.next: - nobj.next.prev = nobj.prev - else: - self.last = nobj.prev - del self.d[obj] - - def __iter__(self): - cur = self.first - while cur != None: - cur2 = cur.next - yield cur.me[1] - cur = cur2 - - def iteritems(self): - cur = self.first - while cur != None: - cur2 = cur.next - yield cur.me - cur = cur2 - - def iterkeys(self): - return iter(self.d) - - def itervalues(self): - for i,j in self.iteritems(): - yield j - - def keys(self): - return self.d.keys() - - def clear(self): - self.d.clear() - - def __len__(self): - return len(self.d) - - -class LRU2(object): - """ - Implementation of a length-limited O(1) LRU queue - with disk caching. - """ - - # This LRU drops off items to a disk dictionary cache - # when older items are dropped. - def __init__(self, count, freq, cachedir='', pairs=[]): - self.count = max(count, 1) - self.d = {} - self.lastmutex = Semaphore(1) - self.first = None - self.last = None - for key, value in pairs: - self[key] = value - # Set the frequency to something like 1/100th of - # the expected dictionary final size to achieve - # best performance. 
- self.diskcache = DictCache(freq, cachedir) - - def __copy__(self): - lrucopy = LRU(self.count) - lrucopy.first = copy.copy(self.first) - lrucopy.last = copy.copy(self.last) - lrucopy.d = self.d.copy() - for key,value in self.iteritems(): - lrucopy[key] = value - - return lrucopy - - def __contains__(self, obj): - return obj in self.d - - def __getitem__(self, obj): - try: - a = self.d[obj].me - self[a[0]] = a[1] - return a[1] - except (KeyError,AttributeError): - return self.diskcache[obj] - - def __setitem__(self, obj, val): - if obj in self.d: - del self[obj] - nobj = Node(self.last, (obj, val)) - if self.first is None: - self.first = nobj - self.lastmutex.acquire() - try: - if self.last: - self.last.next = nobj - self.last = nobj - except: - pass - self.lastmutex.release() - self.d[obj] = nobj - if len(self.d) > self.count: - self.lastmutex.acquire() - try: - if self.first == self.last: - self.first = None - self.last = None - self.lastmutex.release() - return - except: - pass - self.lastmutex.release() - a = self.first - if a: - if a.next: - a.next.prev = None - self.first = a.next - a.next = None - try: - key, val = a.me[0], self.d[a.me[0]] - del self.d[a.me[0]] - del a - self.diskcache[key] = val.me[1] - except (KeyError,AttributeError): - pass - - def __delitem__(self, obj): - nobj = self.d[obj] - if nobj.prev: - nobj.prev.next = nobj.next - else: - self.first = nobj.next - if nobj.next: - nobj.next.prev = nobj.prev - else: - self.last = nobj.prev - del self.d[obj] - - def __iter__(self): - cur = self.first - while cur != None: - cur2 = cur.next - yield cur.me[1] - cur = cur2 - - def iteritems(self): - cur = self.first - while cur != None: - cur2 = cur.next - yield cur.me - cur = cur2 - - def iterkeys(self): - return iter(self.d) - - def itervalues(self): - for i,j in self.iteritems(): - yield j - - def keys(self): - return self.d.keys() - - def clear(self): - self.d.clear() - self.diskcache.clear() - - def __len__(self): - return len(self.d) - - def get_stats(self): - """ Return statistics as a dictionary """ - - return self.diskcache.get_stats() - - def test(self, N): - - # Test to see if the diskcache works. Pass - # the total number of items added to this - # function... - - flag = True - - for x in range(N): - if self[x] == None: - flag = False - break - - return flag - -def test_lru2(): - import random - - n1, n2 = 10000, 5000 - - l=LRU2(n1, 100) - for x in range(n1): - l[x] = x + 1 #urlparser.HarvestManUrlParser('htt://www.python.org/doc/current/tut/tut.html') - - # make use of first n2 random entries - for x in range(n2): - l[random.randint(0,n2)] - - # Add another n2 entries - # This will cause the LRU to drop - # entries and cache old entries. - for x in range(n2): - l[n1+x] = x + 1 #urlparser.HarvestManUrlParser('htt://www.python.org/doc/current/tut/tut.html') # x + 1 - - print l.test(n1+n2) - - print 'Random access test...' - # Try to access random entries - for x in range(n1+n2): - # A random access will take more time since in-mem - # cache will be emptied more often - l[random.randint(0,n1+n2-1)] - - print - print "Disk Hits",l.diskcache.dhits - print "Mem Hits",l.diskcache.mhits - print "Temp dict Hits",l.diskcache.thits - print "Time taken",l.diskcache.t - print 'Hit %=>',100*float(l.diskcache.dhits)/float(n1+n2) - print 'Time per disk cache hit=>',float(l.diskcache.t)/float(l.diskcache.dhits) - print 'Average disk access time=>',float(l.diskcache.t)/float(len(l.diskcache)) - - l.diskcache.clear_counters() - - print 'Sequential access test...' 
- - for x in range(n1+n2): - # A sequential access will be faster since in-mem cache - # will be hit sequentially... - l[x] - - print - - print "Disk Hits",l.diskcache.dhits - print "Mem Hits",l.diskcache.mhits - print "Temp dict Hits",l.diskcache.thits - print "Time taken",l.diskcache.t - print 'Hit %=>',100*float(l.diskcache.dhits)/float(n1+n2) - print 'Time per disk cache hit=>',float(l.diskcache.t)/float(l.diskcache.dhits) - print 'Average disk access time=>',float(l.diskcache.t)/float(len(l.diskcache)) - - l.clear() - -if __name__=="__main__": - test_lru2() - ## l = LRU2(10) -## for x in range(10): -## l[x] = x -## print l.keys() -## print l[3] -## print l[3] -## print l[9] -## print l[9] - -## l[12]=11 -## l[13]=12 -## l[14]=15 -## l[15]=16 -## l[16]=17 -## l[17]=18 -## l[18]=19 -## l[19]=20 -## print l.keys() -## print len(l) -## print l[0] -## print l[1] -## print l[2] -## print copy.copy(l).keys() diff --git a/HarvestMan-lite/harvestman/lib/common/macros.py b/HarvestMan-lite/harvestman/lib/common/macros.py deleted file mode 100755 index 63cd500..0000000 --- a/HarvestMan-lite/harvestman/lib/common/macros.py +++ /dev/null @@ -1,185 +0,0 @@ -# -- coding: utf-8 -""" -macros.py - Defining macro variables for use by other -modules. - -Created Anand B Pillai Oct 5 2007 - -Copyright (C) 2007, Anand B Pillai. -""" - -class HarvestManMacroVariable(type): - """ A metaclass for HarvestMan macro variables """ - - PIDX = 0 - NIDX = 0 - macrodict = {} - - def __new__(cls, name, bases=(), dct={}): - - val = dct.get('value') - if val != None: - dct['index'] = val - - elif dct.get('negate'): - cls.NIDX -= 1 - dct['index'] = cls.NIDX - else: - cls.PIDX += 1 - dct['index'] = cls.PIDX - - item = type.__new__(cls, name, bases, dct) - cls.macrodict[name] = item - return item - - def __init__(cls, name, bases=(), dct={}): - pass - def __str__(self): - return '%d' % (self.index) - - def __eq__(self, number): - # Makes it easy to do things like - # THREAD_IDLE == 0 in code. 
- return self.index == number - - def __lt__(self, number): - - return self.index < number - - def __gt__(self, number): - - return self.index > number - - def __le__(self, number): - - return self.index <= number - - def __ge__(self, number): - - return self.index >= number - - -def DEFINE_MACRO(name, val=None): - """ A factory function for defining macros """ - - if val != None: - globals()[name] = HarvestManMacroVariable(name, dct={'value': val}) - else: - globals()[name] = HarvestManMacroVariable(name) - -def DEFINE_NEGATIVE_MACRO(name, val=None): - """ A factory function for defining macros with negative values """ - - if val != None: - globals()[name] = HarvestManMacroVariable(name, dct={'value': val,'negate': True}) - else: - globals()[name] = HarvestManMacroVariable(name, dct={'negate': True}) - - -def SUCCESS(status): - return (status > 0) - -DEFINE_ERROR_MACRO = DEFINE_NEGATIVE_MACRO - -# Special (predefined) macros -DEFINE_MACRO("HARVESTMAN_OK", 1) -DEFINE_MACRO("HARVESTMAN_FAIL", -1) -DEFINE_MACRO("OPTION_TURN_OFF", 0) -DEFINE_MACRO("OPTION_TURN_ON", 1) -DEFINE_MACRO("CONNECTOR_DATA_MODE_FLUSH", 0) -DEFINE_MACRO("CONNECTOR_DATA_MODE_INMEM", 1) - -# Success macros -DEFINE_MACRO("RESTORE_STATE_OK") -DEFINE_MACRO("SAVE_STATE_OK") -DEFINE_MACRO("CONFIG_FILE_EXISTS") -DEFINE_MACRO("CONFIG_FILE_PARSE_OK") -DEFINE_MACRO("CONFIG_OPTION_SET") -DEFINE_MACRO("CONFIG_ITEM_SKIPPED") -DEFINE_MACRO("CONFIG_OPTION_NOT_DEFINED") -DEFINE_MACRO("CONFIG_ARGUMENT_OK") -DEFINE_MACRO("CONFIG_ARGUMENTS_OK") -DEFINE_MACRO("PROJECT_FILE_EXISTS", 0) -DEFINE_MACRO("CONFIGURE_PROTOCOL_OK") -DEFINE_MACRO("CONNECT_MULTIPART_DOWNLOAD") -DEFINE_MACRO("CONNECT_NO_UPTODATE") -DEFINE_MACRO("CONNECT_YES_DOWNLOADED") -DEFINE_MACRO("DOWNLOAD_YES_WITH_MODIFICATION") -DEFINE_MACRO("DOWNLOAD_NO_UPTODATE") -DEFINE_MACRO("DOWNLOAD_NO_CACHE_SYNCED") -DEFINE_MACRO("DOWNLOAD_YES_OK") -DEFINE_MACRO("URL_PUSHED_TO_POOL") -DEFINE_MACRO("CREATE_DIRECTORY_OK") -DEFINE_MACRO("URL_DOWNLOAD_OK") -DEFINE_MACRO("DATA_ALREADY_PRESENT") -DEFINE_MACRO("FILE_WRITE_OK") -DEFINE_MACRO("WRITE_URL_OK") -DEFINE_MACRO("DUMP_URL_OK") -DEFINE_MACRO("PROJECT_FILE_READ_OK") -DEFINE_MACRO("PROJECT_FILE_WRITE_OK") -DEFINE_MACRO("WRITE_URL_HEADERS_OK") -DEFINE_MACRO("BROWSE_FILE_WRITE_OK") -DEFINE_MACRO("LINK_FILTERED") -DEFINE_MACRO("LINK_NOT_FILTERED") -DEFINE_MACRO("LINK_EMPTY") -DEFINE_MACRO("ANCHOR_LINK_FOUND") -DEFINE_MACRO("SET_STATE_OK") -DEFINE_MACRO("THREAD_MIGRATION_OK") -DEFINE_MACRO("MULTIPART_DOWNLOAD_QUEUED") -DEFINE_MACRO("MULTIPART_DOWNLOAD_COMPLETED") -DEFINE_MACRO("MULTIPART_DOWNLOAD_STATUS_UNKNOWN") -DEFINE_MACRO("HGET_DOWNLOAD_OK") - -# Error macros -DEFINE_ERROR_MACRO("SAVE_STATE_NOT_OK") -DEFINE_ERROR_MACRO("RESTORE_STATE_NOT_OK") -DEFINE_ERROR_MACRO("CONFIG_FILE_DOES_NOT_EXIST") -DEFINE_ERROR_MACRO("CONFIG_FILE_PARSE_ERROR") -DEFINE_ERROR_MACRO("CONFIG_VALUE_EMPTY") -DEFINE_ERROR_MACRO("CONFIG_VALUE_MISMATCH") -DEFINE_ERROR_MACRO("CONFIG_OPTION_NOT_SET") -DEFINE_ERROR_MACRO("CONFIG_OPTION_ASSIGN_ERROR") -DEFINE_ERROR_MACRO("CONFIG_INVALID_ARGUMENT") -DEFINE_ERROR_MACRO("CONFIG_ARGUMENT_ERROR") -DEFINE_ERROR_MACRO("CONNECT_NO_RULES_VIOLATION") -DEFINE_ERROR_MACRO("CONNECT_NO_FILTERED") -DEFINE_ERROR_MACRO("CONNECT_NO_ERROR") -DEFINE_ERROR_MACRO("CONNECT_DOWNLOAD_ABORTED") -DEFINE_ERROR_MACRO("DOWNLOAD_NO_ERROR") -DEFINE_ERROR_MACRO("DOWNLOAD_NO_WRITE_FILTERED") -DEFINE_ERROR_MACRO("DOWNLOAD_NO_RULE_VIOLATION") -DEFINE_ERROR_MACRO("DOWNLOAD_NO_CACHE_SYNC_FAILED") -DEFINE_ERROR_MACRO("CREATE_DIRECTORY_NOT_OK") 
-DEFINE_ERROR_MACRO("URL_DOWNLOAD_FAILED") -DEFINE_ERROR_MACRO("DATA_DOWNLOAD_ERROR") -DEFINE_ERROR_MACRO("DATA_EMPTY_ERROR") -DEFINE_ERROR_MACRO("FILE_WRITE_ERROR") -DEFINE_ERROR_MACRO("WRITE_URL_FAILED") -DEFINE_ERROR_MACRO("NULL_URLOBJECT_ERROR") -DEFINE_ERROR_MACRO("INVALID_ARCHIVE_FORMAT") -DEFINE_ERROR_MACRO("FILE_TRUNCATE_ERROR") -DEFINE_ERROR_MACRO("DUMP_URL_ERROR") -DEFINE_ERROR_MACRO("PROJECT_FILE_READ_ERROR") -DEFINE_ERROR_MACRO("PROJECT_FILE_WRITE_ERROR") -DEFINE_ERROR_MACRO("PROJECT_FILE_REMOVE_ERROR") -DEFINE_ERROR_MACRO("WRITE_URL_HEADERS_ERROR") -DEFINE_ERROR_MACRO("BROWSE_FILE_NOT_FOUND") -DEFINE_ERROR_MACRO("BROWSE_FILE_READ_ERROR") -DEFINE_ERROR_MACRO("BROWSE_FILE_EMPTY") -DEFINE_ERROR_MACRO("BROWSE_FILE_INVALID") -DEFINE_ERROR_MACRO("BROWSE_FILE_WRITE_ERROR") -DEFINE_ERROR_MACRO("ANCHOR_LINK_NOT_FOUND") -DEFINE_ERROR_MACRO("SET_STATE_ERROR") -DEFINE_ERROR_MACRO("THREAD_MIGRATION_ERROR") -DEFINE_ERROR_MACRO("MULTIPART_DOWNLOAD_ERROR") -DEFINE_ERROR_MACRO("HGET_FATAL_ERROR") -DEFINE_ERROR_MACRO("HGET_KEYBOARD_INTERRUPT") -DEFINE_ERROR_MACRO("HGET_DOWNLOAD_ERROR") -DEFINE_ERROR_MACRO("MIRRORS_NOT_FOUND") -DEFINE_ERROR_MACRO("WRITE_URL_FILTERED") -DEFINE_ERROR_MACRO("WRITE_URL_BLOCKED") -DEFINE_ERROR_MACRO("CONTROLLER_EXIT") - -if __name__ == "__main__": - for key, val in HarvestManMacroVariable.macrodict.iteritems(): - print key,'=>',val.index diff --git a/HarvestMan-lite/harvestman/lib/common/netinfo.py b/HarvestMan-lite/harvestman/lib/common/netinfo.py deleted file mode 100755 index 080e85a..0000000 --- a/HarvestMan-lite/harvestman/lib/common/netinfo.py +++ /dev/null @@ -1,184 +0,0 @@ -""" -netinfo - Module summarizing information regarding protocols, -ports, file extensions, regular expressions for analyzing URLs etc -for HarvestMan. - -Created Anand B Pillai Feb 22 2008, moving - content from urlparser.py - -Copyright (C) 2008, Anand B Pillai. -""" - -import re - -URLSEP = '/' # URL separator character -PROTOSEP = '//' # String which separates a protocol string from the rest of URL -DOTDOT = '..' # A convenient name for the .. string -DOT = '.' # A convenient name for the . 
string -PORTSEP = ':' # Protocol separator character, character which separates the protocol - # string from rest of URL -BACKSLASH = '\\' # A convenient name for the backslash character - -# Mapping popular protocols to most widely used port numbers -protocol_map = { "http://" : 80, - "ftp://" : 21, - "https://" : 443, - "file://": 0, - "file:": 0 - } - -# Popular image types file extensions -image_extns = ('.bmp', '.dib', '.dcx', '.emf', '.fpx', '.gif', '.ico', '.img', - '.jp2', '.jpc', '.j2k', '.jpf', '.jpg', '.jpeg', '.jpe', - '.mng', '.pbm', '.pcd', '.pcx', '.pgm', '.png', '.ppm', - '.psd', '.ras', '.rgb', '.tga', '.tif', '.tiff', '.wbmp', - '.xbm', '.xpm') - -# Popular video types file extensions -movie_extns = ('.3gp', '.avi', '.asf','.asx', '.avs', '.bay', - '.bik', '.bsf', '.dat', '.dv' ,'.dvr-ms', 'flc', - '.flv', '.ivf', '.m1v', '.m2ts', '.m2v', '.m4v', - '.mgv', '.mkv', '.mov', '.mp2v', '.mp4', '.mpe', - '.mpeg', '.mpg', '.ogm', '.qt', '.ratDVD', '.rm', - '.smi', '.vob', '.wm', '.wmv', '.xvid' ) - -# Popular audio types file extensions -sound_extns = ('.aac', '.aif', '.aiff', '.aifc', '.aifr', '.amr', - '.ape' ,'.asf', '.au', '.aud', '.aup', '.bwf', - '.cda', '.dct', '.dss', '.dts', '.dvf', '.esu', - '.eta', '.flac', '.gsm', '.jam', '.m4a', '.m4p', - '.mdi', '.mid', '.midi', '.mka', '.mod', '.mp1', '.mp2', - '.mp3', '.mpa', '.mpc', '.mpega', '.msv', '.mus', - '.nrj', '.nwc', '.nwp', '.ogg', '.psb', '.psm', '.ra', - '.ram', '.rel', '.sab', '.shn', '.smf', '.snd', '.speex', - '.tta', '.vox', '.vy3', '.wav', '.wave', '.wma', - '.wpk', '.wv', '.wvc') - -# Most common web page url file extensions -# including dynamic server pages & cgi scripts. -webpage_extns = ('', '.htm', '.html', '.shtm', '.shtml', '.php', - '.php3','.php4','.asp', '.aspx', '.jsp','.psp','.pl', - '.cgi', '.stx', '.cfm', '.cfml', '.cms', '.ars') - - -# Document extensions -document_extns = ('.doc','.rtf','.odt','.odp','.ott','.sxw','.stw', - '.sdw','.vor','.pdf','.ps') - -# Extensions for flash/flash source code/flash action script -flash_extns = ('.swf', '.fla', '.mxml', '.as', '.abc') - -# Web-page extensions which automatically default to directories -# These are special web-page types which are web-pages as well -# as directories. Most common example is the .ars file extension -# of arstechnica.com. -default_directory_extns = ('.ars',) - -# Most common stylesheet url file extensions -stylesheet_extns = ( '.css', ) - -# Regular expression for matching -# urls which contain white spaces -wspacere = re.compile(r'\s+\S+', re.LOCALE|re.UNICODE) - -# Regular expression for anchor tags -anchore = re.compile(r'\#+') - -# jkleven: Regex if we still don't recognize a URL address as HTML. Only -# to be used if we've looked at everything else and URL still isn't -# a known type. This regex is similar to one in pageparser.py but -# we changed a few '*' to '+' to get one or more matches. -# form_re = re.compile(r'[-.:_a-zA-Z0-9]+\?[-.:_a-zA-Z0-9]+=[-.a:_-zA-Z0-9]*', re.UNICODE) - -# Made this more generic and lenient. -form_re = re.compile(r'([^&=\?]*\?)([^&=\?]*=[^&=\?])*', re.UNICODE) - -# Junk chars which cannot be part of valid filenames -junk_chars = ('?','*','"','<','>','!',':','/','\\') - -# Replacement chars -junk_chars_repl = ('',)*len(junk_chars) - -# Dirty chars which need to be hex encoded in URLs (apart from white-space) -# We are assuming that there won't be many idiots who would put a backslash in a URL... 
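# (Added note: each character below maps positionally to its hex escape in
# dirty_chars_repl, e.g. '<' -> '%3C' and '|' -> '%7C'.)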
-dirty_chars = ('<','>','(',')','{','}','[',']','^','`','|') - -# These are replaced with their hex counterparts -dirty_chars_repl = ('%3C','%3E','%28','%29','%7B','%7D','%5B','%5D','%5E','%60','%7C') - -# %xx char replacement regexp -percent_repl = re.compile(r'\%[a-f0-9][a-f0-9]', re.IGNORECASE) -# params_re = re.compile(r'([-.:_a-zA-Z0-9]*=[-.a:_-zA-Z0-9]*)+', re.UNICODE) -# params_re = re.compile(r'([-.:_a-zA-Z0-9]*=[^\&]*)+', re.UNICODE) - -# Regexp which extracts params from query URLs, most generic -params_re = re.compile(r'([^&=\?]*=[^&=\?]*)', re.UNICODE) -# Regular expression for validating a query param group (such as "lang=en") -param_re = re.compile(r'([^&=\?]+=[^&=\?\s]+)', re.UNICODE) - -# ampersand regular expression at URL end -ampersand_re = re.compile(r'\&+$') -# question mark regular expression at URL end -question_re = re.compile(r'\?+$') -# Regular expression for www prefixes at front of a string -www_re = re.compile(r'^www(\d*)\.') -# Regular expression for www prefixes anywhere -www2_re = re.compile(r'www(\d*)\.') - -# List of TLD (top-level domain) name endings from http://data.iana.org/TLD/tlds-alpha-by-domain.txt - -tlds = ['ac', 'ad', 'ae', 'aero', 'af', 'ag', 'ai', 'al', 'am', 'an', 'ao', 'aq', - 'ar', 'arpa', 'as', 'asia', 'at', 'au', 'aw','ax', 'az', 'ba', 'bb', 'bd', - 'be', 'bf', 'bg', 'bh', 'bi', 'biz', 'bj', 'bm', 'bn', 'bo', 'br', 'bs', - 'bt', 'bv', 'bw', 'by', 'bz', 'ca', 'cat', 'cc', 'cd', 'cf', 'cg', 'ch', - 'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'com', 'coop', 'cr', 'cu', 'cv', 'cx', - 'cy', 'cz', 'de', 'dj', 'dk', 'dm', 'do', 'dz', 'ec', 'edu', 'ee', 'eg', - 'er', 'es', 'et', 'eu', 'fi', 'fj', 'fk', 'fm', 'fo', 'fr', 'ga', 'gb', - 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gov', 'gp', 'gq', - 'gr', 'gs', 'gt', 'gu', 'gw', 'gy', 'hk', 'hm', 'hn', 'hr', 'ht', 'hu', - 'id', 'ie', 'il', 'im', 'in', 'info', 'int', 'io', 'iq', 'ir', 'is', - 'it', 'je', 'jm', 'jo', 'jobs', 'jp', 'ke', 'kg', 'kh', 'ki', 'km', 'kn', - 'kp', 'kr', 'kw', 'ky', 'kz', 'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', - 'lt', 'lu', 'lv', 'ly', 'ma', 'mc', 'md', 'me', 'mg', 'mh', 'mil', 'mk', - 'ml', 'mm', 'mn', 'mo', 'mobi', 'mp', 'mq', 'mr', 'ms', 'mt', 'mu', - 'museum', 'mv', 'mw', 'mx', 'my', 'mz', 'na', 'name', 'nc', 'ne', 'net', - 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om', 'org', 'pa', - 'pe', 'pf', 'pg', 'ph', 'pk', 'pl', 'pm', 'pn', 'pr', 'pro', 'ps', 'pt', - 'pw', 'py', 'qa', 're', 'ro', 'rs', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd', - 'se', 'sg', 'sh', 'si', 'sj', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', - 'su', 'sv', 'sy', 'sz', 'tc', 'td', 'tel', 'tf', 'tg', 'th', 'tj', 'tk', - 'tl', 'tm', 'tn', 'to', 'tp', 'tr', 'travel', 'tt', 'tv', 'tw', 'tz', - 'ua', 'ug', 'uk', 'um', 'us', 'uy', 'uz', 'va', 'vc', 've', 'vg', 'vi', - 'vn', 'vu', 'wf', 'ws', 'xn--0zwm56d', 'xn--11b5bs3a9aj6g', 'xn--80akhbyknj4f', - 'xn--9t4b11yi5a', 'xn--deba0ad', 'xn--g6w251d', 'xn--hgbk6aj7f53bba', - 'xn--hlcj6aya9esc7a', 'xn--jxalpdlp', 'xn--kgbechtv', 'xn--zckzah', - 'ye', 'yt', 'yu', 'za', 'zm', 'zw'] - -def get_base_server(server): - """ Return the base server name of the passed - server (domain) name """ - - # If the server name is of the form say bar.foo.com - # or vodka.bar.foo.com, i.e there are more than one - # '.' in the name, then we need to return the - # last string containing a dot in the middle. 
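    # (Added examples, for illustration -- they follow from the comments
    # above and the tlds list:)
    #   get_base_server('vodka.bar.foo.com')         -> 'foo.com'
    #   get_base_server('games.mobileworld.mobi.uk') -> 'mobileworld.mobi.uk'
    #   get_base_server('foo.com')                   -> 'foo.com' (no change)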
- if server.count('.') > 1: - dotstrings = server.split('.') - # now the list is of the form => [vodka, bar, foo, com] - - # Skip the list for skipping over tld domain name endings - # such as .org.uk, .mobi.uk etc. For example, if the - # server is games.mobileworld.mobi.uk, then we - # need to return mobileworld.mobi.uk, not mobi.uk - dotstrings.reverse() - idx = 0 - - for item in dotstrings: - if item.lower() in tlds: - idx += 1 - - return '.'.join(dotstrings[idx::-1]) - else: - # The server is of the form foo.com or just "foo" - # so return it straight away - return server diff --git a/HarvestMan-lite/harvestman/lib/common/optionparser.py b/HarvestMan-lite/harvestman/lib/common/optionparser.py deleted file mode 100755 index 6c6cea7..0000000 --- a/HarvestMan-lite/harvestman/lib/common/optionparser.py +++ /dev/null @@ -1,286 +0,0 @@ -# -- coding: utf-8 -""" -optionparser.py - Generic option parser class. This class -can be used to write code that will parse command line options -for an application by invoking one of the standard Python -library command argument parser modules optparse or -getopt. - -The class first tries to use optparse. It it is not there -(< Python 2.3), it invokes getopt. However, this is -transparent to the application which uses the class. - -The class requires a list with tuple entries of the following -form for each command line option. - -('option_var', 'short=','long=', -'help=', 'meta=','default=', -'type=
links. - # So in order to filer images fully, we need to check the wp.links list also. - # Sample site: http://www.sheppeyseacadets.co.uk/gallery_2.htm - if self._configobj.images: - links += self.wp.images - else: - # Filter any links with image extensions out from links - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.image_extns] - - #for typ, link in links: - # print 'Link=>',link - - self.wp.reset() - - # Filter like that for video, flash & audio - if not self._configobj.movies: - # Filter any links with video extension out from links... - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.movie_extns] - - if not self._configobj.flash: - # Filter any links with flash extension out from links... - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.flash_extns] - - - if not self._configobj.sounds: - # Filter any links with audio extension out from links... - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.sound_extns] - - if not self._configobj.documents: - # Filter any links with popular documents extension out from links... - links = [(type, link) for type, link in links if link[link.rfind('.'):].lower() not in \ - netinfo.document_extns] - - links = self.offset_links(links) - # print "Filtered links",links - - # Create collection object - coll = HarvestManAutoUrlCollection(url_obj) - - children = [] - for typ, url in links: - - is_cgi, is_php = False, False - - # Not sure of the logical validity of the following 2 lines anymore...! - # This is old code... - if url.find('php?') != -1: is_php = True - if typ == 'form' or is_php: is_cgi = True - - if not url or len(url)==0: continue - # print 'URL=>',url,url_obj.get_full_url() - - try: - child_urlobj = urlparser.HarvestManUrl(url, - typ, - is_cgi, - url_obj) - - # print url, child_urlobj.get_full_url() - - if objects.datamgr.check_exists(child_urlobj): - continue - else: - objects.datamgr.add_url(child_urlobj) - coll.addURL(child_urlobj) - children.append(child_urlobj) - - except urlparser.HarvestManUrlError, e: - error('URL Error:', e) - continue - - # objects.queuemgr.endloop(True) - - # Update the document again... - for child in children: - document.add_child(child) - - if not objects.queuemgr.push((url_obj.priority, coll, document), 'fetcher'): - if self._pushflag: self.buffer.append((url_obj.priority, coll, document)) - - # Update links called here - objects.datamgr.update_links(url_obj, coll) - - - return data - - elif self.url.is_stylesheet() and data: - - # Parse stylesheet to find all contained URLs - # including imported stylesheets, if any. - - # Create a document and keep updating it -this is useful to provide - # information to events... - document = url_obj.make_document(data, [], '', []) - - sp = pageparser.HarvestManCSSParser() - sp.feed(data) - - links = self.offset_links(sp.links) - - # Filter the CSS URLs also w.r.t rules - # Filter any links with image extensions out from links - if not self._configobj.images: - links = [link for link in links if link[link.rfind('.'):].lower() not in netinfo.image_extns] - - children = [] - - # Create collection object - coll = HarvestManAutoUrlCollection(self.url) - - # Add these links to the queue - for url in links: - if not url: continue - - # There is no type information - so look at the - # extension of the URL. 
If ending with .css then - # add as stylesheet type, else as generic type. - - if url.lower().endswith('.css'): - urltyp = URL_TYPE_STYLESHEET - else: - urltyp = URL_TYPE_ANY - - try: - child_urlobj = urlparser.HarvestManUrl(url, - urltyp, - False, - self.url) - - - if objects.datamgr.check_exists(child_urlobj): - continue - else: - objects.datamgr.add_url(child_urlobj) - coll.addURL(child_urlobj) - children.append(child_urlobj) - - except urlparser.HarvestManUrlError: - continue - - # Update the document... - for child in children: - document.add_child(child) - - if not objects.queuemgr.push((self.url.priority, coll, document), 'fetcher'): - if self._pushflag: self.buffer.append((self.url.priority, coll, document)) - - # Update links called here - objects.datamgr.update_links(self.url, coll) - - # Successful return returns data - return data - else: - # Dont do anything - return None - - -class HarvestManUrlDownloader(HarvestManUrlFetcher, HarvestManUrlCrawler): - """ This is a mixin class which does both the jobs of crawling webpages - and download urls """ - - def __init__(self, index, url_obj = None, isThread=True): - HarvestManUrlFetcher.__init__(self, index, url_obj, isThread) - self.set_url_object(url_obj) - - def _initialize(self): - HarvestManUrlFetcher._initialize(self) - HarvestManUrlCrawler._initialize(self) - self._role = 'downloader' - - def set_url_object(self, obj): - HarvestManUrlFetcher.set_url_object(self, obj) - - def set_url_object2(self, obj): - HarvestManUrlCrawler.set_url_object(self, obj) - - def exit_condition(self, caller): - - # Exit condition for single thread case - if caller=='crawler': - return (objects.queuemgr.data_q.qsize()==0) - elif caller=='fetcher': - return (objects.queuemgr.url_q.qsize()==0) - - return False - - def is_exit_condition(self): - - return (self.exit_condition('crawler') and self.exit_condition('fetcher')) - - def action(self): - - if self._isThread: - self._loops = 0 - - while not self._endflag: - obj = objects.queuemgr.get_url_data("downloader") - if not obj: continue - - self.set_url_object(obj) - - self.process_url() - self.crawl_url() - - self._loops += 1 - self.sleep() - else: - while True: - self.process_url() - - obj = objects.queuemgr.get_url_data( "crawler" ) - if obj: self.set_url_object2(obj) - - if self.url.is_webpage(): - self.crawl_url() - - obj = objects.queuemgr.get_url_data("fetcher" ) - self.set_url_object(obj) - - if self.is_exit_condition(): - break - - def process_url(self): - - # First process urls using fetcher's algorithm - HarvestManUrlFetcher.process_url(self) - - def crawl_url(self): - HarvestManUrlCrawler.crawl_url(self) - - - diff --git a/HarvestMan-lite/harvestman/lib/datamgr.py b/HarvestMan-lite/harvestman/lib/datamgr.py deleted file mode 100755 index 216a035..0000000 --- a/HarvestMan-lite/harvestman/lib/datamgr.py +++ /dev/null @@ -1,1383 +0,0 @@ -# -- coding: utf-8 -""" datamgr.py - Data manager module for HarvestMan. - This module is part of the HarvestMan program. - - Author: Anand B Pillai - - Oct 13 2006 Anand Removed data lock since it is not required - Python GIL - automatically locks byte operations. - - Feb 2 2007 Anand Re-added function parse_style_sheet which went missing. - - Feb 26 2007 Anand Fixed bug in check_duplicate_download for stylesheets. - Also rewrote logic. - - Mar 05 2007 Anand Added method get_last_modified_time_and_data to support - server-side cache checking using HTTP 304. Fixed a small - bug in css url handling. - Apr 19 2007 Anand Made to work with URL collections. 
Moved url mapping - dictionary here. Moved CSS parsing logic to pageparser - module. - Feb 13 2008 Anand Replaced URL dictionary with disk caching binary search - tree. Other changes done later -> Got rid of many - redundant lists which were wasting memory. Need to trim - this further. - - Feb 14 2008 Anand Many changes. Replaced/removed datastructures. Merged - cache updating functions. Details in doc/Datastructures.txt . - - April 4 2008 Anand Added update_url method and corresponding update method - in bst.py to update state of URLs after download. Added - statement to print broken links information at end. - - Jan 13 2008 Anand Better check for thread download in download_url method. - Added method 'parseable' in urlparser.py for the same. - - Copyright (C) 2004 Anand B Pillai. - -""" - -__version__ = '2.0 b1' -__author__ = 'Anand B Pillai' - -import os, sys -import shutil -import time -import math -import re -import sha -import copy -import random -import shelve -import tarfile -import zlib - -import threading -# Utils -from harvestman.lib import utils -from harvestman.lib import urlparser - -from harvestman.lib.mirrors import HarvestManMirrorManager -from harvestman.lib.db import HarvestManDbManager - -from harvestman.lib.urlthread import HarvestManUrlThreadPool -from harvestman.lib.connector import * -from harvestman.lib.methodwrapper import MethodWrapperMetaClass - -from harvestman.lib.common.common import * -from harvestman.lib.common.macros import * -from harvestman.lib.common.bst import BST -from harvestman.lib.common.pydblite import Base - - -# Defining pluggable functions -__plugins__ = { 'download_url_plugin': 'HarvestManDataManager:download_url', - 'post_download_setup_plugin': 'HarvestManDataManager:post_download_setup', - 'print_project_info_plugin': 'HarvestManDataManager:print_project_info', - 'dump_url_tree_plugin': 'HarvestManDataManager:dump_url_tree'} - -# Defining functions with callbacks -__callbacks__ = { 'download_url_callback': 'HarvestManDataManager:download_url', - 'post_download_setup_callback' : 'HarvestManDataManager:post_download_setup' } - -class HarvestManDataManager(object): - """ The data manager cum indexer class """ - - # For supporting callbacks - __metaclass__ = MethodWrapperMetaClass - alias = 'datamgr' - - def __init__(self): - self.reset() - - def reset(self): - # URLs which failed with any error - self._numfailed = 0 - # URLs which failed even after a re-download - self._numfailed2 = 0 - # URLs which were retried - self._numretried = 0 - self.cache = None - self.savedfiles = 0 - self.reposfiles = 0 - self.cachefiles = 0 - self.filteredfiles = 0 - # Config object - self._cfg = objects.config - # Dictionary of servers crawled and - # their meta-data. Meta-data is - # a dictionary which currently - # has only one entry. - # i.e accept-ranges. 
- self._serversdict = {} - # byte count - self.bytes = 0L - # saved bytes count - self.savedbytes = 0L - # Redownload flag - self._redownload = False - # Mirror manager - self.mirrormgr = HarvestManMirrorManager.getInstance() - # Condition object for synchronization - self.cond = threading.Condition(threading.Lock()) - self._urldb = None - self.collections = None - - def initialize(self): - """ Do initializations per project """ - - # Url thread group class for multithreaded downloads - if self._cfg.usethreads: - self._urlThreadPool = HarvestManUrlThreadPool() - self._urlThreadPool.spawn_threads() - else: - self._urlThreadPool = None - - # URL database, a BST with disk-caching - self._urldb = BST() - # Collections database, a BST with disk-caching - self.collections = BST() - # For testing, don't set this otherwise we might - # be left with many orphaned .bidx... folders! - if not self._cfg.testing: - self._urldb.set_auto(2) - self.collections.set_auto(2) - - # Load any mirrors - self.mirrormgr.load_mirrors(self._cfg.mirrorfile) - # Set mirror search flag - self.mirrormgr.mirrorsearch = self._cfg.mirrorsearch - - def get_urldb(self): - return self._urldb - - def add_url(self, urlobj): - """ Add urlobject urlobj to the local dictionary """ - - # print 'Adding %s with index %d' % (urlobj.get_full_url(), urlobj.index) - self._urldb.insert(urlobj.index, urlobj) - - def update_url(self, urlobj): - """ Update urlobject urlobj in the local dictionary """ - - # print 'Adding %s with index %d' % (urlobj.get_full_url(), urlobj.index) - self._urldb.update(urlobj.index, urlobj) - - def get_url(self, index): - - # return self._urldict[str(index)] - return self._urldb.lookup(index) - - def get_original_url(self, urlobj): - - # Return the original URL object for - # duplicate URLs. This is useful for - # processing URL objects obtained from - # the collection object, because many - # of them might be duplicate and would - # not have any post-download information - # such a headers etc. - if urlobj.refindex != -1: - return self.get_url(urlobj.refindex) - else: - # Return the same URL object to avoid - # an check on the caller - return urlobj - - def get_proj_cache_filename(self): - """ Return the cache filename for the current project """ - - # Note that this function does not actually build the cache directory. - # Get the cache file path - if self._cfg.projdir and self._cfg.project: - cachedir = os.path.join(self._cfg.projdir, "hm-cache") - cachefilename = os.path.join(cachedir, 'cache') - - return cachefilename - else: - return '' - - def get_proj_cache_directory(self): - """ Return the cache directory for the current project """ - - # Note that this function does not actually build the cache directory. 
- # Get the cache file path - if self._cfg.projdir and self._cfg.project: - return os.path.join(self._cfg.projdir, "hm-cache") - else: - return '' - - def get_server_dictionary(self): - return self._serversdict - - def supports_range_requests(self, urlobj): - """ Check whether the given url object - supports range requests """ - - # Look up its server in the dictionary - server = urlobj.get_full_domain() - if server in self._serversdict: - d = self._serversdict[server] - return d.get('accept-ranges', False) - - return False - - def read_project_cache(self): - """ Try to read the project cache file """ - - # Get cache filename - info('Reading Project Cache...') - cachereader = utils.HarvestManCacheReaderWriter(self.get_proj_cache_directory()) - obj, found = cachereader.read_project_cache() - self._cfg.cachefound = found - self.cache = obj - if not found: - # Fresh cache - create structure... - self.cache.create('url','last_modified','etag', 'updated','location','checksum', - 'content_length','data','headers') - - # Create an index on URL - self.cache.create_index('url') - else: - pass - - def write_file_from_cache(self, urlobj): - """ Write file from url cache. This - works only if the cache dictionary of this - url has a key named 'data' """ - - ret = False - - # print 'Inside write_file_from_cache...' - url = urlobj.get_full_url() - content = self.cache._url[url] - - if len(content): - # Value itself is a dictionary - item = content[0] - if not item.has_key('data'): - return ret - else: - urldata = item['data'] - if urldata: - fileloc = item['location'] - # Write file - extrainfo("Updating file from cache=>", fileloc) - try: - if SUCCESS(self.create_local_directory(os.path.dirname(fileloc))): - f=open(fileloc, 'wb') - f.write(zlib.decompress(urldata)) - f.close() - ret = True - except (IOError, zlib.error), e: - error("Error:",e) - - return ret - - def update_cache_for_url(self, urlobj, filename, urldata, contentlen, lastmodified, tag): - """ Method to update the cache information for the URL 'url' - associated to file 'filename' on the disk """ - - # if page caching is disabled, skip this... - if not objects.config.pagecache: - return - - url = urlobj.get_full_url() - if urldata: - csum = sha.new(urldata).hexdigest() - else: - csum = '' - - # Update all cache keys - content = self.cache._url[url] - if content: - rec = content[0] - self.cache.update(rec, checksum=csum, location=filename,content_length=contentlen, - last_modified=lastmodified,etag=tag, updated=True) - if self._cfg.datacache: - self.cache.update(rec,data=zlib.compress(urldata)) - else: - # Insert as new values - if self._cfg.datacache: - self.cache.insert(url=url, checksum=csum, location=filename,content_length=contentlen,last_modified=lastmodified, - etag=tag, updated=True,data=zlib.compress(urldata)) - else: - self.cache.insert(url=url, checksum=csum, location=filename,content_length=contentlen, last_modified=lastmodified, - etag=tag, updated=True) - - - def get_url_cache_data(self, urlobj): - """ Get cached data for the URL from disk """ - - # This is returned as Unix time, i.e number of - # seconds since Epoch. - - # This will be called from connector to avoid downloading - # URL data using HTTP 304. However, we support this only - # if we have data for the URL. 
- if (not self._cfg.pagecache) or (not self._cfg.datacache): - return '' - - url = urlobj.get_full_url() - - content = self.cache._url[url] - if content: - item = content[0] - # Check if we have the data for the URL - data = item.get('data','') - if data: - try: - return zlib.decompress(data) - except zlib.error, e: - error('Error:',e) - return '' - - return '' - - def get_last_modified_time(self, urlobj): - """ Return last-modified-time and data of the given URL if it - was found in the cache """ - - # This is returned as Unix time, i.e number of - # seconds since Epoch. - - # This will be called from connector to avoid downloading - # URL data using HTTP 304. - if (not self._cfg.pagecache): - return '' - - url = urlobj.get_full_url() - - content = self.cache._url[url] - if content: - return content[0].get('last_modified', '') - else: - return '' - - def get_etag(self, urlobj): - """ Return the etag of the given URL if it was found in the cache """ - - # This will be called from connector to avoid downloading - # URL data using HTTP 304. - if (not self._cfg.pagecache): - return '' - - url = urlobj.get_full_url() - - content = self.cache._url[url] - if content: - return content[0].get('etag', '') - else: - return '' - - def is_url_cache_uptodate(self, urlobj, filename, urldata, contentlen=0, last_modified=0, etag=''): - """ Check with project cache and find out if the - content needs update """ - - # If page caching is not enabled, return False - # straightaway! - - # print 'Inside is_url_cache_uptodate...' - - if not self._cfg.pagecache: - return (False, False) - - # Return True if cache is uptodate(no update needed) - # and False if cache is out-of-date(update needed) - # NOTE: We are using an comparison of the sha checksum of - # the file's data with the sha checksum of the cache file. - - # Assume that cache is not uptodate apriori - uptodate, fileverified = False, False - - url = urlobj.get_full_url() - content = self.cache._url[url] - - if content: - cachekey = content[0] - cachekey['updated']=False - - fileloc = cachekey['location'] - if os.path.exists(fileloc) and os.path.abspath(fileloc) == os.path.abspath(filename): - fileverified=True - - # Use a cascading logic - if last_modified is available use it first - if last_modified: - if cachekey['last_modified']: - # Get current modified time - cmt = cachekey['last_modified'] - # print cmt,'=>',lmt - # If the latest page has a modified time greater than this - # page is out of date, otherwise it is uptodate - if last_modified<=cmt: - uptodate=True - - # Else if etag is available use it... - elif etag: - if cachekey['etag']: - tag = cachekey['etag'] - if etag==tag: - uptodate = True - # Finally use a checksum of actual data if everything else fails - elif urldata: - if cachekey['checksum']: - cachesha = cachekey['checksum'] - digest = sha.new(urldata).hexdigest() - - if cachesha == digest: - uptodate=True - - if not uptodate: - # Modified this logic - Anand Jan 10 06 - self.update_cache_for_url(urlobj, filename, urldata, contentlen, last_modified, etag) - - return (uptodate, fileverified) - - def conditional_cache_set(self): - """ A utility function to conditionally enable/disable - the cache mechanism """ - - # If already page cache is disabled, do not do anything - if not self._cfg.pagecache: - return - - # If the cache file exists for this project, disable - # cache, else enable it. 
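The freshness test in is_url_cache_uptodate above cascades through three pieces of evidence in order: the Last-Modified time, then the ETag, then a checksum of the downloaded data. A minimal standalone sketch of that ordering (hashlib stands in for the older sha module, and the sample values are invented):
```
import hashlib

def is_uptodate(cache_entry, last_modified=None, etag=None, data=None):
    # cache_entry mirrors the cached fields used above:
    # 'last_modified', 'etag' and 'checksum'
    uptodate = False
    if last_modified:
        if cache_entry.get('last_modified'):
            # cached copy is current if the server copy is not newer
            uptodate = last_modified <= cache_entry['last_modified']
    elif etag:
        if cache_entry.get('etag'):
            uptodate = etag == cache_entry['etag']
    elif data:
        if cache_entry.get('checksum'):
            uptodate = hashlib.sha1(data).hexdigest() == cache_entry['checksum']
    return uptodate

entry = {'last_modified': 1199145600, 'etag': '"abc123"', 'checksum': ''}
print(is_uptodate(entry, last_modified=1199145599))   # True - not newer than cache
```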
- cachefilename = self.get_proj_cache_filename() - - if os.path.exists(cachefilename) and os.path.getsize(cachefilename): - self._cfg.writecache = False - else: - self._cfg.writecache = True - - def post_download_setup(self): - """ Actions to perform after project is complete """ - - # Loop through URL db, one by one and then for those - # URLs which were downloaded but did not succeed, try again. - # But make sure we don't download links which were not-modified - # on server-side (HTTP 304) and hence were skipped. - failed = [] - # Broken links (404) - nbroken = 0 - - for node in self._urldb.preorder(): - urlobj = node.get() - # print 'URL=>',urlobj.get_full_url() - - if urlobj.status == 404: - # print 'BROKEN', urlobj.get_full_url() - nbroken += 1 - elif urlobj.qstatus == urlparser.URL_DONE_DOWNLOAD and \ - urlobj.status != 0 and urlobj.status != 304: - failed.append(urlobj) - - self._numfailed = len(failed) - # print 'BROKEN=>', nbroken - - if self._cfg.retryfailed: - info(' ') - - # try downloading again - if self._numfailed: - info('Redownloading failed links...',) - self._redownload=True - - for urlobj in failed: - if urlobj.fatal or urlobj.starturl: continue - extrainfo('Re-downloading',urlobj.get_full_url()) - self._numretried += 1 - self.thread_download(urlobj) - - # Wait for the downloads to complete... - if self._numretried: - extrainfo("Waiting for the re-downloads to complete...") - self._urlThreadPool.wait(10.0, self._cfg.timeout) - - worked = 0 - # Let us calculate the failed rate again... - for urlobj in failed: - if urlobj.status == 0: - # Download was done - worked += 1 - - self._numfailed2 = self._numfailed - worked - - # Stop the url thread pool - # Stop worker threads - self._urlThreadPool.stop_all_threads() - - # bugfix: Moved the time calculation code here. - t2=time.time() - - self._cfg.endtime = t2 - - # Write cache file - if self._cfg.pagecache and self._cfg.writecache: - cachewriter = utils.HarvestManCacheReaderWriter(self.get_proj_cache_directory()) - self.add_headers_to_cache() - cachewriter.write_project_cache(self.cache) - - # If url header dump is enabled, dump it - if self._cfg.urlheaders: - self.dump_headers() - - if self._cfg.localise: - self.localise_links() - - # Write archive file... 
- if self._cfg.archive: - self.archive_project() - - # dump url tree (dependency tree) to a file - if self._cfg.urltreefile: - self.dump_urltree() - - if not self._cfg.project: return - - nlinks = self._urldb.size - # print stats of the project - nservers, ndirs, nfiltered = objects.rulesmgr.get_stats() - nfailed = self._numfailed - numstillfailed = self._numfailed2 - - numfiles = self.savedfiles - numfilesinrepos = self.reposfiles - numfilesincache = self.cachefiles - - numretried = self._numretried - - fetchtime = self._cfg.endtime-self._cfg.starttime - - statsd = { 'links' : nlinks, - 'filtered': nfiltered, - 'processed': nlinks - nfiltered, - 'broken': nbroken, - 'extservers' : nservers, - 'extdirs' : ndirs, - 'failed' : nfailed, - 'fatal' : numstillfailed, - 'files' : numfiles, - 'filesinrepos' : numfilesinrepos, - 'filesincache' : numfilesincache, - 'retries' : numretried, - 'bytes': self.bytes, - 'fetchtime' : fetchtime, - } - - self.print_project_info(statsd) - objects.eventmgr.raise_event('post_crawl_complete', None) - - def check_exists(self, urlobj): - - # Check if this URL object exits (is a duplicate) - return self._urldb.lookup(urlobj.index) - - def update_bytes(self, count): - """ Update the global byte count """ - - self.bytes += count - - def update_saved_bytes(self, count): - """ Update the saved byte count """ - - self.savedbytes += count - - def update_file_stats(self, urlObject, status): - """ Add the passed information to the saved file list """ - - if not urlObject: return NULL_URLOBJECT_ERROR - - filename = urlObject.get_full_filename() - - if status == DOWNLOAD_YES_OK: - self.savedfiles += 1 - elif status == DOWNLOAD_NO_UPTODATE: - self.reposfiles += 1 - elif status == DOWNLOAD_NO_CACHE_SYNCED: - self.cachefiles += 1 - elif status == DOWNLOAD_NO_WRITE_FILTERED: - self.filteredfiles += 1 - - return HARVESTMAN_OK - - def update_links(self, source, collection): - """ Update the links dictionary for this collection """ - - self.collections.insert(source.index, collection) - - def thread_download(self, url): - """ Schedule download of this web document in a separate thread """ - - # Add this task to the url thread pool - if self._urlThreadPool: - url.qstatus = urlparser.URL_QUEUED - self._urlThreadPool.push(url) - - def has_download_threads(self): - """ Return true if there are any download sub-threads - running, else return false """ - - if self._urlThreadPool: - num_threads = self._urlThreadPool.has_busy_threads() - if num_threads: - return True - - return False - - def last_download_thread_report_time(self): - """ Get the time stamp of the last completed - download (sub) thread """ - - if self._urlThreadPool: - return self._urlThreadPool.last_thread_report_time() - else: - return 0 - - def kill_download_threads(self): - """ Terminate all the download threads """ - - if self._urlThreadPool: - self._urlThreadPool.end_all_threads() - - def create_local_directory(self, directory): - """ Create the directories on the disk named 'directory' """ - - # new in 1.4.5 b1 - No need to create the - # directory for raw saves using the nocrawl - # option. - if self._cfg.rawsave: - return CREATE_DIRECTORY_OK - - try: - # Fix for EIAO bug #491 - # Sometimes, however had we try, certain links - # will be saved as files, whereas they might be - # in fact directories. In such cases, check if this - # is a file, then create a folder of the same name - # and move the file as index.html to it. 
- path = directory - while path: - if os.path.isfile(path): - # Rename file to file.tmp - fname = path - os.rename(fname, fname + '.tmp') - # Now make the directory - os.makedirs(path) - # If successful, move the renamed file as index.html to it - if os.path.isdir(path): - fname = fname + '.tmp' - shutil.move(fname, os.path.join(path, 'index.html')) - - path2 = os.path.dirname(path) - # If we hit the root, break - if path2 == path: break - path = path2 - - if not os.path.isdir(directory): - os.makedirs( directory ) - extrainfo("Created => ", directory) - return CREATE_DIRECTORY_OK - except OSError, e: - error("Error in creating directory", directory) - error(str(e)) - return CREATE_DIRECTORY_NOT_OK - - return CREATE_DIRECTORY_OK - - def download_multipart_url(self, urlobj, clength): - """ Download a URL using HTTP/1.1 multipart download - using range headers """ - - # First add entry of this domain in - # dictionary, if not there - domain = urlobj.get_full_domain() - orig_url = urlobj.get_full_url() - old_urlobj = urlobj.get_original_state() - - domain_changed_a_lot = False - - # If this was a re-directed URL, check if there is a - # considerable change in the domains. If there is, - # there is a very good chance that the original URL - # is redirecting to mirrors, so we can split on - # the original URL and this would automatically - # produce a split-mirror download without us having - # to do any extra work! - if urlobj.redirected and old_urlobj != None: - old_domain = old_urlobj.get_domain() - if old_domain != domain: - # Check if it is somewhat similar - # if domain.find(old_domain) == -1: - domain_changed_a_lot = True - - try: - self._serversdict[domain] - except KeyError: - self._serversdict[domain] = {'accept-ranges': True} - - if self.mirrormgr.mirrors_available(urlobj): - return self.mirrormgr.download_multipart_url(urlobj, clength, self._cfg.numparts, self._urlThreadPool) - else: - if domain_changed_a_lot: - urlobj = old_urlobj - # Set a flag to indicate this - urlobj.redirected_old = True - - parts = self._cfg.numparts - # Calculate size of each piece - piecesz = clength/parts - - # Calculate size of each piece - pcsizes = [piecesz]*parts - # For last URL add the reminder - pcsizes[-1] += clength % parts - # Create a URL object for each and set range - urlobjects = [] - for x in range(parts): - urlobjects.append(copy.copy(urlobj)) - - prev = 0 - for x in range(parts): - curr = pcsizes[x] - next = curr + prev - urlobject = urlobjects[x] - # Set mirror_url attribute - urlobject.mirror_url = urlobj - urlobject.trymultipart = True - urlobject.clength = clength - urlobject.range = (prev, next-1) - urlobject.mindex = x - prev = next - self._urlThreadPool.push(urlobject) - - # Push this URL objects to the pool - return URL_PUSHED_TO_POOL - - def download_url(self, caller, url): - - no_threads = (not self._cfg.usethreads) or \ - url.parseable() - - data="" - if no_threads: - # This call will block if we exceed the number of connections - url.qstatus = urlparser.URL_QUEUED - conn = objects.connfactory.create_connector() - - # Set status to queued - url.qstatus = urlparser.URL_IN_QUEUE - res = conn.save_url( url ) - - objects.connfactory.remove_connector(conn) - - filename = url.get_full_filename() - if res != CONNECT_NO_ERROR: - filename = url.get_full_filename() - - self.update_file_stats( url, res ) - - if res==DOWNLOAD_YES_OK: - info("Saved",filename) - - if url.is_webpage(): - if self._cfg.datamode==CONNECTOR_DATA_MODE_INMEM: - data = conn.get_data() - elif os.path.isfile(filename): 
- # Need to read data from the file... - data = open(filename, 'rb').read() - - else: - fetchurl = url.get_full_url() - extrainfo( "Failed to download url", fetchurl) - - self._urldb.update(url.index, url) - - else: - # Set status to queued - self.thread_download( url ) - - return data - - def clean_up(self): - """ Purge data for a project by cleaning up - lists, dictionaries and resetting other member items""" - - # Reset byte count - if self._urldb and self._urldb.size: - del self._urldb - if self.collections and self.collections.size: - del self.collections - self.reset() - - def archive_project(self): - """ Archive project files into a tar archive file. - The archive will be further compressed in gz or bz2 - format """ - - extrainfo("Archiving project files...") - # Get project directory - projdir = self._cfg.projdir - # Get archive format - if self._cfg.archformat=='bzip': - format='bz2' - elif self._cfg.archformat=='gzip': - format='gz' - else: - error("Archive Error: Archive format not recognized") - return INVALID_ARCHIVE_FORMAT - - # Create tarfile name - ptarf = os.path.join(self._cfg.basedir, "".join((self._cfg.project,'.tar.',format))) - cwd = os.getcwd() - os.chdir(self._cfg.basedir) - - # Create tarfile object - tf = tarfile.open(ptarf,'w:'+format) - # Projdir base name - pbname = os.path.basename(projdir) - - # Add directories - for item in os.listdir(projdir): - # Skip cache directory, if any - if item=='hm-cache': - continue - # Add directory - fullpath = os.path.join(projdir,item) - if os.path.isdir(fullpath): - tf.add(os.path.join(pbname,item)) - # Dump the tarfile - tf.close() - - os.chdir(cwd) - # Check whether writing was done - if os.path.isfile(ptarf): - extrainfo("Wrote archive file",ptarf) - return FILE_WRITE_OK - else: - error("Error in writing archive file",ptarf) - return FILE_WRITE_ERROR - - def add_headers_to_cache(self): - """ Add original URL headers of urls downloaded - as an entry to the cache file """ - - # Navigate in pre-order, i.e in the order of insertion... - for node in self.collections.preorder(): - coll = node.get() - - # Get list of links for this collection - for urlobjidx in coll.getAllURLs(): - urlobj = self.get_url(urlobjidx) - if urlobj==None: continue - - url = urlobj.get_full_url() - # Get headers - headers = urlobj.get_url_content_info() - - if headers: - content = self.cache._url[url] - if content: - urldict = content[0] - urldict['headers'] = headers - - - def dump_headers(self): - """ Dump the headers of the web pages - downloaded, into a DBM file """ - - # print dbmfile - extrainfo("Writing url headers database") - - headersdict = {} - for node in self.collections.preorder(): - coll = node.get() - - for urlobjidx in coll.getAllURLs(): - urlobj = self.get_url(urlobjidx) - - if urlobj: - url = urlobj.get_full_url() - # Get headers - headers = urlobj.get_url_content_info() - if headers: - headersdict[url] = str(headers) - - cache = utils.HarvestManCacheReaderWriter(self.get_proj_cache_directory()) - return cache.write_url_headers(headersdict) - - def localise_links(self): - """ Localise all links (urls) of the downloaded html pages """ - - # Dont confuse 'localising' with language localization. - # This means just converting the outward (Internet) pointing - # URLs in files to local files. 
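As the comment above says, localising just means rewriting the outward-pointing URLs in a saved page so that they refer to the copies on disk. A toy illustration of the idea (the URL and file name are invented; this is not the HarvestMan code path, which also handles anchors, images and relative/absolute modes as shown below):
```
import re

html = '<a href="http://www.foo.com/bar/about.html">About</a>'
saved = {'http://www.foo.com/bar/about.html': 'about.html'}  # remote URL -> local file

def localise(match):
    url = match.group(1)
    return 'href="%s"' % saved.get(url, url)

print(re.sub(r'href="([^"]+)"', localise, html))
# <a href="about.html">About</a>
```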
- - info('Localising links of downloaded web pages...',) - - count = 0 - localized = [] - - for node in self.collections.preorder(): - coll = node.get() - - sourceurl = self.get_url(coll.getSourceURL()) - childurls = [self.get_url(index) for index in coll.getAllURLs()] - filename = sourceurl.get_full_filename() - - if (not filename in localized) and os.path.exists(filename): - extrainfo('Localizing links for',filename) - if SUCCESS(self.localise_file_links(filename, childurls)): - count += 1 - localized.append(filename) - - info('Localised links of',count,'web pages.') - - def localise_file_links(self, filename, links): - """ Localise links for this file """ - - data='' - - try: - fw=open(filename, 'r+') - data=fw.read() - fw.seek(0) - fw.truncate(0) - except (OSError, IOError),e: - return FILE_TRUNCATE_ERROR - - # Regex1 to replace ( at the end - r1 = re.compile(r'\)+$') - r2 = re.compile(r'\(+$') - - # MOD: Replace any line - basehrefre = re.compile(r'', re.IGNORECASE) - if basehrefre.search(data): - data = re.sub(basehrefre, '', data) - - for u in links: - if not u: continue - - url_object = u - typ = url_object.get_type() - - if url_object.is_image(): - http_str="src" - else: - http_str="href" - - v = url_object.get_original_url() - if v == '/': continue - - # Somehow, some urls seem to have an - # unbalanced parantheses at the end. - # Remove it. Otherwise it will crash - # the regular expressions below. - v = r1.sub('', v) - v2 = r2.sub('', v) - - # Bug fix, dont localize cgi links - if typ != 'base': - if url_object.is_cgi(): - continue - - fullfilename = os.path.abspath( url_object.get_full_filename() ) - urlfilename='' - - # Modification: localisation w.r.t relative pathnames - if self._cfg.localise==2: - urlfilename = url_object.get_relative_filename() - elif self._cfg.localise==1: - urlfilename = fullfilename - - # replace '\\' with '/' - urlfilename = urlfilename.replace('\\','/') - - newurl='' - oldurl='' - - # If we cannot get the filenames, replace - # relative url paths will full url paths so that - # the user can connect to them. 
- if not os.path.exists(fullfilename): - # for relative links, replace it with the - # full url path - fullurlpath = url_object.get_full_url_sans_port() - newurl = "href=\"" + fullurlpath + "\"" - else: - # replace url with urlfilename - if typ == 'anchor': - anchor_part = url_object.get_anchor() - urlfilename = "".join((urlfilename, anchor_part)) - # v = "".join((v, anchor_part)) - - if self._cfg.localise == 1: - newurl= "".join((http_str, "=\"", "file://", urlfilename, "\"")) - else: - newurl= "".join((http_str, "=\"", urlfilename, "\"")) - - else: - newurl="".join((http_str,"=\"","\"")) - - if typ != 'img': - oldurl = "".join((http_str, "=\"", v, "\"")) - try: - oldurlre = re.compile("".join((http_str,'=','\\"?',v,'\\"?'))) - except Exception, e: - debug("Error:",str(e)) - continue - - # Get the location of the link in the file - try: - if oldurl != newurl: - data = re.sub(oldurlre, newurl, data,1) - except Exception, e: - debug("Error:",str(e)) - continue - else: - try: - oldurlre1 = "".join((http_str,'=','\\"?',v,'\\"?')) - oldurlre2 = "".join(('href','=','\\"?',v,'\\"?')) - oldurlre = re.compile("".join(('(',oldurlre1,'|',oldurlre2,')'))) - except Exception, e: - debug("Error:",str(e)) - continue - - http_strs=('href','src') - - for item in http_strs: - try: - oldurl = "".join((item, "=\"", v, "\"")) - if oldurl != newurl: - data = re.sub(oldurlre, newurl, data,1) - except: - pass - - try: - fw.write(data) - fw.close() - except IOError, e: - logconsole(e) - return HARVESTMAN_FAIL - - return HARVESTMAN_OK - - def print_project_info(self, statsd): - """ Print project information """ - - nlinks = statsd['links'] - nservers = statsd['extservers'] + 1 - nfiles = statsd['files'] - ndirs = statsd['extdirs'] + 1 - numfailed = statsd['failed'] - nretried = statsd['retries'] - fatal = statsd['fatal'] - fetchtime = statsd['fetchtime'] - nfilesincache = statsd['filesincache'] - nfilesinrepos = statsd['filesinrepos'] - nbroken = statsd['broken'] - - # Bug fix, download time to be calculated - # precisely... 
- - dnldtime = fetchtime - - strings = [('link', nlinks), ('server', nservers), - ('file', nfiles), ('file', nfilesinrepos), - ('directory', ndirs), ('link', numfailed), ('link', fatal), - ('link', nretried), ('file', nfilesincache), ('link', nbroken) ] - - fns = map(plural, strings) - info(' ') - - bytes = self.bytes - savedbytes = self.savedbytes - - ratespec='KB/sec' - if bytes and dnldtime: - bps = float(bytes/dnldtime)/1024.0 - if bps<1.0: - bps *= 1000.0 - ratespec='bytes/sec' - bps = '%.2f' % bps - else: - bps = '0.0' - - fetchtime = float((math.modf(fetchtime*100.0)[1])/100.0) - - if self._cfg.simulate: - info("HarvestMan crawl simulation of",self._cfg.project,"completed in",fetchtime,"seconds.") - else: - info('HarvestMan mirror',self._cfg.project,'completed in',fetchtime,'seconds.') - - if nlinks: info(nlinks,fns[0],'scanned in',nservers,fns[1],'.') - else: info('No links parsed.') - if nfiles: info(nfiles,fns[2],'written.') - else:info('No file written.') - - if nfilesinrepos: - info(nfilesinrepos,fns[3],wasOrWere(nfilesinrepos),'already uptodate in the repository for this project and',wasOrWere(nfilesinrepos),'not updated.') - if nfilesincache: - info(nfilesincache,fns[8],wasOrWere(nfilesincache),'updated from the project cache.') - - if nbroken: info(nbroken,fns[9],wasOrWere(nbroken),'were broken.') - if fatal: info(fatal,fns[5],'had fatal errors and failed to download.') - if bytes: info(bytes,' bytes received at the rate of',bps,ratespec,'.') - if savedbytes: info(savedbytes,' bytes were written to disk.\n') - - info('*** Log Completed ***\n') - - # get current time stamp - s=time.localtime() - - tz=(time.tzname)[0] - - format='%b %d %Y '+tz+' %H:%M:%S' - tstamp=time.strftime(format, s) - - if not self._cfg.simulate: - # Write statistics to the crawl database - HarvestManDbManager.add_stats_record(statsd) - logconsole('Done.') - - # No longer writing a stats file... - # Write stats to a stats file - #statsfile = self._cfg.project + '.hst' - #statsfile = os.path.abspath(os.path.join(self._cfg.projdir, statsfile)) - #logconsole('Writing stats file ', statsfile , '...') - # Append to files contents - #sf=open(statsfile, 'a') - - # Write url, file count, links count, time taken, - # files per second, failed file count & time stamp - #infostr='url:'+self._cfg.url+',' - #infostr +='files:'+str(nfiles)+',' - #infostr +='links:'+str(nlinks)+',' - #infostr +='dirs:'+str(ndirs)+',' - #infostr +='failed:'+str(numfailed)+',' - #infostr +='refetched:'+str(nretried)+',' - #infostr +='fatal:'+str(fatal)+',' - #infostr +='elapsed:'+str(fetchtime)+',' - #infostr +='fps:'+str(fps)+',' - #infostr +='kbps:'+str(bps)+',' - #infostr +='timestamp:'+tstamp - #infostr +='\n' - - #sf.write(infostr) - #sf.close() - - def dump_urltree(self): - """ Dump url tree to a file """ - - # This creats an html file with - # each url and its children below - # it. Each url is a hyperlink to - # itself on the net if the file - # is an html file. 
- - # urltreefile is /urls.html - urlfile = os.path.join(self._cfg.projdir, 'urltree.html') - - try: - if os.path.exists(urlfile): - os.remove(urlfile) - except OSError, e: - logconsole(e) - - info('Dumping url tree to file', urlfile) - fextn = ((os.path.splitext(urlfile))[1]).lower() - - try: - f=open(urlfile, 'w') - if fextn in ('', '.txt'): - self.dump_urltree_textmode(f) - elif fextn in ('.htm', '.html'): - self.dump_urltree_htmlmode(f) - f.close() - except Exception, e: - logconsole(e) - return DUMP_URL_ERROR - - debug("Done.") - - return DUMP_URL_OK - - def dump_urltree_textmode(self, stream): - """ Dump urls in text mode """ - - for node in self.collections.preorder(): - coll = node.get() - - idx = 0 - links = [self.get_url(index) for index in coll.getAllURLs()] - children = [] - - for link in links: - if not link: continue - - # Get base link, only for first - # child url, since base url will - # be same for all child urls. - if idx==0: - children = [] - base_url = link.get_parent_url().get_full_url() - stream.write(base_url + '\n') - - childurl = link.get_full_url() - if childurl and childurl not in children: - stream.write("".join(('\t',childurl,'\n'))) - children.append(childurl) - - idx += 1 - - - def dump_urltree_htmlmode(self, stream): - """ Dump urls in html mode """ - - # Write html header - stream.write('\n') - stream.write('') - stream.write('Url tree generated by HarvestMan - Project %s' - % self._cfg.project) - stream.write('\n') - - stream.write('\n') - - stream.write('

<p>\n') - stream.write('<ol>\n') - - for node in self.collections.preorder(): - coll = node.get() - - idx = 0 - links = [self.get_url(index) for index in coll.getAllURLs()] - - children = [] - for link in links: - if not link: continue - - # Get base link, only for first - # child url, since base url will - # be same for all child urls. - if idx==0: - children = [] - base_url = link.get_parent_url().get_full_url() - stream.write('<li>') - stream.write("".join(('<a href="',base_url,'">',base_url,'</a>'))) - stream.write('</li>\n') - stream.write('<p>\n') - stream.write('<ul>\n') - - childurl = link.get_full_url() - if childurl and childurl not in children: - stream.write('<li>') - stream.write("".join(('<a href="',childurl,'">',childurl,'</a>'))) - stream.write('</li>\n') - children.append(childurl) - - idx += 1 - - - # Close the child list - stream.write('</ul>\n') - stream.write('</p>\n') - - # Close top level list - stream.write('</ol>\n') - stream.write('</p>

\n') - stream.write('\n') - stream.write('\n') - - def get_url_threadpool(self): - """ Return the URL thread-pool object """ - - return self._urlThreadPool - -class HarvestManController(threading.Thread): - """ A controller class for managing exceptional - conditions such as file limits. Right now this - is written with the sole aim of managing file - & time limits, but could get extended in future - releases. """ - - def __init__(self): - self._dmgr = objects.datamgr - self._tq = objects.queuemgr - self._cfg = objects.config - self._exitflag = False - self._starttime = 0 - threading.Thread.__init__(self, None, None, 'HarvestMan Control Class') - - def run(self): - """ Run in a loop looking for - exceptional conditions """ - - while not self._exitflag: - # Wake up every half second and look - # for exceptional conditions - time.sleep(1.0) - if self._cfg.timelimit != -1: - if self._manage_time_limits()==CONTROLLER_EXIT: - break - if self._cfg.maxfiles: - if self._manage_file_limits()==CONTROLLER_EXIT: - break - if self._cfg.maxbytes: - if self._manage_maxbytes_limits()==CONTROLLER_EXIT: - break - - def stop(self): - """ Stop this thread """ - - self._exitflag = True - - def terminator(self): - """ The function which terminates the program - in case of an exceptional condition """ - - # This somehow got deleted in HarvestMan 1.4.5 - self._tq.endloop(True) - - def _manage_time_limits(self): - """ Manage limits on time for the project """ - - t2=time.time() - - timediff = float((math.modf((t2-self._cfg.starttime)*100.0)[1])/100.0) - timemax = self._cfg.timelimit - - if timediff >= timemax -1: - info('Specified time limit of',timemax ,'seconds reached!') - self.terminator() - return CONTROLLER_EXIT - - return HARVESTMAN_OK - - def _manage_file_limits(self): - """ Manage limits on maximum file count """ - - lsaved = self._dmgr.savedfiles - lmax = self._cfg.maxfiles - - if lsaved >= lmax: - info('Specified file limit of',lmax ,'reached!') - self.terminator() - return CONTROLLER_EXIT - - return HARVESTMAN_OK - - def _manage_maxbytes_limits(self): - """ Manage limits on maximum bytes a crawler should download in total per job. """ - - lsaved = self._dmgr.savedbytes - lmax = self._cfg.maxbytes - - # Let us check for a closer hit of 90%... - if (lsaved >=0.90*lmax): - info('Specified maxbytes limit of',lmax ,'reached!') - self.terminator() - return CONTROLLER_EXIT - - return HARVESTMAN_OK - - diff --git a/HarvestMan-lite/harvestman/lib/db.py b/HarvestMan-lite/harvestman/lib/db.py deleted file mode 100755 index 2235121..0000000 --- a/HarvestMan-lite/harvestman/lib/db.py +++ /dev/null @@ -1,133 +0,0 @@ -# -- coding: utf-8 -""" -db.py - Provides HarvestManDbManager class which takes care -of creating and managing the user's crawl database. The -crawl database is an sqlite database created as -$HOME/.harvestman/db/crawls.db where $HOME is the home -directory of the user. The crawls database is updated with -meta-data of every crawl after a crawl is completed. - -Created by Anand B Pillai Mar 26 2008 - -Copyright (C) 2008 Anand B Pillai. 
- -""" - -import os, sys -import time - -from harvestman.lib.common.common import objects, extrainfo, logconsole - -def adapt_datetime(ts): - return time.mktime(ts.timetuple()) - -class HarvestManDbManager(object): - """ Class performing the creation/management of crawl databases """ - - projid = 0 - - @classmethod - def try_import(cls): - try: - import sqlite3 - return sqlite3 - except ImportError, e: - pass - - @classmethod - def create_user_database(cls): - - sqlite3 = cls.try_import() - - if sqlite3 is None: - return - - logconsole("Creating user's crawl database file in %s..." % objects.config.userdbdir) - - dbfile = os.path.join(objects.config.userdbdir, "crawls.db") - conn = sqlite3.connect(dbfile) - c = conn.cursor() - - # Create table for projects - # This line is causing a problem in darwin - # c.execute("drop table if exists projects") - c.execute("""create table projects (id integer primary key autoincrement default 0, time real, name text, url str, config str)""") - # Create table for project statistics - # We are storing the information for - # 1. number of urls scanned - # 2. number of urls processed (fetched/crawled) - # 3. number of URLs which were crawl-filtered - # 4. number of urls failed to fetch - # 5. number of urls with 404 errors - # 6. number of URLs which hit the cache - # 7. number of servers scanned - # 8. number of unique directories scanned - # 9. number of files saved - # 10. Amount of data fetched in bytes - # 11. the total time for the crawl. - - # This line is causing a problem in darwin - # c.execute("drop table project_stats") - c.execute("""create table project_stats (project_id integer primary key, urls integer, procurls integer, filteredurls integer, failedurls integer, brokenurls integer, cacheurls integer, servers integer, directories integer, files integer, data real, duration text)""") - - c.close() - - @classmethod - def add_project_record(cls): - - sqlite3 = cls.try_import() - if sqlite3 is None: - return - - extrainfo('Writing project record to crawls database...') - dbfile = os.path.join(objects.config.userdbdir, "crawls.db") - - # Get the configuration as a pickled string - cfg = objects.config.copy() - - conn = sqlite3.connect(dbfile) - c = conn.cursor() - c.execute("insert into projects (time, name, url, config) values(?,?,?,?)", - (time.time(),cfg.project,cfg.url, repr(cfg))) - conn.commit() - - # Fetch the most recent project id and save it as projid - c.execute("select max(id) from projects") - cls.projid = c.fetchone()[0] - # print 'project id=>',cls.projid - c.close() - extrainfo("Done.") - - @classmethod - def add_stats_record(cls, statsd): - - sqlite3 = cls.try_import() - if sqlite3 is None: - return - - logconsole('Writing project statistics to crawl database...') - dbfile = os.path.join(objects.config.userdbdir, "crawls.db") - conn = sqlite3.connect(dbfile) - c = conn.cursor() - t = (cls.projid, - statsd['links'], - statsd['processed'], - statsd['filtered'], - statsd['fatal'], - statsd['broken'], - statsd['filesinrepos'], - statsd['extservers'] + 1, - statsd['extdirs'] + 1, - statsd['files'], - statsd['bytes'], - '%.2f' % statsd['fetchtime']) - - c.execute("insert into project_stats values(?,?,?,?,?,?,?,?,?,?,?,?)", t) - conn.commit() - c.close() - pass - -if __name__ == "__main__": - HarvestManDbManager.create_user_database() - pass - diff --git a/HarvestMan-lite/harvestman/lib/document.py b/HarvestMan-lite/harvestman/lib/document.py deleted file mode 100755 index 0f982fc..0000000 --- a/HarvestMan-lite/harvestman/lib/document.py 
+++ /dev/null @@ -1,74 +0,0 @@ -# -- coding: utf-8 -""" -document.py - Provides HarvestManDocument class which provides -an abstraction over a webpage with attributes such as URL, -content, child URLs, HTTP headers, lastmodified value and -other attributes. - -Created by Anand B Pillai Feb 26 2008 - -Copyright (C) 2008 Anand B Pillai. -""" - -import bz2 -from harvestman.lib.common.common import * - -class HarvestManDocument(object): - """ Web document class """ - - def __init__(self, urlobj): - # Store only index for conserving memory - self.urlindex = urlobj.index - # Also, list of children is actually list of - # child indices to save memory... - self.children = [] - self.content = '' - self.content_hash = '' - self.headers = {} - # Only valid for webpages - self.description = '' - # Only valid for webpages - self.keywords = [] - # Only valid for webpages - self.title = '' - self.lastmodified = '' - self.etag = '' - #self.httpstatus = '' - #self.httpreason = '' - self.content_type = '' - self.content_encoding = '' - self.error = None - - def get_url(self): - return objects.datamgr.get_url(self.urlindex) - - def set_url(self, urlobj): - self.urlindex = urlobj.index - - def add_child(self, urlobj): - self.children.append(urlobj.index) - - def get_links(self): - # Links are already "normalized" - return [objects.datamgr.get_url(index) for index in self.children] - - def get_content(self): - return self.content - - def set_content(self, data): - self.content = data - - def get_content_hash(self): - return self.content_hash - - def get_zipped_content(self): - # Return the content, gzipped - pass - - def get_bzipped_content(self): - return bz2.compress(self.content) - - - - - diff --git a/HarvestMan-lite/harvestman/lib/event.py b/HarvestMan-lite/harvestman/lib/event.py deleted file mode 100755 index 86f653e..0000000 --- a/HarvestMan-lite/harvestman/lib/event.py +++ /dev/null @@ -1,63 +0,0 @@ -# -- coding: utf-8 -"""event.py - Module defining an event notification framework -associated with the data flow in HarvestMan. - -Created Anand B Pillai Feb 28 2008 - -Copyright (C) 2008 Anand B Pillai. -""" - -from harvestman.lib.common.common import * -from harvestman.lib.common.singleton import Singleton - -class Event(object): - """ Event class for HarvestMan """ - - def __init__(self): - self.name = '' - self.config = objects.config - self.url = None - self.document = None - -class HarvestManEvent(Singleton): - """ Event manager class for HarvestMan """ - - alias = 'eventmgr' - - def __init__(self): - self.events = {} - - def bind(self, event, funktion, *args): - """ Register for a function 'funktion' to be bound to a certain event. - The return value of the function will be used to determine the behaviour - of the original function which raises the event in cases of events - which are called before the original function bound to the event. For - events which are raised after the original function is called, the - behavior of the original function is not changed """ - - # An event is a string, you can bind only one function to an event - # The function should accept a default first argument which is the - # event object. The event object will provide 4 attributes, namely - # the event name, the url associated to the event (should be valid), - # the document associated to the event (could be None) and the configuration - # object of the system. - self.events[event] = funktion - # print self.events - - def raise_event(self, event, url, document=None, *args, **kwargs): - """ Raise a certain event. 
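Going by the bind docstring above, a handler is an ordinary callable whose first argument is the event object carrying name, url, document and config. A hedged sketch of registering one; 'post_crawl_complete' is an event name raised elsewhere in this code, but the registration call site shown is only an assumption:
```
def on_crawl_complete(event, *args, **kwargs):
    # event exposes .name, .url, .document and .config as described in bind()
    print('event %s raised for %s' % (event.name, event.url))

# Registration would go through the event manager singleton, for example:
# objects.eventmgr.bind('post_crawl_complete', on_crawl_complete)
```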
This automatically calls back on any function - registered for the event and returns the return value of that function. This - is an internal method """ - - try: - funktion = self.events[event] - eventobj = Event() - eventobj.name = event - eventobj.url = url - eventobj.document = document - # Other keyword arguments - return funktion(eventobj, *args, **kwargs) - except KeyError: - pass - - diff --git a/HarvestMan-lite/harvestman/lib/filters.py b/HarvestMan-lite/harvestman/lib/filters.py deleted file mode 100755 index 588a88d..0000000 --- a/HarvestMan-lite/harvestman/lib/filters.py +++ /dev/null @@ -1,788 +0,0 @@ -# -- coding: utf-8 -""" -filters.py - Module which holds class definitions for -classes which define filters for filtering out URLs -and web pages based on regualr expression and other kinds -of filters. - - Author: Anand B Pillai - - Modification History - -------------------- - - Jul 23 2008 Anand Creation - Nov 17 2008 Anand Completed URL filters class implementation - and integrated with HarvestMan. - Jan 13 2009 Anand Added text filter class. Modified - junk filter class to follow the filter - class interface. - - Copyright (C) 2003-2008 Anand B Pillai. - -""" -import re -from harvestman.lib.common.common import * - -class HarvestManBaseFilter(object): - """ Base class for all HarvestMan filter classes """ - - def __init__(self): - self.context = None - - def filter(self, url): - raise NotImplementedError - - def make_regex(self, pattern, casing, flags): - - flag = 0 - if not casing: - flag |= re.IGNORECASE - if flags: - flag |= eval(flags) - - return re.compile(pattern, flag) - -class HarvestManUrlFilter(HarvestManBaseFilter): - """ Filter class for filtering out web pages based on the URL path string """ - - def __init__(self, pathfilters=[], extnfilters=[], regexfilters=[]): - # Filter pattern strings - self.regexfilterpatterns = regexfilters - self.pathfilterpatterns = pathfilters - self.extnfilterpatterns = extnfilters - # Intermediate patterns, dictionaries - # with keys 'include' and 'exclude' - self.regexpatterns = [] - self.pathpatterns = { 'include': [], 'exclude': [] } - self.extnpatterns = { 'include': [], 'exclude': [] } - # Actual filters - self.inclfilters = [] - self.exclfilters = [] - self.compile_filters() - - def parse_filter(self, filterstring): - """ Parse a filter pattern string and return a two - tuple of (, ) pattern string - lists """ - - fstr = filterstring - # First replace any ''' with '' - fstr=fstr.replace("'",'') - # regular expressions to include - include=[] - # regular expressions to exclude - exclude=[] - - index=0 - previndex=-1 - fstr += '+' - for c in fstr: - if c in ('+','-'): - previndex=index - index+=1 - - l=fstr.split('+') - - for s in l: - l2=s.split('-') - for x in xrange(len(l2)): - s=l2[x] - if s=='': continue - if x==0: - include.append(s) - else: - exclude.append(s) - - return (include, exclude) - - def create_filter(self, plainstr, extn=False): - """ Create a python regular expression based on - the list of filter strings provided as input """ - - # Final filter string - fstr = '' - s = plainstr - - # First replace any ''' with '' - s=s.replace("'",'') - # Then we remove the asteriks - s=s.replace('*','.*') - fstr = s - - if extn: - fstr = '\.' + fstr + '$' - - return fstr - - def make_path_filter(self, filterstring, casing=0, flags=''): - """ Creates a URL path filter. A URL path is specified - as an include/exclude filter. 
Wildcards are specified by - using asteriks """ - - include, exclude = self.parse_filter(filterstring) - - for pattern in include: - self.pathpatterns['include'].append((self.create_filter(pattern), casing, flags)) - for pattern in exclude: - self.pathpatterns['exclude'].append((self.create_filter(pattern), casing, flags)) - - def make_extn_filter(self, filterstring, casing=0, flags=''): - """ Creates a file extension filter. A file extension filter - is specified by concatenating file extensions with a + or - in - front of them to specify include/exclude respectively """ - - include, exclude = self.parse_filter(filterstring) - - for pattern in include: - self.extnpatterns['include'].append((self.create_filter(pattern, True), casing, flags)) - for pattern in exclude: - self.extnpatterns['exclude'].append((self.create_filter(pattern, True), casing, flags)) - - def make_regex_filter(self, filterstring, casing=0, flags=''): - """ Creates a regular expression filter. This is nothing but a Python - regular expression string which is compiled directly into a regex """ - - # Direct use - no processing required - self.regexpatterns.append((filterstring, casing, flags)) - - def compile_filters(self): - """ Compile all filter strings and create regular - expression objects """ - - for pattern, casing, flags in self.pathfilterpatterns: - self.make_path_filter(pattern, casing, flags) - - for pattern, casing, flags in self.extnfilterpatterns: - self.make_extn_filter(pattern, casing, flags) - - for pattern, casing, flags in self.regexfilterpatterns: - self.make_regex_filter(pattern, casing, flags) - - # Now, compile each to regular expressions and - # append to include & exclude regex filter list - for urlfilter in self.pathpatterns['include'] + self.extnpatterns['include']: - regexp = self.make_regex(urlfilter[0], urlfilter[1], urlfilter[2]) - self.inclfilters.append(regexp) - - for urlfilter in self.pathpatterns['exclude'] + self.extnpatterns['exclude']: - regexp = self.make_regex(urlfilter[0], urlfilter[1], urlfilter[2]) - self.exclfilters.append(regexp) - - for urlfilter in self.regexpatterns: - regexp = self.make_regex(urlfilter[0], urlfilter[1], urlfilter[2]) - self.exclfilters.append(regexp) - - def filter(self, urlobj): - """ Apply all URL filters on the passed URL object 'urlobj'. - Return True if filtered and False if not filtered """ - - # The logic of this is simple - The URL is checked - # against all inclusion filters first, if any. If - # anything matches, then we don't do exclusion filter - # check since inclusion (+) has preference over exclusion (-) - # In that case, False is returned. - - # Otherwise, the URL is checked against all exclusion - # filters and if any match, True is returned. - - # Finally, if none match, False is returned. 
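The precedence spelled out in the comments above (an inclusion match wins, otherwise any exclusion match filters the URL, otherwise it passes) boils down to a few lines. The patterns below are invented for the example and are not HarvestMan's filter-string syntax:
```
import re

include = [re.compile(r'\.html$', re.IGNORECASE)]
exclude = [re.compile(r'/ads/')]

def is_filtered(url):
    if any(f.search(url) for f in include):
        return False   # inclusion (+) has preference over exclusion (-)
    if any(f.search(url) for f in exclude):
        return True    # matched an exclusion filter
    return False       # matched nothing - not filtered

print(is_filtered('http://foo.com/ads/page.html'))   # False: include wins
print(is_filtered('http://foo.com/ads/banner.swf'))  # True: excluded
```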
- - url = urlobj.get_full_url() - matchincl, matchexcl = False, False - - for urlfilter in self.inclfilters: - m = urlfilter.search(url) - if m: - debug("Inclusion filter for URL %s found", url) - matchincl = True - break - - if matchincl: - return False - - for urlfilter in self.exclfilters: - m = urlfilter.search(url) - if m: - debug("Exclusion filter for URL %s found", url) - matchexcl = True - break - - if matchexcl: - return True - - return False - -class HarvestManTextFilter(HarvestManBaseFilter): - """ Filter class for filtering out web pages based on the URL path string """ - - def __init__(self, contentfilters=[], metafilters=[]): - # Filter pattern strings - self.contentpatterns = contentfilters - self.metapatterns = metafilters - # print 'Content=>',self.contentpatterns - # print 'Meta=>',self.metapatterns - # Actual filters - # Text filters are always exclude filters, so - # no need of separate include & exclude keys - self.contentfilter = [] - # Meta filters - self.keywordfilter = [] - self.titlefilter = [] - self.descfilter = [] - # Parse and compile the filters - self.compile_filters() - - def compile_filters(self): - - # Content filter is straight forward - for pattern, casing, flags in self.contentpatterns: - self.contentfilter.append(self.make_regex(pattern, casing, flags)) - - # Some pre-processing is involved in meta-filters - for pattern,casing,flags,tags in self.metapatterns: - regex = self.make_regex(pattern, casing, flags) - if tags=='all': - # Append to all filters - self.keywordfilter.append(regex) - self.titlefilter.append(regex) - self.descfilter.append(regex) - else: - # Split and see which all tags are specified - tagslist = tags.split(',') - if 'title' in tagslist: - self.titlefilter.append(regex) - if 'keywords' in tagslist: - self.keywordfilter.append(regex) - if 'description' in tagslist: - self.descfilter.append(regex) - - - def filter(self, urldoc, urlobj): - """ Apply all URL filters on the passed URL document object - Return True if filtered and False if not filtered """ - - filterurl = False - - # Apply content filter - for cfilter in self.contentfilter: - m = cfilter.search(urldoc.content) - if m: - debug("Content filter for URL %s found" % urlobj) - self.context='Content' - return True - - # Apply meta filters - for tfilter in self.titlefilter: - m = tfilter.search(urldoc.title) - if m: - debug("Title filter for URL %s found" % urlobj) - self.context='Title' - return True - - for dfilter in self.descfilter: - m = dfilter.search(urldoc.description) - if m: - debug("Description filter for URL %s found" % urlobj) - self.context='Description' - return True - - for kfilter in self.keywordfilter: - matches = [kfilter.search(keyword) for keyword in urldoc.keywords] - if len(matches): - debug("Keyword filter for URL %s found" % urlobj) - self.context='Keyword' - return True - - return False - -class HarvestManJunkFilter(HarvestManBaseFilter): - """ Junk filter class. Filter out junk urls such - as ads, banners, flash files etc """ - - # Domain specific blocking - List courtesy - # junkbuster proxy. 
- block_domains =[ '1ad.prolinks.de', - '1st-fuss.com', - '247media.com', - 'admaximize.com', - 'adbureau.net', - 'adsolution.de', - 'adwisdom.com', - 'advertising.com', - 'atwola.com', - 'aladin.de', - 'annonce.insite.dk', - 'a.tribalfusion.com', - 'avenuea.com', - 'bannercommunity.de', - 'banerswap.com', - 'bizad.nikkeibp.co.jp', - 'bluestreak.com', - 'bs.gsanet.com', - 'cash-for-clicks.de', - 'cashformel.com', - 'cash4banner.de', - 'cgi.tietovalta.fi', - 'cgicounter.puretec.de', - 'click-fr.com', - 'click.egroups.com', - 'commonwealth.riddler.com', - 'comtrack.comclick.com', - 'customad.cnn.com', - 'cybereps.com:8000', - 'cyberclick.net', - 'dino.mainz.ibm.de', - 'dinoadserver1.roka.net', - 'disneystoreaffiliates.com', - 'dn.adzerver.com', - 'doubleclick.net', - 'ds.austriaonline.at', - 'einets.com', - 'emap.admedia.net', - 'eu-adcenter.net', - 'eurosponser.de', - 'fastcounter.linkexchange.com', - 'findcommerce.com', - 'flycast.com', - 'focalink.com', - 'fp.buy.com', - 'globaltrack.com', - 'globaltrak.net', - 'gsanet.com', - 'hitbox.com', - 'hurra.de', - 'hyperbanner.net', - 'iadnet.com', - 'image.click2net.com', - 'image.linkexchange.com', - 'imageserv.adtech.de', - 'imagine-inc.com', - 'img.getstats.com', - 'img.web.de', - 'imgis.com', - 'james.adbutler.de', - 'jmcms.cydoor.com', - 'leader.linkexchange.com', - 'linkexchange.com', - 'link4ads.com', - 'link4link.com', - 'linktrader.com', - 'media.fastclick.net', - 'media.interadnet.com', - 'media.priceline.com', - 'mediaplex.com', - 'members.sexroulette.com', - 'newsads.cmpnet.com', - 'ngadcenter.net', - 'nol.at:81', - 'nrsite.com', - 'offers.egroups.com', - 'omdispatch.co.uk', - 'orientserve.com', - 'pagecount.com', - 'preferences.com', - 'promotions.yahoo.com', - 'pub.chez.com', - 'pub.nomade.fr', - 'qa.ecoupons.com', - 'qkimg.net', - 'resource-marketing.com', - 'revenue.infi.net', - 'sam.songline.com', - 'sally.songline.com', - 'sextracker.com', - 'smartage.com', - 'smartclicks.com', - 'spinbox1.filez.com', - 'spinbox.versiontracker.com', - 'stat.onestat.com', - 'stats.surfaid.ihost.com', - 'stats.webtrendslive.com', - 'swiftad.com', - 'tm.intervu.net', - 'tracker.tradedoubler.com', - 'ultra.multimania.com', - 'ultra1.socomm.net', - 'uproar.com', - 'usads.imdb.com', - 'valueclick.com', - 'valueclick.net', - 'victory.cnn.com', - 'videoserver.kpix.com', - 'view.atdmt.com', - 'webcounter.goweb.de', - 'websitesponser.de', - 'werbung.guj.de', - 'wvolante.com', - 'www.ad-up.com', - 'www.adclub.net', - 'www.americanpassage.com', - 'www.bannerland.de', - 'www.bannermania.nom.pl', - 'www.bizlink.ru', - 'www.cash4banner.com', - 'www.clickagents.com', - 'www.clickthrough.ca', - 'www.commision-junction.com', - 'www.eads.com', - 'www.flashbanner.no', - 'www.mediashower.com', - 'www.popupad.net', - 'www.smartadserver.com', - 'www.smartclicks.com:81', - 'www.spinbox.com', - 'www.sponsorpool.net', - 'www.ugo.net', - 'www.valueclick.com', - 'www.virtual-hideout.net', - 'www.web-stat.com', - 'www.webpeep.com', - 'www.zserver.com', - 'www3.exn.net:80', - 'xb.xoom.com', - 'yimg.com' ] - - # Common block patterns. These are created - # in the Python regular expression syntax. - # Original list courtesy junkbuster proxy. 
- block_patterns = [ r'/*.*/(.*[-_.])?ads?[0-9]?(/|[-_.].*|\.(gif|jpe?g))', - r'/*.*/(.*[-_.])?count(er)?(\.cgi|\.dll|\.exe|[?/])', - r'/*.*/(.*[-_.].*)?maino(kset|nta|s).*(/|\.(gif|html?|jpe?g|png))', - r'/*.*/(ilm(oitus)?|kampanja)(hallinta|kuvat?)(/|\.(gif|html?|jpe?g|png))', - r'/*.*/(ng)?adclient\.cgi', - r'/*.*/(plain|live|rotate)[-_.]?ads?/', - r'/*.*/(sponsor|banner)s?[0-9]?/', - r'/*.*/*preferences.com*', - r'/*.*/.*banner([-_]?[a-z0-9]+)?\.(gif|jpg)', - r'/*.*/.*bannr\.gif', - r'/*.*/.*counter\.pl', - r'/*.*/.*pb_ihtml\.gif', - r'/*.*/Advertenties/', - r'/*.*/Image/BannerAdvertising/', - r'/*.*/[?]adserv', - r'/*.*/_?(plain|live)?ads?(-banners)?/', - r'/*.*/abanners/', - r'/*.*/ad(sdna_image|gifs?)/', - r'/*.*/ad(server|stream|juggler)\.(cgi|pl|dll|exe)', - r'/*.*/adbanner*', - r'/*.*/adfinity', - r'/*.*/adgraphic*', - r'/*.*/adimg/', - r'/*.*/adjuggler', - r'/*.*/adlib/server\.cgi', - r'/*.*/ads\\', - r'/*.*/adserver', - r'/*.*/adstream\.cgi', - r'/*.*/adv((er)?ts?|ertis(ing|ements?))?/', - r'/*.*/advanbar\.(gif|jpg)', - r'/*.*/advanbtn\.(gif|jpg)', - r'/*.*/advantage\.(gif|jpg)', - r'/*.*/amazon([a-zA-Z0-9]+)\.(gif|jpg)', - r'/*.*/ana2ad\.gif', - r'/*.*/anzei(gen)?/?', - r'/*.*/ban[-_]cgi/', - r'/*.*/banner_?ads/', - r'/*.*/banner_?anzeigen', - r'/*.*/bannerimage/', - r'/*.*/banners?/', - r'/*.*/banners?\.cgi/', - r'/*.*/bizgrphx/', - r'/*.*/biznetsmall\.(gif|jpg)', - r'/*.*/bnlogo.(gif|jpg)', - r'/*.*/buynow([a-zA-Z0-9]+)\.(gif|jpg)', - r'/*.*/cgi-bin/centralad/getimage', - r'/*.*/drwebster.gif', - r'/*.*/epipo\.(gif|jpg)', - r'/*.*/gsa_bs/gsa_bs.cmdl', - r'/*.*/images/addver\.gif', - r'/*.*/images/advert\.gif', - r'/*.*/images/marketing/.*\.(gif|jpe?g)', - r'/*.*/images/na/us/brand/', - r'/*.*/images/topics/topicgimp\.gif', - r'/*.*/phpAds/phpads.php', - r'/*.*/phpAds/viewbanner.php', - r'/*.*/place-ads', - r'/*.*/popupads/', - r'/*.*/promobar.*', - r'/*.*/publicite/', - r'/*.*/randomads/.*\.(gif|jpe?g)', - r'/*.*/reklaam/.*\.(gif|jpe?g)', - r'/*.*/reklama/.*\.(gif|jpe?g)', - r'/*.*/reklame/.*\.(gif|jpe?g)', - r'/*.*/servfu.pl', - r'/*.*/siteads/', - r'/*.*/smallad2\.gif', - r'/*.*/spin_html/', - r'/*.*/sponsor.*\.gif', - r'/*.*/sponsors?[0-9]?/', - r'/*.*/ucbandeimg/', - r'/*.*/utopiad\.(gif|jpg)', - r'/*.*/werb\..*', - r'/*.*/werbebanner/', - r'/*.*/werbung/.*\.(gif|jpe?g)', - r'/*ad.*.doubleclick.net', - r'/.*(ms)?backoff(ice)?.*\.(gif|jpe?g)', - r'/.*./Adverteerders/', - r'/.*/?FPCreated\.gif', - r'/.*/?va_banner.html', - r'/.*/adv\.', - r'/.*/advert[0-9]+\.jpg', - r'/.*/favicon\.ico', - r'/.*/ie_?(buttonlogo|static?|anim.*)?\.(gif|jpe?g)', - r'/.*/ie_horiz\.gif', - r'/.*/ie_logo\.gif', - r'/.*/ns4\.gif', - r'/.*/opera13\.gif', - r'/.*/opera35\.gif', - r'/.*/opera_b\.gif', - r'/.*/v3sban\.gif', - r'/.*Ad00\.gif', - r'/.*activex.*(gif|jpe?g)', - r'/.*add_active\.gif', - r'/.*addchannel\.gif', - r'/.*adddesktop\.gif', - r'/.*bann\.gif', - r'/.*barnes_logo\.gif', - r'/.*book.search\.gif', - r'/.*by/main\.gif', - r'/.*cnnpostopinionhome.\.gif', - r'/.*cnnstore\.gif', - r'/.*custom_feature\.gif', - r'/.*exc_ms\.gif', - r'/.*explore.anim.*gif', - r'/.*explorer?.(gif|jpe?g)', - r'/.*freeie\.(gif|jpe?g)', - r'/.*gutter117\.gif', - r'/.*ie4_animated\.gif', - r'/.*ie4get_animated\.gif', - r'/.*ie_sm\.(gif|jpe?g)', - r'/.*ieget\.gif', - r'/.*images/cnnfn_infoseek\.gif', - r'/.*images/pathfinder_btn2\.gif', - r'/.*img/gen/fosz_front_em_abc\.gif', - r'/.*img/promos/bnsearch\.gif', - r'/.*infoseek\.gif', - r'/.*logo_msnhm_*', - r'/.*mcsp2\.gif', - r'/.*microdell\.gif', - 
r'/.*msie(30)?\.(gif|jpe?g)', - r'/.*msn2\.gif', - r'/.*msnlogo\.(gif|jpe?g)', - r'/.*n_iemap\.gif', - r'/.*n_msnmap\.gif', - r'/.*navbars/nav_partner_logos\.gif', - r'/.*nbclogo\.gif', - r'/.*office97_ad1\.(gif|jpe?g)', - r'/.*pathnet.warner\.gif', - r'/.*pbbobansm\.(gif|jpe?g)', - r'/.*powrbybo\.(gif|jpe?g)', - r'/.*s_msn\.gif', - r'/.*secureit\.gif', - r'/.*sqlbans\.(gif|jpe?g)', - r'/BannerImages/' - r'/BarnesandNoble/images/bn.recommend.box.*', - r'/Media/Images/Adds/', - r'/SmartBanner/', - r'/US/AD/', - r'/_banner/', - r'/ad[-_]container/', - r'/adcycle.cgi', - r'/adcycle/', - r'/adgenius/', - r'/adimages/', - r'/adproof/', - r'/adserve/', - r'/affiliate_banners/', - r'/annonser?/', - r'/anz/pics/', - r'/autoads/', - r'/av/gifs/av_logo\.gif', - r'/av/gifs/av_map\.gif', - r'/av/gifs/new/ns\.gif', - r'/bando/', - r'/bannerad/', - r'/bannerfarm/', - r'/bin/getimage.cgi/...\?AD', - r'/cgi-bin/centralad/', - r'/cgi-bin/getimage.cgi/....\?GROUP=', - r'/cgi-bin/nph-adclick.exe/', - r'/cgi-bin/nph-load', - r'/cgi-bin/webad.dll/ad', - r'/cgi/banners.cgi', - r'/cwmail/acc\.gif', - r'/cwmail/amzn-bm1\.gif', - r'/db_area/banrgifs/', - r'/digitaljam/images/digital_ban\.gif', - r'/free2try/', - r'/gfx/bannerdir/', - r'/gif/buttons/banner_.*', - r'/gif/buttons/cd_shop_.*', - r'/gif/cd_shop/cd_shop_ani_.*', - r'/gif/teasere/', - r'/grafikk/annonse/', - r'/graphics/advert', - r'/graphics/defaultAd/', - r'/grf/annonif', - r'/hotstories/companies/images/companies_banner\.gif', - r'/htmlad/', - r'/image\.ng/AdType', - r'/image\.ng/transactionID', - r'/images/.*/.*_anim\.gif', - r'/images/adds/', - r'/images/getareal2\.gif', - r'/images/locallogo.gif', - r'/img/special/chatpromo\.gif', - r'/include/watermark/v2/', - r'/ip_img/.*\.(gif|jpe?g)', - r'/ltbs/cgi-bin/click.cgi', - r'/marketpl*/', - r'/markets/images/markets_banner\.gif', - r'/minibanners/', - r'/ows-img/bnoble\.gif', - r'/ows-img/nb_Infoseek\.gif', - r'/p/d/publicid', - r'/pics/amzn-b5\.gif', - r'/pics/getareal1\.gif', - r'/pics/gotlx1\.gif', - r'/promotions/', - r'/rotads/', - r'/rotations/', - r'/torget/jobline/.*\.gif' - r'/viewad/' - r'/we_ba/', - r'/werbung/', - r'/world-banners/', - r'/worldnet/ad\.cgi', - r'/zhp/auktion/img/' ] - - - def __init__(self): - self.msg = '' - self.match = '' - # Compile pattern list for performance - self.patterns = map(re.compile, self.block_patterns) - # Create base domains list from domains list - self.base_domains = map(self.base_domain, self.block_domains) - - def reset_msg(self): - self.msg = '' - - def reset_match(self): - self.msg = '' - - def filter(self, urlobj): - """ Apply Junk filter on the passed URL object. Return True - if filtered and False if not filtered """ - - self.reset_msg() - self.reset_match() - - # Check domain first - ret = self._check_domain(urlobj) - if ret: - return ret - - # Check pattern next - return self._check_pattern(urlobj) - - def base_domain(self, domain): - - if domain.count(".") > 1: - strings = domain.split(".") - return "".join((strings[-2], strings[-1])) - else: - return domain - - def _check_domain(self, url_obj): - """ Check whether the url belongs to a junk - domain. 
Return true if url is O.K (NOT a junk - domain) and False otherwise """ - - # Get base server of the domain with port - base_domain_port = url_obj.get_base_domain_with_port() - # Get domain with port - domain_port = url_obj.get_domain_with_port() - - # First check for domain - if domain_port in self.block_domains: - self.msg = '' - return True - # Then check for base domain - else: - if base_domain_port in self.base_domains: - self.msg = '' - return True - - return False - - def _check_pattern(self, url_obj): - """ Check whether the url matches a junk pattern. - Return true if url is O.K (not a junk pattern) and - false otherwise """ - - url = url_obj.get_full_url() - - indx=0 - for p in self.patterns: - # Do a search, not match - if p.search(url): - self.msg = '' - self.match = self.block_patterns[indx] - return True - - indx += 1 - - return False - - def get_error_msg(self): - return self.msg - - def get_match(self): - return self.match - -if __name__=="__main__": - import urlparser - - # Test filter class - filter = HarvestManJunkFilter() - - # Violates, should return False - # The first two are direct domain matches, the - # next two are base domain matches. - u = urlparser.HarvestManUrl("http://a.tribalfusion.com/images/1.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - u = urlparser.HarvestManUrl("http://stats.webtrendslive.com/cgi-bin/stats.pl") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - u = urlparser.HarvestManUrl("http://stats.cyberclick.net/cgi-bin/stats.pl") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - u = urlparser.HarvestManUrl("http://m.doubleclick.net/images/anim.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - - # The next are pattern matches - u = urlparser.HarvestManUrl("http://www.foo.com/popupads/ad.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://www.foo.com/htmlad/1.html") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://www.foo.com/logos/nbclogo.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://www.foo.com/bar/siteads/1.ad") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://www.foo.com/banners/world-banners/banner.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - u = urlparser.HarvestManUrl("http://ads.foo.com/") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - print '\tMatch=>',filter.get_match() - - - # This one should not match - u = urlparser.HarvestManUrl("http://www.foo.com/doc/logo.gif") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - # This also... - u = urlparser.HarvestManUrl("http://www.foo.org/bar/vodka/pattern.html") - print filter.filter(u),filter.get_error_msg(),'=>',u.get_full_url() - diff --git a/HarvestMan-lite/harvestman/lib/gui.py b/HarvestMan-lite/harvestman/lib/gui.py deleted file mode 100755 index 16b0e88..0000000 --- a/HarvestMan-lite/harvestman/lib/gui.py +++ /dev/null @@ -1,664 +0,0 @@ -""" -gui.py - Module which provides a browser based UI -mode to HarvestMan using web.py. This module is part -of the HarvestMan program. 
- -Created Anand B Pillai Jun 01 2008 - -Copyright (C) 2008, Anand B Pillai. -""" - -import sys, os -import web -import webbrowser -import time - -from web import form, net #, request - -def get_templates_location(): - # Templates are located at harvestman/ui/templates folder... - top = os.path.dirname(os.path.dirname(os.path.abspath(globals()['__file__']))) - template_dir = os.path.join(top, 'ui','templates') - return template_dir - -# Global render object -g_render = web.template.render(get_templates_location()) - -CONFIG_HTML_TEMPLATE="""\ -HarvestMan Configuration File Generator -%s - - -%s - - -""" - -PLUG_TEMPLATE="""\ - -""" - -PLUGINS_TEMPLATE="""\ - - %s - -""" - - -CONFIG_XML_TEMPLATE="""\ - - - - - - - - - - - %(url)s - %(projname)s - - %(basedir)s - - - - - - - %(proxy)s - %(puser)s - %(ppasswd)s - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - %(urlpriority)s - %(serverpriority)s - - - %(urlfilter)s - %(serverfilter)s - %(wordfilter)s - - - %(PLUGIN)s - - - - - - - - - - - - - - - - - - - - - - - - - - - %(urltreefile)s - - - - - - - - - - - - -""" - -def render_stylesheet(): - css = """\ - - """ - - return css - - -def render_tabs(): - - content ="""\ - HarvestMan Web Console - - - - - - - - -

[Tag-stripped HTML omitted in this export. The recoverable content: the template body builds the "HarvestMan Web Console" page with four tabs and their menu entries:
    Configuration  - User configuration, System configuration, New configuration
    Projects       - Project History, Current Project, New Project
    Documentation  - Release Notes, Change History, API Documentation, HOWTOs & Tutorials
    About          - "HarvestMan - Web crawling application/framework written in pure Python", plus a "HarvestMan on the Web" link]
- - - """ - - return content - - - - -############## Start web.py custom widgets #################################################### - -class SizedTextbox(form.Textbox): - """ A GUI class for a textbox which accepts an argument for - its size """ - - def __init__(self, name, size, title='', *validators, **attrs): - super(SizedTextbox, self).__init__(name, *validators, **attrs) - self.size = size - self.val = self.value - self.title = title - - def render(self): - x = 'No messages, 5=>Maximum messages", - [0,1,2,3,4,5], value=2), - Label("Network Configuration", True), - SizedTextbox("Proxy Server", 50, "Proxy server address for your network, if any"), - SizedTextbox("Proxy Server Port",10, "Port number for the proxy server", - value=80), - SizedTextbox("Proxy Server Username", 20, - "Username for authenticating the proxy server (leave blank for unauthenticated proxies)"), - SizedTextbox("Proxy Server Password", 20, - "Password for authenticating the proxy server (leave blank for unauthenticated proxies)"), - Label("Download Types/Caching/Protocol Configuration", True), - MyDropbox("HTML", 'Save HTML pages ?', ["Yes","No"]), - MyDropbox("Images",'Save images in pages ?',["Yes","No"]), - MyDropbox("Video",'Save video URLs (movies) ?',["No","Yes"]), - MyDropbox("Flash",'Save Adobe Flash URLs ?',["No","Yes"]), - MyDropbox("Audio",'Save audio URLs (sounds) ?',["No","Yes"]), - MyDropbox("Documents",'Save Microsoft Office, Openoffice, PDF and Postscript files ?', - ["Yes","No"]), - MyDropbox("Javascript",'Save server-side javascript URLs ?',["Yes","No"]), - MyDropbox("Javaapplet",'Save java applet class files ?',["Yes","No"]), - MyDropbox("Query Links",'Save links of the form "http://www.foo.com/query?param=val" ?', - ["Yes","No"]), - MyDropbox("Caching",'Enable URL caching in HarvestMan ?', - ["Yes","No"]), - MyDropbox("Data Caching",'Enable caching of URL data in the cache (requires more space) ?', - ["No","Yes"]), - MyDropbox("HTTP Compression",'Accept gzip compressed data from web servers ?', - ["Yes","No"]), - SizedTextbox("Retry Attempts", 10, - 'Number of additional download tries for URLs which produce errors', - value=1), - Label("Download Limits/Extent Configuration", True), - MyDropbox("Fetch Level", - 'Fetch level for the crawl (see FAQ)',[0,1,2,3,4]), - MyDropbox("Crawl Sub-domains", - 'Crawls "http://bar.foo.com" when starting URL belongs to "http://foo.com"', - ["No","Yes"]), - SizedTextbox("Maximum Files Limit",10, - 'Stops crawl when number of files downloaded reaches this limit', - value=5000), - SizedTextbox("Maximum File Size Limit",10, - 'Ignore URLs whose size is larger than this limit', - value=5242880), - SizedTextbox("Maximum Connections Limit",10, - 'Maximum number of simultaneously open HTTP connections', - value=5), - SizedTextbox("Maximum Bandwidth Limit(kb)",10, - 'Maximum number of bandwidth used for given HTTP connections', - value=0), - SizedTextbox("Crawl Time Limit",10, - 'Stops crawl after the crawl duration reaches this limit', - value=-1), - Label("Download Rules/Filters Configuration", True), - MyDropbox("Robots Rules", - 'Obey robots.txt and META ROBOTS rules ?', - ["Yes","No"]), - SizedTextbox("URL Filter String",100,'A filter string for URLs (see FAQ)'), - # SizedTextbox("Server Filter String",100, 'A filter string for servers (see FAQ)'), - SizedTextbox("Word Filter String",100, - 'A generic word filter based on regular expressions to filter web pages'), - MyDropbox("JunkFilter",'Enable the advertisement/banner/other junk URL filter ?', - ["Yes","No"]), 
- Label("Download Plugins Configuration", True), - Label("Add up-to 5 valid plugins in the boxes below",italic=True), - SizedTextbox("Plugin 1",20,'Enter the name of your plugin module here, without the .py* suffix'), - SizedTextbox("Plugin 2",20,'Enter the name of your plugin module here, without the .py* suffix'), - SizedTextbox("Plugin 3",20,'Enter the name of your plugin module here, without the .py* suffix'), - SizedTextbox("Plugin 4",20,'Enter the name of your plugin module here, without the .py* suffix'), - SizedTextbox("Plugin 5",20,'Enter the name of your plugin module here, without the .py* suffix'), - Label("Files Configuration", True), - SizedTextbox("Url Tree File", 20, - 'A filename which will capture parent/child relationship of all processed URLs', - value=''), - MyDropbox("Archive Saved Files", 'Archive all saved files to a single tar archive file ?', - ["No","Yes"]), - MyDropbox("Archive Format",'Archive format (tar.bz2 or tar.gz)',["bzip","gzip"]), - MyDropbox("Serialize URL Headers",'Serialize all URL headers to a file (urlheaders.db) ?', - ["Yes","No"]), - MyDropbox("Localise Links",'Convert outward (web) pointing links to disk pointing links ?', - ["No","Yes"]), - Label("Misc Configuration", True), - MyDropbox("Create Project Browse Page",'Create an HTML page which summarizes all crawled projects ?', - ["No","Yes"]), - Label("Advanced Configuration Settings", True), - Label('These are configuration parameters which are useful only for advanced tweaking. Most users can ignore the following settings and use the defaults',italic=True), - Label("Download Limits/Extent/Filters/Rules Configuration", True, True), - MyDropbox("Fetch Image Links Always", - 'Ignore download rules when fetching images ?',["Yes","No"]), - MyDropbox("Fetch Stylesheet Links Always", - 'Ignore download rules when fetching stylesheets ?',["Yes","No"]), - SizedTextbox("Links Offset Start", 10, - 'Offset of child links measured from zero (useful for crawling web directories)', - value=0), - SizedTextbox("Links Offset End", 10, - 'Offset of child links measured from end (useful for crawling web directories)', - value=-1), - MyDropbox("URL Depth", 'Maximum depth of a URL in relation to the starting URL', - [10,9,8,7,6,5,4,3,2,1,0]), - MyDropbox("External URL Depth", - 'Maximum depth of an external URL in relation to its server root (useful for only fetchlevels >1)', - [0,1,2,3,4,5,6,7,8,9,10]), - MyDropbox("Ignore TLDs (Top level domains)", - 'Consider http://foo.com and http://foo.org as the same server (dangerous)', - ["No","Yes"]), - SizedTextbox("URL Priority String",100,'A priority string for URLs (see FAQ)'), - # SizedTextbox("Server Priority String",100, 'A priority string for servers (see FAQ)'), - Label("Parser Configuration", True, True), - - Label("Enable/Disable parsing of the tags shown below",italic=True), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag
", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag ", 'Enable parsing of tags ?',["Yes","No"]), - MyDropbox("Tag