From e374f1713a7f94e1e16d81c5bb98608725d629c1 Mon Sep 17 00:00:00 2001
From: Google Code Exporter
Date: Sat, 14 Mar 2015 01:43:10 -0400
Subject: [PATCH] Migrating wiki contents from Google Code

---
 AboutHarvestMan.md         |   33 ++
 ConfigXml.md               |  108 ++++
 FAQ.md                     | 1068 ++++++++++++++++++++++++++++++++++++
 FAQ_NEW.md                 |    2 +
 HarvestMan.md              |    0
 InstallHarvestMan.md       |   62 +++
 NewDevelopersNotes.md      |   30 +
 ProjectHome.md             |    3 +
 UsingHarvestMan.md         |   99 ++++
 WorkAroundHttpForbidden.md |   46 ++
 WorldsSimplestCrawler.md   |  177 ++++++
 WritingCustomCrawlers.md   |  196 +++++++
 bot.md                     |    5 +
 13 files changed, 1829 insertions(+)
 create mode 100644 AboutHarvestMan.md
 create mode 100644 ConfigXml.md
 create mode 100644 FAQ.md
 create mode 100644 FAQ_NEW.md
 create mode 100644 HarvestMan.md
 create mode 100644 InstallHarvestMan.md
 create mode 100644 NewDevelopersNotes.md
 create mode 100644 ProjectHome.md
 create mode 100644 UsingHarvestMan.md
 create mode 100644 WorkAroundHttpForbidden.md
 create mode 100644 WorldsSimplestCrawler.md
 create mode 100644 WritingCustomCrawlers.md
 create mode 100644 bot.md

diff --git a/AboutHarvestMan.md b/AboutHarvestMan.md
new file mode 100644
index 0000000..aa91205
--- /dev/null
+++ b/AboutHarvestMan.md
@@ -0,0 +1,33 @@
# What is HarvestMan #

HarvestMan is an open source, multi-threaded, modular, extensible web crawler program/framework in pure Python.

HarvestMan can be used to download files from websites according to a number of user-specified rules. The latest version of HarvestMan supports more than 60 customization options. HarvestMan is a console (command-line) application.

HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.

# History of HarvestMan #
 1. The HarvestMan crawler was started by Anand B Pillai in June 2003 as a hobby project to develop a personal web crawler in Python, along with Nirmal Chidambaram.
 1. Nirmal wrote the original code in mid June 2003 (one module, a single-threaded crawler), which Anand improved substantially and developed into a multithreaded crawler.
 1. The first version (0.8) was released by Anand in July 2003.
 1. Released on [freshmeat](http://www.freshmeat.net/projects/harvestman) (1.3) in Dec 2003.
 1. Eight releases were done between Dec 2003 (1.3) and Dec 2004 (1.4).
 1. The project was chosen as the crawler for the [EIAO](http://www.eiao.net) web accessibility observatory in Feb 2005. EIAO chose version 1.4, which then underwent several minor releases.
 1. The most recent release is 1.4.6, released in Sep 2005.
 1. Since early 2006, HarvestMan has been undergoing development along with the EIAO project (mostly driven by EIAO feedback), but no public releases have been done.
 1. Version 1.5 started development in mid 2006, but was never released.
 1. Version 1.4.6 was accepted into Debian in March 2006.
 1. By mid 2007 the program had accumulated so many changes that the version number under development was incremented from 1.5 to 2.0. Version 2.0 has effectively been under development since mid 2006, but most code changes happened after mid 2007.
 1. Version 1.4.6 got into the Ubuntu repositories in May 2007.
 1. Development was hosted at [BerliOS](http://developer.berlios.de/projects/harvestman) till June 2008, when it was moved to Google Code.
 1. Contributors to 2.0 till June 2008 - Anand B Pillai (main), Nils Ultveit Moe (EIAO), Morten Goodwin Olsen (EIAO), John Kleven.
 1. Version 2.0 alpha package releases started in Aug 2007 on the website.
 1. HarvestMan won the FOSS India Award in April 2008.
 1. In June 2008, Lukasz Szybalski joined the team.

## Future of HarvestMan ##
> It is a brave new world out there... :-)
> Well, development currently stands at 2.0.5 beta, i.e. the 2.0 version is not
> yet complete. Development is slow and I need to take time off from a regular
> job to do this, so I can't give a final date for it, but hopefully one day
> it will be done :)
\ No newline at end of file
diff --git a/ConfigXml.md b/ConfigXml.md
new file mode 100644
index 0000000..09019ea
--- /dev/null
+++ b/ConfigXml.md
@@ -0,0 +1,108 @@
# HarvestMan config.xml #
## Configuration File Structure ##

The configuration file is split into categories that group the configuration options into different sections. At present, the configuration file has the following namespaces:

 1. **project** - This section holds the options related to the current HarvestMan project.
 1. **network** - This section holds the configuration options related to your network connection.
 1. **download** - This section holds configuration options that affect your downloads in a generic way.
 1. **control** - This section is similar to the above one, but holds options that affect your downloads in a much more specific way. It is a kind of 'tweak' section that allows you to exert more fine-grained control over your projects.
 1. **system** - This section controls the threading options, regional (locale) options and any other options related to the Python interpreter and your computer.
 1. **indexer** - This section holds variables related to how the files are processed after downloading. Right now it holds variables related to localizing links.
 1. **files** - This section holds variables that control the files created by HarvestMan, namely the error log, the message log and an optional URL log.
 1. **display** - This holds a single variable related to creating a browser page for all HarvestMan projects on your computer.
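
Put together, a config.xml file therefore has the following overall skeleton. The element names below are exactly the top-level sections listed above, as they appear in the full sample configuration reproduced in the FAQ; only the nested options are elided here:
```
<HarvestMan>
    <config version="3.0" xmlversion="1.0">
        <project>  ... </project>
        <network>  ... </network>
        <download> ... </download>
        <control>  ... </control>
        <system>   ... </system>
        <files>    ... </files>
        <indexer>  ... </indexer>
        <display>  ... </display>
    </config>
</HarvestMan>
```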

## Control Section ##
### fetchlevel ###

HarvestMan defines five fetchlevels, with values ranging from 0 to 4 inclusive. These define the rules for the download of files from servers other than the server of the starting URL. In general, **increasing the fetch level allows the program to crawl more files** on the Internet.

A fetchlevel of "0" provides the maximum constraint for the download. This limits the download of files to all paths in the starting server, only inside and below the directory of the starting URL.

For example, with a fetchlevel of zero, if your starting URL is http://www.foo.com/bar/images/images.html, the program will download only those files inside the *images* sub-directory and directories below it, and no other file.

The next level, a fetchlevel of "1", again limits the download to the starting server (and sub-domains in it, if the sub-domain variable is not set), but no longer restricts it to the directory of the starting URL; it still does not allow the program to crawl sites other than the starting server. In the above example, this will fetch all links in the server http://www.foo.com encountered in the starting page.

A fetchlevel of "2" performs a fetching of all links in the starting server encountered in the starting URL, as well as any links in outside (external) servers linked directly from pages in the starting server. It does not allow the program to crawl pages linked further away, i.e. the second-level links linked from the external servers.

A fetchlevel of "3" acts like a combination of fetchlevels "0" and "2" minus "1". That is, it gets all links under the directory of the starting URL plus first-level external links, but does not fetch links outside the directory of the starting URL.

A fetchlevel of "4" gives the user no control over the levels of fetching; the program will crawl whichever link is available to it, unless limited by other download control options such as depth control, domain filters, URL filters, file limits, maximum server limits, etc.

Place the parameter in the **control** element under the **extent** section.
Here is a sample XML element including this param (the element follows the same `value` attribute form as the rest of the configuration file):
```
<control>
  ...
  <extent>
    <fetchlevel value="0"/>
    ...
  </extent>
  ...
</control>
```

**The value can be 0, 1, 2, 3 or 4.**

See the FAQ for more explanations.

### maxbandwidth ###
MaxBandwidth controls the speed of crawling. Throttling of bandwidth is useful when we are downloading a huge amount of data from a host; it prevents the user from imposing an effective denial of service on the crawled server. By using this configuration variable you can, for example, limit your download speed to 5 KB per second. At that speed the host should have no problem serving your crawl while proceeding with its normal operations.

Place the parameter in the **control** element under the **limits** section.
Here is a sample XML element including this param (same attribute form as above):
```
<control>
  ...
  <limits>
    ...
    <maxbandwidth value="5"/>
    ...
  </limits>
  ...
</control>
```

**The value needs to be specified in KB/sec, not in bytes/sec.**

### maxbytes ###
MaxBytes controls how many bytes your crawl will download. It is useful when we are downloading a huge amount of data from a host and, in conjunction with maxbandwidth, want to limit how much data we download. By using this configuration variable together with maxbandwidth you can, for example, set your crawl to download 10 MB at 5 KB/s. With this fine-grained control of your download size and speed, the host should have no problem serving your crawl while proceeding with its normal operations.

Place the parameter in the **control** element under the **limits** section.
Here is a sample XML element including this param:
```
<control>
  ...
  <limits>
    ...
    <maxbytes value="10MB"/>
    ...
  </limits>
  ...
</control>
```

**The value accepts plain numbers (assumed to be bytes), KB, MB and GB.**
```
<maxbytes value="5000"/>  <!-- End crawl at 5000 bytes -->
<maxbytes value="10KB"/>  <!-- End crawl at 10 KB -->
<maxbytes value="50MB"/>  <!-- End crawl at 50 MB -->
<maxbytes value="1GB"/>   <!-- End crawl at 1 GB -->
```
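
To make the throttling arithmetic concrete, here is a minimal sketch (in modern Python, with hypothetical names; this is not HarvestMan's implementation) of how a download loop can hold itself to a configured bandwidth: after every chunk received, sleep long enough that the average rate stays at or below the limit.
```
import time

class Throttle:
    """Cap the average download rate at max_bytes_per_sec (illustrative)."""

    def __init__(self, max_bytes_per_sec):
        self.limit = float(max_bytes_per_sec)
        self.start = time.time()
        self.received = 0

    def account(self, nbytes):
        self.received += nbytes
        # How long this many bytes *should* take at the configured rate.
        expected = self.received / self.limit
        elapsed = time.time() - self.start
        if expected > elapsed:
            time.sleep(expected - elapsed)

throttle = Throttle(5 * 1024)   # ~5 KB/s, as in the example above
# In a download loop, call throttle.account(len(chunk)) after each chunk.
```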
\ No newline at end of file
diff --git a/FAQ.md b/FAQ.md
new file mode 100644
index 0000000..dbf965b
--- /dev/null
+++ b/FAQ.md
@@ -0,0 +1,1068 @@
## This is still a work in progress and has been lifted almost verbatim from the HarvestMan web-site. A lot of the information is out of date and needs to be updated. The FAQ also doesn't conform to wiki style, so proceed with care! ##

HarvestMan - FAQ
Version 2.0

NOTE: The FAQ is currently being modified to be in sync with HarvestMan 1.4, so you might find that some parts of the FAQ are inconsistent with the rest of it. This is because some of the FAQ has been modified, while the rest is still to be modified.

 * 1. Overview
   * 1.1. What is HarvestMan?
   * 1.2. Why do you call it HarvestMan?
   * 1.3. What can HarvestMan be used for?
   * 1.4. What HarvestMan cannot be used for...
   * 1.5. What do I need to run HarvestMan?
 * 2. Usage
   * 2.1. How do I run HarvestMan?
   * 2.2. What is the HarvestMan Configuration (config) file?
   * 2.3. Can HarvestMan be run as a command-line application?
 * 3. Architecture
   * 3.1. What are "tracker" threads and what is their function?
   * 3.2. What are "crawler" threads?
   * 3.3. What are "fetcher" threads?
   * 3.4. How do the crawlers and fetchers co-operate?
   * 3.5. How many different queues of information flow are there?
   * 3.6. What are worker (downloader) threads?
   * 3.7. How does a HarvestMan project finish?
 * 4. Protocols & File Types
   * 4.1. What are the protocols supported by HarvestMan?
   * 4.2. What kind of files can be downloaded by HarvestMan?
   * 4.3. Can HarvestMan run javascript code?
   * 4.4. Can HarvestMan run java applets?
   * 4.5. How to prevent downloading of HTML & CGI forms?
   * 4.6. Does HarvestMan download dynamically generated cgi files (server-side)?
   * 4.7. How does HarvestMan determine the filetype of dynamically generated server-side files?
   * 4.8. Does HarvestMan obey the Robots Exclusion Protocol?
   * 4.9. Can I restart a project to download links that failed (caching mechanism)?
 * 5. Network, Security & Access Rules
   * 5.1. Can HarvestMan work across proxies?
   * 5.2. Does HarvestMan support proxy authentication?
   * 5.3. Does HarvestMan work inside an intranet?
   * 5.4. Can HarvestMan crawl a site that requires HTTP authentication?
   * 5.5. Can HarvestMan crawl a site that requires HTTPS (SSL) authentication?
   * 5.6. Can I prevent the program from accessing specific domains?
   * 5.7. Can I specify download filters to prevent download of certain files or directories on a server?
   * 5.8. Is it possible to control the depth of traversal in a domain?
 * 6. Download Control - Basic
   * 6.1. Can I set a limit on the maximum number of files that are downloaded?
   * 6.2. Can I set a limit on the number of external servers crawled?
   * 6.3. Can I set a limit on the number of outside directories that are crawled?
   * 6.4. How can I prevent download of images?
   * 6.5. How can I prevent download of stylesheets?
   * 6.6. How to disable traversal of external servers?
   * 6.7. Can I specify a project timeout?
   * 6.8. Can I specify a thread timeout for worker threads?
   * 6.9. How to tell the program to retry failed links?
 * 7. Download Control - Advanced
   * 7.1. What are fetchlevels and how can I use them?
 * 8. Application development & customization
   * 8.1. I want to customize HarvestMan for a research project. Can you help out?
   * 8.2. I want to customize HarvestMan for a commercial project. Can you help out?
 * 9. Diagrams
   * 9.1. HarvestMan Class Diagram

1. Overview

1.1. What is HarvestMan?
HarvestMan (with a capital 'H' and a capital 'M') is a webcrawler program. HarvestMan belongs to a family of programs frequently referred to as webcrawlers, webbots, web-robots, offline browsers, etc.

These programs are used to crawl a distributed network of computers like the Internet and download files locally.

1.2. Why do you call it HarvestMan?
The name "HarvestMan" is derived from a kind of small spider-like arachnid found in different parts of the world, called "daddy longlegs" or Opiliones.

Since this program is a web-spider, the analogy was compelling to name it after some species of spider. Also, the process of downloading data from websites is sometimes called harvesting.

Both these similarities gave rise to the name HarvestMan.

1.3. What can HarvestMan be used for?
HarvestMan is a desktop tool for web search/data gathering. It works on the client side.

As of the most recent version, HarvestMan can be used for:

 1. Downloading a website or a part of it.
 1. Downloading certain files from a website (matching certain patterns).
 1. Searching a website for keywords & downloading the files containing them.
 1. Scanning a website for links and downloading them selectively using filters.

1.4. What HarvestMan cannot be used for...
HarvestMan is a small-to-medium size web-crawler mostly intended for personal use or for use by a small group. It cannot be used for massive data harvesting from the web. However, a project to create a large-scale, distributed web crawler based on HarvestMan is underway. It is called 'Distributed HarvestMan', or 'D-HarvestMan' in short. D-HarvestMan is currently at a prototype stage.

Projects like EIAO have been able to customize HarvestMan for medium-to-large scale data gathering from the Internet. The EIAO project uses HarvestMan to download as many as 100,000 files from European websites daily.

What HarvestMan is not:

 1. HarvestMan is not an Internet search engine.
 1. HarvestMan is not an indexer or taxonomy tool for web documents.
 1. HarvestMan is not a server-side program.

1.5. What do I need to run HarvestMan?
HarvestMan is written in a programming language called Python. Python is an interactive, interpreted, object-oriented programming language created by Guido van Rossum and maintained by a team of volunteers from all over the world. Python is a very versatile language which can be used for a variety of tasks ranging from scripting to web frameworks to developing highly complex applications.

HarvestMan is written completely in Python. It works with Python version 2.3 upward on all platforms where Python runs. However, HarvestMan has some performance optimizations that require the latest version of Python, which is Python 2.4; that is the suggested version. HarvestMan will also work with Python 2.3, but with reduced performance.

You need a machine with a rather large amount of RAM to run HarvestMan. HarvestMan tends to use system memory heavily, especially when performing large data downloads or when run with more than 10 threads. It is preferable to have a machine with 512 MB RAM and a fast CPU (Intel Pentium IV or higher) to run HarvestMan efficiently.

2. Usage

2.1. How do I run HarvestMan?
HarvestMan is a command-line application. It has no GUI.

From the 1.4 version, HarvestMan can be run by calling the main HarvestMan module as an executable script on the command line, as follows:

% harvestman.py

This works provided you have edited your PATH environment variable to include the local HarvestMan installation directory on your machine. If you have not, you can run HarvestMan by passing the harvestman.py module as an argument to the Python interpreter, as follows:

% python harvestman.py

On Win32 systems, if you have associated the ".py" extension with the appropriate python.exe, you can run HarvestMan without invoking the interpreter explicitly.

Note that this assumes you have a config file named config.xml in the directory from which you invoke HarvestMan. If you don't have a config file locally, you need to use the command-line options of HarvestMan to pass a different configuration file to the program.

2.2. What is the HarvestMan Configuration (config) file?
The standard way to run HarvestMan is to run the program with no arguments, allowing it to pick up its configuration parameters from an XML configuration file, which is named config.xml by default.

It is also possible to pass command-line options to HarvestMan. HarvestMan supports a limited set of command-line options which allow you to run the program without using a configuration file. You can learn more about the command-line options in the HarvestMan command-line options FAQ.

The HarvestMan configuration file is an XML file with the configuration options split into different elements and their hierarchies. A typical HarvestMan configuration file looks as follows:

```
<HarvestMan>
    <config version="3.0" xmlversion="1.0">
        <project>
            <url>http://www.python.org/doc/current/tut/tut.html</url>
            <name>pytut</name>
            <basedir>~/websites</basedir>
            <verbosity value="3"/>
            <timeout value="600.0"/>
        </project>

        <network>
            <proxy>
                <proxyserver></proxyserver>
                <proxyuser></proxyuser>
                <proxypasswd></proxypasswd>
                <proxyport value=""/>
            </proxy>
            <urlserver status="0">
                <urlhost>localhost</urlhost>
                <urlport value="3081"/>
            </urlserver>
        </network>

        <download>
            <types>
                <html value="1"/>
                <images value="1"/>
                <javascript value="1"/>
                <javaapplet value="1"/>
                <forms value="0"/>
                <cookies value="1"/>
            </types>
            <cache status="1">
                <datacache value="1"/>
            </cache>
            <misc>
                <retries value="1"/>
                <tidyhtml value="1"/>
            </misc>
        </download>

        <control>
            <links>
                <imagelinks value="1"/>
                <stylesheetlinks value="1"/>
            </links>
            <extent>
                <fetchlevel value="0"/>
                <extserverlinks value="0"/>
                <extpagelinks value="1"/>
                <depth value="10"/>
                <extdepth value="0"/>
                <subdomain value="0"/>
            </extent>
            <limits>
                <maxextservers value="0"/>
                <maxextdirs value="0"/>
                <maxfiles value="5000"/>
                <maxfilesize value="1048576"/>
                <connections value="5"/>
                <requests value="5"/>
                <timelimit value="-1"/>
            </limits>
            <rules>
                <robots value="1"/>
                <urlpriority></urlpriority>
                <serverpriority></serverpriority>
            </rules>
            <filters>
                <urlfilter></urlfilter>
                <serverfilter></serverfilter>
                <wordfilter></wordfilter>
                <junkfilter value="0"/>
            </filters>
        </control>

        <system>
            <workers status="1" size="10" timeout="200"/>
            <trackers value="4"/>
            <locale>american</locale>
            <fastmode value="1"/>
        </system>

        <files>
            <urllistfile></urllistfile>
            <urltreefile></urltreefile>
        </files>

        <indexer>
            <localise value="2"/>
        </indexer>

        <display>
            <browsepage value="1"/>
        </display>
    </config>
</HarvestMan>
```

The current configuration file holds more than 60 configuration options. The variables that are essential to a project are project.url, project.name and project.basedir. These determine the identity of a HarvestMan crawl and normally require unique values for each HarvestMan project.

For a more detailed discussion of the config file, see the ConfigXml page.

2.3. Can HarvestMan be run as a command-line application?
Yes, it can. For details on this, refer to the Command line FAQ.

3. Architecture

3.1. HarvestMan is a multithreaded program. What is the threading architecture of HarvestMan?
HarvestMan uses a multithreaded architecture. It assigns specific functions to each thread, which helps the program complete its downloads at a relatively fast pace.

HarvestMan is a network-bound program. This means that most of the program's time is spent waiting for network connections, fetching network data and closing the connections. HarvestMan can be considered not to be IO-bound, since we can assume that there is ample disk space for the downloads, at least in most common cases.

Whenever a program is network-bound or IO-bound, it helps to split the task across multiple threads of control, which perform their functions without affecting other threads or the main thread.

HarvestMan uses this theory to create a multithreaded system of co-operating threads, most of which gather data from the network, process the data and write the files to disk. These threads are called tracker threads. The name is derived from the fact that such a thread tracks a web-page, downloads its links and further tracks each of the pages pointed to by the links, doing this recursively for each link.

HarvestMan uses a pre-emptive threaded architecture where the trackers are launched when the program starts. They wait in turn for work, which is managed by a thread-safe queue of data. Tracker threads post data to and retrieve data from the queue. These threads die only at the end of the program; otherwise they spin in a loop, looking for data.

There are two different kinds of trackers, namely crawlers and fetchers. These are described in the sections below.

3.2. What are "crawler" threads?
Crawlers, or crawler-threads, are trackers which perform the specific function of parsing a web-page. They parse the data from a web-page, extract the links, and post the links to a URL queue.

The crawlers get their data from a data queue.

3.3. What are "fetcher" threads?
Fetchers, or fetcher-threads, are trackers which perform the function of "fetching", i.e. downloading the files pointed to by URLs. They download URLs which do not produce web-page content (HTML/XHTML) statically or dynamically, i.e. non-webpage URLs such as images, PDF files, ZIP files, etc.

The fetchers get their data from the URL queue, and they post web-page data to the data queue.

3.4. How do the crawlers and fetchers co-operate?
The design of HarvestMan forces the crawlers and fetchers to be synergic. This is because the crawlers obtain their data (web-page data) from the data queue and post their results to the URL queue. The fetchers in turn obtain their data (URLs) from the URL queue and post their results to the data queue.

The program starts off by spawning the first thread, which is a fetcher. It gets the web-page data for the starting page and posts it to the data queue. The first crawler in line gets this data, parses it, extracts the links and posts them to the URL queue. The next fetcher thread waiting on the URL queue gets this data, and the process repeats in a synergic manner till the program runs out of URLs to parse, at which point the project ends. A minimal sketch of this two-queue arrangement is shown below.
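
The following is an illustrative sketch of that two-queue pattern, written in modern Python for brevity (HarvestMan itself targeted Python 2.3/2.4). The names are hypothetical, and the real program adds download rules, duplicate checks and many more threads:
```
import re
import threading
import time
import urllib.request
from queue import Queue

url_queue = Queue()    # crawlers feed this; fetchers feed off it
data_queue = Queue()   # fetchers feed this; crawlers feed off it

def fetcher():
    while True:
        url = url_queue.get()
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
            data_queue.put(data)          # hand web-page data to a crawler
        except OSError:
            pass                          # the real program logs and retries
        finally:
            url_queue.task_done()

def crawler():
    while True:
        data = data_queue.get()
        # Parse out the links and hand them back to the fetchers.
        for link in re.findall(rb'href="(http[^"]+)"', data):
            url_queue.put(link.decode())
        data_queue.task_done()

# Spawn the "tracker" threads up front, before the crawl starts.
for func in (fetcher, crawler):
    threading.Thread(target=func, daemon=True).start()

url_queue.put("http://www.python.org/")  # the starting URL

# The main thread polls until both queues go idle (see 3.7 below).
while url_queue.unfinished_tasks or data_queue.unfinished_tasks:
    time.sleep(2)
```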

3.5. How many different queues of information flow are there?
There are two queues of data flow, the URL queue and the data queue.

The crawlers feed the URL queue and feed off the data queue. The fetchers feed the data queue and feed off the URL queue. (Here, to feed = to post data to; to feed off = to get data from.)

3.6. What are "worker" (downloader) threads?
Apart from the tracker threads, you can specify additional threads to take charge of downloading URLs. The URLs can be downloaded in these threads instead of consuming the time of the fetcher threads.

These threads are launched a priori, similar to the tracker threads, before the start of the crawl. By default, HarvestMan launches a set of 10 of these worker threads, which are managed by a thread-pool object. The fetcher threads delegate the actual job of downloading to the workers. However, if the worker threads are disabled, the fetchers will do the downloads themselves.

These threads also die only at the end of a HarvestMan crawl.

3.7. How does a HarvestMan project finish?
(Make sure that you have read items 3.1 - 3.6 before reading this.)

As mentioned before, HarvestMan works by the co-operation of the crawler and fetcher families of tracker threads, each feeding on the data provided by the other.

A project nears its end when there are no more web-pages to crawl according to the configuration of the project. This means that the fetchers have less web-page data to fetch, which in turn dries up the data source for the crawlers. The crawlers in turn go idle, posting less data to the URL queue, which again dries up the data source for the fetchers. The synergy works in this phase also, just as it does when the project is active and all tracker threads are running.

After some time, all the tracker threads go idle, as there is no more data to feed from the queues. The main thread of HarvestMan enters a loop immediately after spawning all the tracker threads, spinning and checking for this idle condition every one or two seconds. Once it detects that all threads have gone idle, it ends the threads, performs post-download operations and cleanup, and brings the program to an end.

4. Protocols & File Types

4.1. What are the protocols supported by HarvestMan?
HarvestMan supports the following protocols:

 1. HTTP
 1. FTP

Support for the HTTPS (SSL) protocol depends on the Python version you are running. Python 2.3 and later have HTTPS support built in, so HarvestMan will support the HTTPS protocol if you are running it using Python 2.3 or a higher version.

The GOPHER and FILE:// protocols should also work with HarvestMan.

4.2. What kind of files can be downloaded by HarvestMan?
HarvestMan can download **any** kind of file, as long as it is served up by a web-server using HTTP/FTP/HTTPS. There are no restrictions on the type of file or the size of a single file.

HarvestMan assumes that URLs with the following extensions are web-pages, static or dynamic:

'.htm', '.html', '.shtm', '.shtml', '.php', '.php3', '.php4', '.asp', '.aspx', '.jsp', '.psp', '.pl', '.cgi', '.stx', '.cfm', '.cfml', '.cms'

A URL with no extension is also assumed to be a web-page. However, the program has a mechanism by which it looks at the headers of the HTTP response and figures out the actual file type of the URL by doing a mimetype analysis. This happens immediately after the HTTP request is answered by the server. So if the program finds that the assumed type of a URL is different from the actual type, it sets the type correctly at this point; the sketch below outlines the idea.

You can restrict download of certain files by creating specific filters for HarvestMan. These are described in a later section.
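
In outline, the extension assumption and the mimetype correction work as sketched here (modern Python, hypothetical helper names; not HarvestMan's actual code):
```
import urllib.request

WEBPAGE_EXTNS = ('.htm', '.html', '.shtm', '.shtml', '.php', '.php3',
                 '.php4', '.asp', '.aspx', '.jsp', '.psp', '.pl', '.cgi',
                 '.stx', '.cfm', '.cfml', '.cms')

def assumed_webpage(url):
    """Guess from the URL alone: a known extension, or no extension at all."""
    last = url.rstrip('/').rsplit('/', 1)[-1]
    return '.' not in last or last.lower().endswith(WEBPAGE_EXTNS)

def actual_webpage(url):
    """Correct that guess using the server's Content-Type response header."""
    with urllib.request.urlopen(url, timeout=10) as response:
        ctype = response.headers.get_content_type()
    return ctype in ('text/html', 'application/xhtml+xml')
```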

A related question is the HTML tags supported by HarvestMan, i.e. the tags from which it extracts the links it downloads. These are listed below.

 1. Hypertext links of the form `<a href="...">`.
 1. Image links of the form `<img src="...">`.
 1. Stylesheet links of the form `<link rel="stylesheet" href="...">`.
 1. Javascript source files of the form `<script src="...">`.
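
As an illustration of how these four tag types can be picked out of a page, here is a short sketch using Python's standard HTMLParser (modern Python; this is not HarvestMan's actual parser):
```
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect link targets from the four tag types listed above."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:             # hypertext links
            self.links.append(attrs['href'])
        elif tag == 'img' and 'src' in attrs:          # image links
            self.links.append(attrs['src'])
        elif tag == 'link' and attrs.get('rel') == 'stylesheet':
            self.links.append(attrs.get('href'))       # stylesheets
        elif tag == 'script' and 'src' in attrs:       # javascript sources
            self.links.append(attrs['src'])

parser = LinkExtractor()
parser.feed('<a href="a.html"><img src="b.png">'
            '<link rel="stylesheet" href="c.css"><script src="d.js"></script>')
print(parser.links)   # ['a.html', 'b.png', 'c.css', 'd.js']
```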