Simple multiprocessing crawler in Python
Using Python's multiprocessing module together with either threading or gevent, the task is to write a web scraper that takes a huge file as input (~1 million rows), with one URL per line. The scraper fetches each URL, parses the content with BeautifulSoup, and checks whether the page references "jquery.js". If it does, the URL is written to "accepted.csv"; otherwise it is written to "rejected.csv".
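The repository's `crawler.py` is not reproduced here, but a minimal sketch of the described approach could look like the following. It uses one process per CPU core, with a thread pool inside each process for the I/O-bound fetching (gevent would be an alternative for that layer). The helper names (`check_url`, `worker`), the use of `requests` as the HTTP client, and the script-tag heuristic for detecting jquery.js are illustrative assumptions, not the repository's actual code.

```python
import csv
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def check_url(url):
    """Fetch a URL and report whether its HTML references jquery.js."""
    try:
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        # One interpretation of the check: accept the page if any
        # <script> tag's src attribute mentions jquery.js.
        for script in soup.find_all("script", src=True):
            if "jquery.js" in script["src"]:
                return url, True
        return url, False
    except requests.RequestException:
        # Treat unreachable or erroring URLs as rejected.
        return url, False


def worker(urls):
    """Check a batch of URLs concurrently with threads (I/O-bound work)."""
    with ThreadPoolExecutor(max_workers=20) as pool:
        return list(pool.map(check_url, urls))


def main():
    with open("urls.csv") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Split the URL list into one batch per CPU core.
    n = multiprocessing.cpu_count()
    batches = [urls[i::n] for i in range(n)]

    with multiprocessing.Pool(n) as pool, \
            open("accepted.csv", "w", newline="") as acc, \
            open("rejected.csv", "w", newline="") as rej:
        accepted, rejected = csv.writer(acc), csv.writer(rej)
        # Each batch is checked in a separate process; results stream
        # back as they finish and are routed to the matching file.
        for results in pool.imap_unordered(worker, batches):
            for url, ok in results:
                (accepted if ok else rejected).writerow([url])


if __name__ == "__main__":
    main()
```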
Setup:

```sh
virtualenv env/
source env/bin/activate
pip install -r requirements.txt
```

Run the crawler:

```sh
python crawler.py
```

Run the tests:

```sh
python test_crawler.py
```
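For reference, a `requirements.txt` along these lines would cover the dependencies the task calls for, assuming `requests` is the HTTP client as in the sketch above; the actual pins in the repository may differ:

```
beautifulsoup4
requests
gevent
```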
`urls.csv` is the input file, containing the list of URLs to be processed.

`accepted.csv` and `rejected.csv` are the output files; the crawler creates them and writes each URL into the appropriate one.
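Since the input is one URL per line, `urls.csv` looks like this (the example URLs are placeholders):

```
http://example.com/
http://example.org/
```

After a run, each of these lines ends up in either `accepted.csv` or `rejected.csv`, depending on whether the fetched page references jquery.js.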