Simple multiprocessing crawler in Python
Using Python's multiprocessing module together with either threading or gevent, the task is to write a web scraper that takes a huge file as input (~1 million rows), with one URL per line. The scraper fetches each URL, parses the content with BeautifulSoup, and checks whether the page references "jquery.js". If it does, the URL is written to "accepted.csv"; otherwise it is written to "rejected.csv".
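The repository's `crawler.py` is not reproduced here, but a minimal sketch of the described approach could look like the following. It uses one process per CPU core, with a thread pool inside each process for the I/O-bound fetching (gevent would be an alternative for that layer). The helper names (`check_url`, `worker`), the use of `requests` as the HTTP client, and the script-tag heuristic for detecting jquery.js are illustrative assumptions, not the repository's actual code.

```python
import csv
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def check_url(url):
    """Fetch a URL and report whether its HTML references jquery.js."""
    try:
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        # One interpretation of the check: accept the page if any
        # <script> tag's src attribute mentions jquery.js.
        for script in soup.find_all("script", src=True):
            if "jquery.js" in script["src"]:
                return url, True
        return url, False
    except requests.RequestException:
        # Treat unreachable or erroring URLs as rejected.
        return url, False


def worker(urls):
    """Check a batch of URLs concurrently with threads (I/O-bound work)."""
    with ThreadPoolExecutor(max_workers=20) as pool:
        return list(pool.map(check_url, urls))


def main():
    with open("urls.csv") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Split the URL list into one batch per CPU core.
    n = multiprocessing.cpu_count()
    batches = [urls[i::n] for i in range(n)]

    with multiprocessing.Pool(n) as pool, \
            open("accepted.csv", "w", newline="") as acc, \
            open("rejected.csv", "w", newline="") as rej:
        accepted, rejected = csv.writer(acc), csv.writer(rej)
        # Each batch is checked in a separate process; results stream
        # back as they finish and are routed to the matching file.
        for results in pool.imap_unordered(worker, batches):
            for url, ok in results:
                (accepted if ok else rejected).writerow([url])


if __name__ == "__main__":
    main()
```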
Setup:

```sh
virtualenv env/
source env/bin/activate
pip install -r requirements.txt
```

Run the crawler:

```sh
python crawler.py
```

Run the tests:

```sh
python test_crawler.py
```
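For reference, a `requirements.txt` along these lines would cover the dependencies the task calls for, assuming `requests` is the HTTP client as in the sketch above; the actual pins in the repository may differ:

```
beautifulsoup4
requests
gevent
```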
`urls.csv` is the input file, containing the list of URLs to be processed.

`accepted.csv` and `rejected.csv` are the output files; the crawler creates them and writes each URL into the appropriate one.
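Since the input is one URL per line, `urls.csv` looks like this (the example URLs are placeholders):

```
http://example.com/
http://example.org/
```

After a run, each of these lines ends up in either `accepted.csv` or `rejected.csv`, depending on whether the fetched page references jquery.js.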