Sample project to crawl a web page for PDFs, Excel files, and URLs
Steps to get started
Install Scrapy on Windows (via pip) - pip install scrapy. If pip is not installed, install it first: download it and install it using python.
You might need to install a few other packages, including lxml and Visual C++.
Once everything is set up, you can either clone this repo or follow the steps below.
Steps to get this project running and Scrapy settings
Directory structure followed:
    tutorial/
        scrapy.cfg      # deploy configuration file
        tutorial/       # project's Python module, you'll import your code from here
            # project items file
            # project pipelines file
            # project settings file
            spiders/    # a directory where you'll later put your spiders
1. Replace domain and start_url in the spider so that it points to your web page. You can set rules in the spider to restrict some URLs and allow the URLs you want to crawl:
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//body//a/@href'))),
        Rule(SgmlLinkExtractor(allow=('<start_url>',)), callback='parse_item'),
    )
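To see what the allow entry in the rules above does, here is a minimal standard-library sketch. It assumes that each allow entry is treated as a regular expression searched against the absolute URL, which is how Scrapy's link extractors are documented to behave; it is an approximation, not Scrapy's own code. The is_allowed helper and the example.com URLs are hypothetical stand-ins for <start_url>.

```python
import re


def is_allowed(url, allow_patterns):
    """Return True if any pattern matches somewhere in the URL (hypothetical helper)."""
    return any(re.search(pattern, url) for pattern in allow_patterns)


# example.com stands in for <start_url>; re.escape keeps the dots literal,
# otherwise "." in the pattern would match any character.
allow = (re.escape("http://example.com/docs"),)

candidates = [
    "http://example.com/docs/report.pdf",
    "http://example.com/blog/post.html",
    "http://other.org/docs/report.pdf",
]

# Only URLs matching the allow pattern would be handed to parse_item.
followed = [url for url in candidates if is_allowed(url, allow)]
print(followed)
```

Because the entry is a regular expression, it is worth escaping a literal URL before putting it in allow. Note also that recent Scrapy releases provide LinkExtractor in scrapy.linkextractors in place of SgmlLinkExtractor.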
2. Go to the project root directory and run - scrapy crawl, followed by the name of your spider.
3. You can use a Scrapy pipeline to chain tasks together, similar to what's done in this project. After an item has been scraped, Scrapy passes it to the tasks in the pipeline, which here download the Excel files and PDFs. Set the pipeline in the Scrapy settings:
    ITEM_PIPELINES = {'sample.pipelines.SamplePipeline': 1}
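As a rough illustration of what such a pipeline class could look like, here is a sketch that needs only the standard library, since a Scrapy item pipeline is a plain class with a process_item(item, spider) method. It is not the code from this repo: the "url" item field, the downloads directory, the extension list, and the injectable fetch argument are all assumptions made for the example.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Assumed set of file types worth downloading (PDF and Excel).
DOWNLOAD_EXTENSIONS = {".pdf", ".xls", ".xlsx"}


class SamplePipeline:
    """Sketch of an item pipeline that saves PDF and Excel links to disk."""

    def __init__(self, download_dir="downloads", fetch=urlretrieve):
        self.download_dir = download_dir  # assumed target directory
        self.fetch = fetch  # injectable so the class can be tried offline

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item and expects the item back.
        # Assumption: each item carries the link in a "url" field.
        path = urlparse(item["url"]).path
        extension = os.path.splitext(path)[1].lower()
        if extension in DOWNLOAD_EXTENSIONS:
            os.makedirs(self.download_dir, exist_ok=True)
            target = os.path.join(self.download_dir, os.path.basename(path))
            self.fetch(item["url"], target)
        return item
```

The number in ITEM_PIPELINES sets the order in which pipelines run, lower values first, so several such classes can be appended one after another.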
4. You can set the depth limit up to which Scrapy crawls recursively in the Scrapy settings (DEPTH_LIMIT).
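For example, the depth limit could be set like this in the project settings file. The value 2 is only an assumption for illustration; Scrapy's default for DEPTH_LIMIT is 0, which means no limit.

```python
# Project settings file (sketch): stop following links more than
# two hops away from the start URL. The value 2 is an assumed example.
DEPTH_LIMIT = 2
```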