Skip to content

Redis-based components for scrapy that allows distributed crawling

License

Notifications You must be signed in to change notification settings

wyrover/scrapy-redis

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Redis-based components for Scrapy
=================================

This is a initial work on Scrapy-Redis integration, not production-tested.
Use it at your own risk!

Features:
* Distributed crawling/scraping
* Distributed post-processing

Requirements:
* Scrapy >= 0.13 (development version)
* redis-py (tested on 2.4.9)
* redis server (tested on 2.2-2.4)

Available Scrapy components:
* Scheduler
* Duplication Filter
* Item Pipeline

Usage
-----

In your settings.py:

# enables scheduling storing requests queue in redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# don't cleanup redis queues, allows to pause/resume crawls
SCHEDULER_PERSIST = True

# store scraped item in redis for post-processing
ITEM_PIPELINES = [
    'scrapy_redis.pipelines.RedisPipeline',
]


Running the example project
---------------------------

You can test the funcionality following the next steps:

1. Setup scrapy_redis package in your PYTHONPATH

2. Run the crawler for first time then stop it

  $ cd example-project
  $ scrapy crawl dmoz
  ... [dmoz] ...
  ^C

3. Run the crawler again to resume stopped crawling

  $ scrapy crawl dmoz
  ... [dmoz] DEBUG: Resuming crawl (9019 requests scheduled)

4. Start one or more additional scrapy crawlers

  $ scrapy crawl dmoz
  ... [dmoz] DEBUG: Resuming crawl (8712 requests scheduled)

5. Start one or more post-processing workers

  $ python process_items.py
  Processing: Kilani Giftware (http://www.dmoz.org/Computers/Shopping/Gifts/)
  Processing: NinjaGizmos.com (http://www.dmoz.org/Computers/Shopping/Gifts/)
  ...

That's it.

About

Redis-based components for scrapy that allows distributed crawling

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published