This directory contains an example Scrapy project integrated with scrapy-redis.
By default, all items are sent to redis (key <spider>:items). All spiders
schedule requests through redis, so you can start additional spiders to speed
up the crawling.
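To verify that items are actually landing in redis, you can inspect the items
key with the redis client. A minimal sketch, assuming a redis server on
localhost:6379 and that the dmoz spider has already stored some items:

.. code-block:: python

    import redis

    # Assumes a local redis server on the default port.
    r = redis.Redis()

    # Items are pushed to the "<spider>:items" list, e.g. "dmoz:items".
    print(r.llen("dmoz:items"))           # number of serialized items waiting
    print(r.lrange("dmoz:items", 0, 0))   # peek at the first item (JSON bytes)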
The project contains the following spiders:

- dmoz

  This spider simply scrapes dmoz.org.

- myspider_redis

  This spider uses redis as a shared requests queue and uses
  myspider:start_urls as the start URLs seed. For each URL, the spider outputs
  one item.

- mycrawler_redis

  This spider uses redis as a shared requests queue and uses
  mycrawler:start_urls as the start URLs seed. For each URL, the spider follows
  all links. (You can seed these start-URL keys yourself; see the sketch after
  this list.)
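Both redis-backed spiders sit idle until their start URLs key is populated.
Here is a minimal sketch of seeding those keys from Python (it assumes a redis
server on localhost:6379; redis-cli lpush works just as well):

.. code-block:: python

    import redis

    # Assumes a local redis server on the default port.
    r = redis.Redis()

    # myspider_redis reads its seed URLs from "myspider:start_urls" ...
    r.lpush("myspider:start_urls", "http://www.dmoz.org/")

    # ... and mycrawler_redis reads from "mycrawler:start_urls".
    r.lpush("mycrawler:start_urls", "http://www.dmoz.org/")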
Note

All requests are persisted by default. You can clear the queue by using the
SCHEDULER_FLUSH_ON_START setting. For example: scrapy crawl dmoz -s
SCHEDULER_FLUSH_ON_START=1.
This example illustrates how to share a spider's requests queue across
multiple spider instances, which is especially useful for broad crawls.
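The queue sharing is driven by the project's scrapy-redis settings. The sketch
below shows the kind of options involved; the setting names are standard
scrapy-redis settings, but check the project's own settings.py for the
authoritative values:

.. code-block:: python

    # settings.py -- illustrative sketch, not a copy of the project's file.

    # Route requests through the redis-backed scheduler and dupefilter so
    # every spider instance shares one queue and one set of seen requests.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Keep the queue between runs; this is what makes crawls resumable.
    SCHEDULER_PERSIST = True
    # Uncomment (or pass -s SCHEDULER_FLUSH_ON_START=1) to clear it on start.
    # SCHEDULER_FLUSH_ON_START = True

    # Push scraped items to the "<spider>:items" redis list.
    ITEM_PIPELINES = {
        "scrapy_redis.pipelines.RedisPipeline": 300,
    }

    # Location of the redis server.
    REDIS_URL = "redis://localhost:6379"

With these settings in place, the crawl can be stopped, resumed, and scaled
out as follows: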
- Check that the scrapy_redis package is in your PYTHONPATH
- Run the crawler for the first time, then stop it
cd example-project
scrapy crawl dmoz
... [dmoz] ...
^C
- Run the crawler again to resume the stopped crawl
scrapy crawl dmoz
... [dmoz] DEBUG: Resuming crawl (9019 requests scheduled)
- Start one or more additional scrapy crawlers
scrapy crawl dmoz
... [dmoz] DEBUG: Resuming crawl (8712 requests scheduled)
- Start one or more post-processing workers
python process_items.py dmoz:items -v
...
Processing: Kilani Giftware (http://www.dmoz.org/Computers/Shopping/Gifts/)
Processing: NinjaGizmos.com (http://www.dmoz.org/Computers/Shopping/Gifts/)
...
The class scrapy_redis.spiders.RedisSpider enables a spider to read its urls
from redis. The urls in the redis queue are processed one after another; if
the first request yields more requests, the spider processes those before
fetching another url from redis.

For example, create a file myspider.py with the code below:

.. code-block:: python

    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        # Start urls are read from the "myspider:start_urls" redis list
        # ("<name>:start_urls" by default).
        name = "myspider"

        def parse(self, response):
            # do stuff
            pass

Then:
- run the spider:
scrapy runspider myspider.py
- push json data to redis (the key is the spider's default "myspider:start_urls" list):
redis-cli lpush myspider:start_urls '{"url": "https://example.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }'
Note
- These spiders rely on the spider idle signal to fetch start urls, so there
may be a few seconds of delay between the time you push a new url and the
moment the spider starts crawling it.
- Also pay attention to the json formatting; building the payload with
json.dumps avoids quoting mistakes, as shown in the sketch below.
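A minimal sketch of pushing the same payload from Python, letting json.dumps
take care of the formatting (it assumes a local redis server; the field values
are the ones from the redis-cli example above):

.. code-block:: python

    import json
    import redis

    # Assumes a local redis server on the default port.
    r = redis.Redis()

    # json.dumps guarantees well-formed JSON, so there is no manual quoting.
    payload = json.dumps({
        "url": "https://example.com",
        "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"},
        "url_cookie_key": "fertxsas",
    })
    r.lpush("myspider:start_urls", payload)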
The process_items.py script provides an example of consuming the items queue:

.. code-block:: bash

    python process_items.py --help
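For orientation, here is a minimal sketch of what such a consumer does. It is
not the actual script (see process_items.py for the real implementation) and
it assumes a local redis server plus the dmoz:items key used earlier:

.. code-block:: python

    import json
    import redis

    # Assumes a local redis server on the default port.
    r = redis.Redis()

    while True:
        # The RedisPipeline serializes items as JSON and pushes them to the
        # "<spider>:items" list; block until one is available.
        _, data = r.blpop("dmoz:items")
        item = json.loads(data)
        # Field names depend on the spider's item definition.
        print("Processing:", item)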
To run the example project with Docker, you need the following applications:
- docker (https://docs.docker.com/installation/)
- docker-compose (https://docs.docker.com/compose/install/)
For implementation details, see Dockerfile and docker-compose.yml, and read the
official docker documentation.
To start the sample example-project (add -d to run it as a daemon):
docker-compose up
To scale the crawler (for example, to 4 instances):
docker-compose scale crawler=4