Merge branch 'dirbot'
rmax committed Oct 24, 2013
2 parents b65c4bb + 0d875f1 commit 997bfe9
Showing 9 changed files with 125 additions and 0 deletions.
49 changes: 49 additions & 0 deletions README.rst
@@ -0,0 +1,49 @@
======
dirbot
======

This is a Scrapy project to scrape websites from public web directories.

This project is only meant for educational purposes.

Items
=====

The items scraped by this project are websites. The item is defined in the
following class::

    dirbot.items.Website

See the source code for more details.
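
As a minimal sketch (not part of the project; the values are placeholders for
illustration), a ``Website`` item behaves like a dict, so fields are assigned
by key::

    from dirbot.items import Website

    item = Website()
    item['name'] = 'Example site'
    item['url'] = 'http://example.com/'
    item['description'] = 'A short description'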

Spiders
=======

This project contains one spider called ``dmoz`` that you can see by running::

    scrapy list

Spider: dmoz
------------

The ``dmoz`` spider scrapes the Open Directory Project (dmoz.org), and it's
based on the dmoz spider described in the `Scrapy tutorial`_.

This spider doesn't crawl the entire dmoz.org site but only a few pages by
default (defined in the ``start_urls`` attribute). These pages are:

* http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
* http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/

So, if you run the spider with ``scrapy crawl dmoz``, it will scrape only
those two pages.

.. _Scrapy tutorial: http://doc.scrapy.org/intro/tutorial.html
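
If you want to persist the scraped items, one option (shown as a sketch, not
part of this commit; the output file name is arbitrary) is to use a feed
export when running the crawl::

    scrapy crawl dmoz -o scraped_items.json -t json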

Pipelines
=========

This project uses a pipeline to filter out websites containing certain
forbidden words in their description. This pipeline is defined in the class::

    dirbot.pipelines.FilterWordsPipeline
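
As a rough sketch of how the pipeline behaves (not part of the project;
``spider=None`` is only acceptable here because the pipeline never uses the
spider argument), ``process_item`` either returns the item or raises
``DropItem``::

    from scrapy.exceptions import DropItem

    from dirbot.items import Website
    from dirbot.pipelines import FilterWordsPipeline

    pipeline = FilterWordsPipeline()
    item = Website(name=['Example'], url=['http://example.com/'],
                   description=['All about politics'])
    try:
        pipeline.process_item(item, spider=None)
    except DropItem as e:
        print e  # prints: Contains forbidden word: politics
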
Empty file added dirbot/__init__.py
Empty file.
8 changes: 8 additions & 0 deletions dirbot/items.py
@@ -0,0 +1,8 @@
from scrapy.item import Item, Field


class Website(Item):
    """A single website entry scraped from a directory page."""

    name = Field()
    description = Field()
    url = Field()
16 changes: 16 additions & 0 deletions dirbot/pipelines.py
@@ -0,0 +1,16 @@
from scrapy.exceptions import DropItem


class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in
    their description."""

    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        # Drop the item as soon as a forbidden word appears in its
        # description; otherwise pass it through unchanged.
        description = unicode(item['description']).lower()
        for word in self.words_to_filter:
            if word in description:
                raise DropItem("Contains forbidden word: %s" % word)
        return item
7 changes: 7 additions & 0 deletions dirbot/settings.py
@@ -0,0 +1,7 @@
# Scrapy settings for dirbot project

SPIDER_MODULES = ['dirbot.spiders']
NEWSPIDER_MODULE = 'dirbot.spiders'
DEFAULT_ITEM_CLASS = 'dirbot.items.Website'

ITEM_PIPELINES = ['dirbot.pipelines.FilterWordsPipeline']
1 change: 1 addition & 0 deletions dirbot/spiders/__init__.py
@@ -0,0 +1 @@
# Place all your Scrapy spiders here
34 changes: 34 additions & 0 deletions dirbot/spiders/dmoz.py
@@ -0,0 +1,34 @@
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dirbot.items import Website


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below define a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul[@class="directory-url"]/li')
        items = []

        # Each <li> holds one listing: an <a> with the name and URL, plus a
        # trailing text node containing the description.
        for site in sites:
            item = Website()
            item['name'] = site.select('a/text()').extract()
            item['url'] = site.select('a/@href').extract()
            item['description'] = site.select('text()').re(r'-\s([^\n]*?)\n')
            items.append(item)

        return items
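
Since the docstring above declares a spider contract, the spider can be
verified against it from the command line (a usage note, not part of the
commit)::

    scrapy check dmoz
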
2 changes: 2 additions & 0 deletions scrapy.cfg
@@ -0,0 +1,2 @@
[settings]
default = dirbot.settings
8 changes: 8 additions & 0 deletions setup.py
@@ -0,0 +1,8 @@
from setuptools import setup, find_packages

setup(
    name='dirbot',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = dirbot.settings']},
)
