`crawl_htmls` - Scrapy project to download articles' text and write readability HTML to the database. If a website is blocked, it falls back to a Tor and Polipo proxy.

`load_fb.py` - script to download the Facebook feed of a page and store all the links in Postgres. Requires a Facebook app; put your credentials in `my_fb.json`.

`load_rss.py` - uses the `feedparser` library to crawl RSS feeds of selected websites. Needs rewriting: port it to `scrapy` and make it handle blocked websites.

`sites_ids.csv` - list of all sites, their feeds, and Facebook pages.

`ra_server.js` - simple express.js Node app. Listens on port 3000 and returns readability HTML (Readability is Mozilla's). Run it before launching `crawl_htmls`. To install dependencies, run `npm install` in this folder; `node ra_server.js` will then launch it on port 3000.

`./psql_engine.txt` - PostgreSQL credentials example to use in the data collection and preprocessing scripts, in case you also want to store data in Postgres.