A scalable system for crawling and retrieving news content.

datamut/whatsnews

whatsnews (what's news)

v0.0.1


Introduction

Whatsnews is a system for crawling content from news websites such as the BBC and The Guardian. It is intended for development use only.

The requirements are based on an interview test. The details of this test are confidential and are not mentioned here. The project name, variable names, and comments do not contain anything related to the test.

Sub modules/systems/services

This project consists of several sub-systems, listed below:

  • index-crawler - crawls article URLs from news websites
  • article-crawler - crawls article content using the URLs fetched above
  • index-builder - builds the index in MongoDB/the search engine for full-text search
  • auth-api - API that grants external users the privilege to use other APIs such as search-api
  • auth-service - centralized authorization service, called by auth-api and other APIs
  • search-api - search interface for users
  • search-service - service in charge of full-text search over news content in the search engine mentioned above
  • query-service - service that processes users' input queries, e.g. expanding them with relevant queries, rewriting queries, etc.
  • rank-service - service that sorts search results, e.g. according to a user's profile/preferences

A flowchart is provided below for a better understanding of this project.
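
As a rough illustration of how the search-side services listed above might fit together, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: the order of the calls (query-service, then search-service, then rank-service), the function names, the sample articles, and the scoring rule are hypothetical and are not taken from the project's code. The real services use MongoDB full-text search rather than the word matching shown here.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the article index produced by index-builder.
ARTICLES: List[Dict[str, str]] = [
    {"url": "http://example.com/a", "title": "Guinness World Records updated", "source": "bbc"},
    {"url": "http://example.com/b", "title": "World football news", "source": "guardian"},
    {"url": "http://example.com/c", "title": "Local weather report", "source": "bbc"},
]


def process_query(raw: str) -> List[str]:
    """query-service stand-in: normalise the raw input into lowercase terms."""
    return raw.lower().split()


def search(terms: List[str]) -> List[Dict[str, str]]:
    """search-service stand-in: keep articles whose title contains any term."""
    hits = []
    for article in ARTICLES:
        words = article["title"].lower().split()
        if any(term in words for term in terms):
            hits.append(article)
    return hits


def rank(hits: List[Dict[str, str]], terms: List[str],
         preferred_source: Optional[str] = None) -> List[Dict[str, str]]:
    """rank-service stand-in: order by matched terms, plus a bonus of 1
    when the article comes from the user's preferred source (assumed rule)."""
    def score(article: Dict[str, str]) -> int:
        words = article["title"].lower().split()
        matches = sum(1 for term in terms if term in words)
        bonus = 1 if preferred_source and article["source"] == preferred_source else 0
        return matches + bonus

    return sorted(hits, key=score, reverse=True)


def handle_search(raw_query: str, preferred_source: Optional[str] = None) -> List[Dict[str, str]]:
    """search-api stand-in: chain the three services for one request."""
    terms = process_query(raw_query)
    return rank(search(terms), terms, preferred_source)


if __name__ == "__main__":
    for article in handle_search("Guinness World Records"):
        print(article["url"], article["title"])
```

Keeping each step as a separate function mirrors the split into separate services: in this sketch, the ranking rule could be changed without touching query processing or retrieval.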

Environment

This version uses MongoDB as both the database and the full-text search engine. The crawlers use Scrapy as the crawling engine, and Kafka is used for scalability. Readability is used to extract the main content of articles. The services and APIs mainly use Python and the Flask framework. The development versions of these tools are listed below:

  • MongoDB 3.2.9
  • Kafka 0.10.0.0 (Scala 2.11 build, i.e. kafka_2.11-0.10.0.0)
  • Python 3.5.2
  • Scrapy 1.1.2
  • Flask 0.11.1

The requirement details of each sub-project can be found in that sub-project's root directory.

API Usage

>>> import requests
>>> requests.get('http://authapi.kc7ctmpd2z.us-west-2.elasticbeanstalk.com/token/ID123456/123456').json()
{'token': 'TK123456', 'expires_in': 86400}
>>> requests.post('http://searchapi.kc7ctmpd2z.us-west-2.elasticbeanstalk.com/search/ID123456/TK123456', data={'query': 'Guinness World Records'}).json()[0]
>>> ...
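
The two calls above can be wrapped in a small helper that reuses the token until it expires. The following is only a sketch under stated assumptions: the class `WhatsnewsClient`, the parameter name `secret` for the second path segment of the token URL, and the injectable `fetch`/`clock` arguments are hypothetical and not part of the project. The URL layout and the `token`/`expires_in` fields are taken from the example above, and `expires_in` is assumed to be in seconds. It uses only the standard library in place of `requests`.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Any, Callable, Optional

AUTH_BASE = "http://authapi.kc7ctmpd2z.us-west-2.elasticbeanstalk.com"
SEARCH_BASE = "http://searchapi.kc7ctmpd2z.us-west-2.elasticbeanstalk.com"


def http_fetch(url: str, data: Optional[dict] = None) -> Any:
    """Default transport: GET when data is None, form-encoded POST otherwise."""
    body = urllib.parse.urlencode(data).encode("ascii") if data is not None else None
    with urllib.request.urlopen(url, data=body) as resp:
        return json.loads(resp.read().decode("utf-8"))


class WhatsnewsClient:
    """Hypothetical convenience wrapper around the auth and search APIs."""

    def __init__(self, user_id: str, secret: str,
                 fetch: Callable[..., Any] = http_fetch,
                 clock: Callable[[], float] = time.time) -> None:
        self.user_id = user_id
        self.secret = secret
        self._fetch = fetch      # injectable so the client can run offline
        self._clock = clock      # injectable so expiry can be simulated
        self._token = None       # type: Optional[str]
        self._expires_at = 0.0

    def token(self) -> str:
        """Return the cached token, requesting a new one once it has expired."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            url = "{}/token/{}/{}".format(AUTH_BASE, self.user_id, self.secret)
            reply = self._fetch(url)
            self._token = reply["token"]
            self._expires_at = now + reply["expires_in"]  # assumed to be seconds
        return self._token

    def search(self, query: str) -> Any:
        """POST the query to the search API using the current token."""
        url = "{}/search/{}/{}".format(SEARCH_BASE, self.user_id, self.token())
        return self._fetch(url, {"query": query})
```

With a live deployment, `WhatsnewsClient('ID123456', '123456').search('Guinness World Records')` would issue the same two requests as the session above, and later searches within the token's lifetime would skip the token request.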

A very basic web interface has been built for easier interaction: Search on What's News

Project flowchart

Flow-Chart
