GitHub - datamut/whatsnews: A scalable system for crawling and retrieving news content.

whatsnews (what's news)

v0.0.1

Introduction

Whatsnews is a system for crawling content from news websites such as the BBC and The Guardian. It is intended for development use only.

The requirements are based on an interview test. The details of this test are confidential and will not be mentioned here. The project name, variable names, and comments do not contain anything related to this test.

Sub modules/systems/services

This project consists of the following subsystems:

index-crawler - crawls article URLs from news websites
article-crawler - crawls article content using the URLs collected by index-crawler
index-builder - builds the index in MongoDB/search engine for full-text search
auth-api - API that grants external users the privilege of using other APIs such as search-api
auth-service - centralized authorization service, called by auth-api and other APIs
search-api - search interface for users
search-service - service in charge of full-text search over news content in the search engine mentioned above
query-service - service that processes users' input queries, e.g. extending them with relevant queries or rewriting them
rank-service - service that sorts search results, e.g. according to the user's profile/preferences

A flowchart will be provided for better understanding of this project.
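
Until the flowchart is available, the crawl side of the list above can be read as a three-stage pipeline. The following is only a minimal sketch under stated assumptions: the function names, the `seed_pages` mapping, and the `fetch` callable are hypothetical, and an in-memory `queue.Queue` stands in for a Kafka topic, which is assumed (not confirmed by this README) to sit between index-crawler and article-crawler.

```python
import queue


def index_crawler(seed_pages):
    """Yield article URLs found on index pages.

    seed_pages is a hypothetical mapping {index_page_url: [article_url, ...]}
    that stands in for real crawling.
    """
    for article_urls in seed_pages.values():
        yield from article_urls


def article_crawler(url, fetch):
    """Fetch one article; fetch is a hypothetical callable url -> text."""
    return {"url": url, "content": fetch(url)}


def run_pipeline(seed_pages, fetch):
    """Run index-crawler -> article-crawler -> index-builder in one process."""
    url_topic = queue.Queue()  # assumption: stand-in for a Kafka topic
    for url in index_crawler(seed_pages):
        url_topic.put(url)

    index = {}  # stand-in for what index-builder would write to MongoDB
    while not url_topic.empty():
        article = article_crawler(url_topic.get(), fetch)
        index[article["url"]] = article["content"]
    return index
```

In a real deployment each stage would presumably run as its own process, which is what makes a message queue useful for scaling the crawlers independently.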

Environment

This version uses MongoDB as both the database and the full-text search engine. The crawlers use Scrapy as the crawling engine, and Kafka is used for scalability. Readability is used to extract the main content of articles. The services and APIs mainly use Python and the Flask framework. The development versions of these tools are listed below:

MongoDB 3.2.9
Kafka 2.11-0.10.0.0
Python 3.5.2
Scrapy 1.1.2
Flask 0.11.1

The requirements of each subproject can be found in its root directory.
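
To illustrate what using MongoDB as the full-text search engine can look like, here is a minimal sketch built on MongoDB text indexes and the `$text` query operator. It is not taken from this repository: the `content` field name and the helper names are assumptions, and `collection` is expected to behave like a `pymongo` collection.

```python
def text_index_spec(field="content"):
    """Index specification for a MongoDB text index on one field.

    "content" is a hypothetical field name; "text" is the index type
    MongoDB uses for full-text indexes.
    """
    return [(field, "text")]


def text_query(terms):
    """Filter document for a full-text search using the $text operator."""
    return {"$text": {"$search": terms}}


def ensure_text_index(collection, field="content"):
    """Create the text index if it does not exist yet."""
    return collection.create_index(text_index_spec(field))


def search_articles(collection, terms, limit=10):
    """Return up to `limit` matching documents, best text score first."""
    score = {"$meta": "textScore"}
    cursor = (
        collection.find(text_query(terms), {"score": score})
        .sort([("score", score)])
        .limit(limit)
    )
    return list(cursor)
```

With a running MongoDB instance, `collection` would come from something like `pymongo.MongoClient()["news"]["articles"]`, where the database and collection names are again only placeholders.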

API Usage

>>> import requests
>>> requests.get('http://authapi.kc7ctmpd2z.us-west-2.elasticbeanstalk.com/token/ID123456/123456').json()
{'token': 'TK123456', 'expires_in': 86400}
>>> requests.post('http://searchapi.kc7ctmpd2z.us-west-2.elasticbeanstalk.com/search/ID123456/TK123456', data={'query': 'Guinness World Records'}).json()[0]
>>> ...
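
The two calls above can be wrapped into a small helper. This is a sketch only, inferred from the URL shapes in the session above: the function names and the `user_id`/`secret` parameter names are assumptions, and `http` is any object exposing `get` and `post` the way the `requests` module or a `requests.Session` does.

```python
def token_url(auth_base, user_id, secret):
    """Build the token endpoint URL: <auth_base>/token/<user_id>/<secret>."""
    return "{}/token/{}/{}".format(auth_base.rstrip("/"), user_id, secret)


def search_url(search_base, user_id, token):
    """Build the search endpoint URL: <search_base>/search/<user_id>/<token>."""
    return "{}/search/{}/{}".format(search_base.rstrip("/"), user_id, token)


def search_news(http, auth_base, search_base, user_id, secret, query):
    """Fetch a token from the auth API, then post the query to the search API.

    Returns the decoded JSON body of the search response.
    """
    token = http.get(token_url(auth_base, user_id, secret)).json()["token"]
    response = http.post(
        search_url(search_base, user_id, token), data={"query": query}
    )
    return response.json()
```

Passing `requests` itself as `http` would reproduce the session above; passing a stub makes the helper testable without network access. The token response also carries `expires_in`, so a longer-lived client may want to cache the token rather than request one per search.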

A very basic web interface has been built for easier interaction: Search on What's News
