#### TODO
- Create a backup mechanism so that crawling can be stopped and restarted
- Create a program that merges the downloaded files onto one computer
- Find a better way of presenting statistics in the Master program
This project crawls all questions on www.zhihu.com as an exercise in Java networking.
The program consists of two parts:
- Master (exactly one)
- Slave (one or more)
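To make the Master/Slave split concrete, here is a minimal sketch of how the two parts might exchange a question id over TCP. The port choice, message format, and class name are illustrative assumptions, not the project's actual protocol.

```java
import java.io.*;
import java.net.*;

// Hypothetical sketch: a Master sends one question id to a Slave over a socket.
public class IdExchangeSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket master = new ServerSocket(0)) {  // Master listens on a free port
            int port = master.getLocalPort();

            // The "Slave" connects and reads one id, line by line.
            Thread slave = new Thread(() -> {
                try (Socket s = new Socket("localhost", port);
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()))) {
                    System.out.println("slave got id " + in.readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            slave.start();

            // The "Master" accepts the connection and sends one question id.
            try (Socket conn = master.accept();
                 PrintWriter out = new PrintWriter(conn.getOutputStream(), true)) {
                out.println(123456);  // an example id, not a real question
            }
            slave.join();
        }
    }
}
```

A real deployment would keep the connection open and stream many ids in both directions; this sketch only shows the handshake shape.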
#### Master
- Variables
  - UniverseSet: Set. An integer set containing the question ids (equivalently, the URLs) of all questions added on zhihu.com; ids reported by the Slaves are checked against it so each question is crawled only once.
  - populatingQueue: Queue. Holds the ids waiting to be distributed to the Slaves.
- Threads
  - ExtractedIdAcceptingService. One thread that receives ids from the Slaves and puts them into populatingQueue.
  - PopulatingService. Another thread that distributes the queued ids to the Slaves, so that each Slave has URLs to download and the workload stays balanced.
- Files
  - ExtractedIdAcceptingService.java
  - PopulatingService.java
  - Master.java
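The Master's two services can be sketched around its two shared structures. The class and method names below are assumptions for illustration; only UniverseSet and populatingQueue come from the description above.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the Master's shared state and service logic.
public class MasterSketch {
    // UniverseSet: every question id ever seen, for deduplication.
    static final Set<Integer> universeSet = ConcurrentHashMap.newKeySet();
    // populatingQueue: ids waiting to be handed out to Slaves.
    static final BlockingQueue<Integer> populatingQueue = new LinkedBlockingQueue<>();

    // ExtractedIdAcceptingService: called for each id a Slave reports.
    static void acceptExtractedId(int id) {
        if (universeSet.add(id)) {      // true only for previously unseen ids
            populatingQueue.offer(id);  // schedule the new id for download
        }
    }

    // PopulatingService: hand the next pending id to a Slave
    // (the real service would block on take(); poll() keeps this sketch simple).
    static Integer nextIdForSlave() {
        return populatingQueue.poll(); // null when nothing is pending
    }

    public static void main(String[] args) {
        acceptExtractedId(42);
        acceptExtractedId(42);                  // duplicate, ignored
        acceptExtractedId(7);
        System.out.println(nextIdForSlave());   // 42
        System.out.println(nextIdForSlave());   // 7
        System.out.println(universeSet.size()); // 2
    }
}
```

Using a concurrent set plus a blocking queue lets the two services run on separate threads, as described above, without explicit locking.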
#### Slave
- Variables
  - pendingIds: Queue. Holds the ids received from the Master that are waiting to be downloaded.
- Threads
  - DownloaderService. Downloads the question pages for the ids in pendingIds.
  - HTMLParser. Parses each downloaded page and extracts newly found question ids to report back to the Master.
  - IdAcceptingService. Receives ids from the Master and puts them into pendingIds.
- Files
  - DownloadService.java
  - HTMLSoup.java
  - IdAcceptService.java
  - Slave.java
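The parsing step on the Slave side can be sketched as follows. The link pattern `/question/<digits>` is an assumption about zhihu.com's page markup, and the class name is hypothetical; the actual downloading (DownloaderService) and the socket back to the Master (IdAcceptingService) are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Slave's HTML-parsing step.
public class SlaveSketch {
    // Assumed shape of question links in a downloaded page.
    private static final Pattern QUESTION_LINK =
            Pattern.compile("/question/(\\d+)");

    // HTMLParser: pull every linked question id out of a downloaded page.
    static List<Integer> extractQuestionIds(String html) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = QUESTION_LINK.matcher(html);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    public static void main(String[] args) {
        String page = "<a href=\"/question/123\">q</a> <a href=\"/question/456\">q</a>";
        System.out.println(extractQuestionIds(page)); // [123, 456]
    }
}
```

The extracted ids would then be sent back to the Master, which filters them through UniverseSet before queueing them for download.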