#### TODO
- Create a backup mechanism so that crawling can be stopped and restarted
- Create a program that merges the downloaded files onto one computer
- Find a better way of presenting statistics in the Master program
This project crawls all questions on www.zhihu.com as an exercise in Java networking.
The program consists of two parts:
- Master (exactly one)
- Slave (one or more)
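To make the Master/Slave split concrete, here is a minimal sketch of how the two parts might exchange a question id over TCP. The port choice, message format, and class name are illustrative assumptions, not the project's actual protocol.

```java
import java.io.*;
import java.net.*;

// Hypothetical sketch: a Master sends one question id to a Slave over a socket.
public class IdExchangeSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket master = new ServerSocket(0)) {  // Master listens on a free port
            int port = master.getLocalPort();

            // The "Slave" connects and reads one id, line by line.
            Thread slave = new Thread(() -> {
                try (Socket s = new Socket("localhost", port);
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()))) {
                    System.out.println("slave got id " + in.readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            slave.start();

            // The "Master" accepts the connection and sends one question id.
            try (Socket conn = master.accept();
                 PrintWriter out = new PrintWriter(conn.getOutputStream(), true)) {
                out.println(123456);  // an example id, not a real question
            }
            slave.join();
        }
    }
}
```

A real deployment would keep the connection open and stream many ids in both directions; this sketch only shows the handshake shape.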
#### Master
- Variables
  - UniverseSet: Set. An integer set containing the question ids (equivalently, the URLs) of all questions added on zhihu.com; ids reported by the Slaves are checked against it so each question is crawled only once.
  - populatingQueue: Queue. Holds the ids waiting to be distributed to the Slaves.
- Threads
  - ExtractedIdAcceptingService. One thread that receives ids from the Slaves and puts them into populatingQueue.
  - PopulatingService. Another thread that distributes the queued ids to the Slaves, so that each Slave has URLs to download and the workload stays balanced.
- Files
  - ExtractedIdAcceptingService.java
  - PopulatingService.java
  - Master.java
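The Master's two services can be sketched around its two shared structures. The class and method names below are assumptions for illustration; only UniverseSet and populatingQueue come from the description above.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the Master's shared state and service logic.
public class MasterSketch {
    // UniverseSet: every question id ever seen, for deduplication.
    static final Set<Integer> universeSet = ConcurrentHashMap.newKeySet();
    // populatingQueue: ids waiting to be handed out to Slaves.
    static final BlockingQueue<Integer> populatingQueue = new LinkedBlockingQueue<>();

    // ExtractedIdAcceptingService: called for each id a Slave reports.
    static void acceptExtractedId(int id) {
        if (universeSet.add(id)) {      // true only for previously unseen ids
            populatingQueue.offer(id);  // schedule the new id for download
        }
    }

    // PopulatingService: hand the next pending id to a Slave
    // (the real service would block on take(); poll() keeps this sketch simple).
    static Integer nextIdForSlave() {
        return populatingQueue.poll(); // null when nothing is pending
    }

    public static void main(String[] args) {
        acceptExtractedId(42);
        acceptExtractedId(42);                  // duplicate, ignored
        acceptExtractedId(7);
        System.out.println(nextIdForSlave());   // 42
        System.out.println(nextIdForSlave());   // 7
        System.out.println(universeSet.size()); // 2
    }
}
```

Using a concurrent set plus a blocking queue lets the two services run on separate threads, as described above, without explicit locking.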
#### Slave
- Variables
  - pendingIds: Queue. Holds the ids received from the Master that are waiting to be downloaded.
- Threads
  - DownloaderService. Downloads the question pages for the ids in pendingIds.
  - HTMLParser. Parses each downloaded page and extracts newly found question ids to report back to the Master.
  - IdAcceptingService. Receives ids from the Master and puts them into pendingIds.
- Files
  - DownloadService.java
  - HTMLSoup.java
  - IdAcceptService.java
  - Slave.java
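The parsing step on the Slave side can be sketched as follows. The link pattern `/question/<digits>` is an assumption about zhihu.com's page markup, and the class name is hypothetical; the actual downloading (DownloaderService) and the socket back to the Master (IdAcceptingService) are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Slave's HTML-parsing step.
public class SlaveSketch {
    // Assumed shape of question links in a downloaded page.
    private static final Pattern QUESTION_LINK =
            Pattern.compile("/question/(\\d+)");

    // HTMLParser: pull every linked question id out of a downloaded page.
    static List<Integer> extractQuestionIds(String html) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = QUESTION_LINK.matcher(html);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    public static void main(String[] args) {
        String page = "<a href=\"/question/123\">q</a> <a href=\"/question/456\">q</a>";
        System.out.println(extractQuestionIds(page)); // [123, 456]
    }
}
```

The extracted ids would then be sent back to the Master, which filters them through UniverseSet before queueing them for download.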