Problem:
Design a system to crawl and copy all of Wikipedia using a distributed network of machines.
More specifically, suppose your server has access to a set of client machines. Your client machines can execute code you have written to access Wikipedia pages, download and parse their data, and write the results to a database.
Some questions you may want to consider as part of your solution are:
* How will you reach as many pages as possible?
* How can you keep track of pages that have already been visited?
* How will you deal with your client machines being blacklisted?
* How can you update your database when Wikipedia pages are added or updated?
Solution:
- Use the main page to seed a queue of links that is processed by the server.
- The server's processing is lightweight: it monitors the queue and sends each link to a client.
- The client downloads and parses the page, stores the result in the database, and adds newly found URLs back to the queue if the page's last-update date is greater than the date recorded for that page in a distributed cache (see the sketch below).
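A minimal, single-process sketch of this flow, using an in-process `queue.Queue` and threads as stand-ins for the distributed queue and the client machines; `store_in_database`, `extract_links`, and `should_fetch` are placeholders that are filled in by the sketches further down:

```python
import queue
import threading
import urllib.request

SEED = "https://en.wikipedia.org/wiki/Main_Page"

url_queue = queue.Queue()   # stand-in for a distributed queue (e.g. SQS or a Redis list)
url_queue.put(SEED)

def store_in_database(url, html):
    """Placeholder for the database write performed by each client."""
    pass

def extract_links(html):
    """Placeholder; see the link-extraction sketch further down."""
    return []

def should_fetch(url):
    """Placeholder; see the distributed-cache sketch further down."""
    return False

def client_worker():
    """Plays the role of one client machine: download, parse, store, re-queue."""
    while True:
        url = url_queue.get()          # the server's dispatch, modeled as a pull
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            store_in_database(url, html)
            for link in extract_links(html):
                if should_fetch(link):     # only re-queue unseen or stale pages
                    url_queue.put(link)
        except Exception:
            pass                           # a real crawler would log and retry here
        finally:
            url_queue.task_done()

# Four daemon threads stand in for four client machines.
for _ in range(4):
    threading.Thread(target=client_worker, daemon=True).start()

url_queue.join()   # returns once every queued link has been processed
```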
- How will you reach as many pages as possible?
  - Parse all URLs on each page and add them back to the queue (see the link-extraction sketch below).
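One way to pull every article link off a downloaded page, sketched with the standard-library `html.parser`; the `/wiki/` prefix filter and the colon check for namespaced pages (e.g. `File:`, `Help:`) are assumptions about Wikipedia's URL layout:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class WikiLinkParser(HTMLParser):
    """Collects href values from <a> tags, keeping only article links."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("/wiki/"):
                if ":" not in value:           # skip namespaced pages
                    self.links.add(urljoin(self.base_url, value))

def extract_links(html, base_url="https://en.wikipedia.org"):
    parser = WikiLinkParser(base_url)
    parser.feed(html)
    return parser.links
```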
- How can you keep track of pages that have already been visited?
  - Keep a distributed cache keyed by URL that records when each page was stored; a URL is fetched again only if its last-update date is newer than the cached date (see the sketch below).
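A sketch of the visited check, assuming a Redis instance as the distributed cache (the redis-py client and the `cache.internal` host name are assumptions, not part of the original solution):

```python
import time
import redis  # assumption: redis-py client, with Redis as the shared cache

# Hypothetical cache host reachable from every client machine.
cache = redis.Redis(host="cache.internal", port=6379, decode_responses=True)

def should_fetch(url, page_last_updated=None):
    """True if the page was never stored, or if its last-update timestamp
    (seconds since the epoch) is newer than the timestamp we stored for it."""
    stored = cache.get(url)
    if stored is None:
        return True                  # never visited
    if page_last_updated is None:
        return False                 # visited, and no newer revision is known
    return page_last_updated > float(stored)

def mark_stored(url):
    """Record when the page was written to the database."""
    cache.set(url, time.time())
```

In the worker sketch above, a client would call `mark_stored(url)` right after `store_in_database(url, html)` and pass the page's last-modified date (for example, taken from the HTTP `Last-Modified` header) as `page_last_updated`.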
- How will you deal with your client machines being blacklisted?
  - Provision a new Elastic Compute Cloud instance once x requests get an error-code response (see the sketch below).
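A sketch of that guard, assuming each client counts consecutive error responses and calls a provisioning hook once the threshold x is crossed; `provision_replacement` is a hypothetical callback that would wrap the cloud provider's API (e.g. launching a new EC2 instance):

```python
import threading

class BlacklistMonitor:
    """Counts consecutive error responses; once the count reaches the
    threshold, assume the client's IP has been blacklisted and request
    a replacement instance via the provisioning callback."""

    def __init__(self, threshold, provision_replacement):
        self.threshold = threshold
        self.provision_replacement = provision_replacement
        self.consecutive_errors = 0
        self.lock = threading.Lock()

    def record_response(self, status_code):
        with self.lock:
            if 200 <= status_code < 300:
                self.consecutive_errors = 0
            else:
                self.consecutive_errors += 1
                if self.consecutive_errors >= self.threshold:
                    self.consecutive_errors = 0
                    self.provision_replacement()

# Example: treat 50 consecutive errors as "blacklisted".
monitor = BlacklistMonitor(
    threshold=50,
    provision_replacement=lambda: print("provision a new instance"),
)
```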
- How can you update your database when Wikipedia pages are added or updated?
  - Handled by the distributed cache and the last-update-time check: when a crawl encounters a page again, it is re-downloaded and re-stored only if its last-update date is newer than the cached date, and newly added pages are fetched because they are not in the cache at all.