
web-crawler

A web crawler that can crawl web pages, downloading and parsing data from them. It is currently specialized for Wikipedia.

The crawler starts from a random Wikipedia article and follows links until the path reaches the PHILOSOPHY page, terminating early if it encounters a loop or a dead end. This process is repeated n times (10 by default), and the mean and median of the resulting path lengths are reported.
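
A minimal, self-contained sketch of that termination logic, using a toy in-memory link graph in place of live Wikipedia pages (the names here are illustrative, not taken from this repository):

```python
# Sketch of the crawl loop: follow links until PHILOSOPHY, a loop,
# or a dead end; repeat and report mean/median path lengths.
from statistics import mean, median

TARGET = "Philosophy"

def path_length(start, next_link, max_hops=100):
    """Follow links from start; return hop count to TARGET,
    or None on a loop or a dead end."""
    visited, page = set(), start
    for hops in range(max_hops):
        if page == TARGET:
            return hops
        if page in visited:
            return None            # loop: page already seen on this path
        visited.add(page)
        page = next_link(page)
        if page is None:
            return None            # dead end: no valid outgoing link
    return None

# Toy graph standing in for "first valid link on each page".
graph = {"Cat": "Animal", "Animal": "Biology", "Biology": "Science",
         "Science": "Knowledge", "Knowledge": "Philosophy",
         "A": "B", "B": "A"}      # A <-> B is a loop

lengths = [path_length(p, graph.get) for p in ("Cat", "Biology", "A")]
valid = [n for n in lengths if n is not None]
print(f"mean={mean(valid):.1f} median={median(valid)} failures={lengths.count(None)}")
```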

Each URL is represented as a Hyperlink object, which stores the path length from its page to PHILOSOPHY. This lets the path-length computation for a new random page reuse the lengths already computed for previously visited pages.
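
A rough sketch of this caching idea; the Hyperlink field names and the record_path helper are illustrative assumptions, not the repository's actual API:

```python
# When a walk finishes, every page on its path gets a known distance
# to PHILOSOPHY; later walks can stop as soon as they hit a cached page.
from dataclasses import dataclass

@dataclass
class Hyperlink:
    url: str
    length_to_philosophy: int   # hops from this page to PHILOSOPHY

cache: dict[str, Hyperlink] = {}

def record_path(path: list[str], tail_length: int) -> None:
    """Cache lengths for every page on a finished path.

    tail_length is the known distance of the last page in `path`:
    0 if it is PHILOSOPHY itself, or a cached value if the walk
    stopped early on an already-visited page."""
    for i, url in enumerate(path):
        dist = tail_length + (len(path) - 1 - i)
        cache[url] = Hyperlink(url, dist)

# Example: a walk that reached PHILOSOPHY directly.
record_path(["Cat", "Animal", "Philosophy"], tail_length=0)
# A later walk hits "Animal", which is cached, so it stops early:
record_path(["Dog", "Animal"], tail_length=cache["Animal"].length_to_philosophy)
print(cache["Dog"].length_to_philosophy)   # 2
```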

Candidate hyperlinks are validated before being followed: links that are italicized or appear inside parentheses are invalid. Invalid candidates are discarded and the next candidate is evaluated.
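
One common way to implement this validation is to track parenthesis nesting depth in the paragraph text and skip links that have italic ancestors. The sketch below assumes that approach and uses beautifulsoup4; it is not the repository's exact code:

```python
# Validate candidate links: skip any link inside parentheses or
# inside an <i>/<em> element; return the first link that passes.
from bs4 import BeautifulSoup

def first_valid_link(html: str):
    """Return the href of the first valid link, or None on a dead end."""
    soup = BeautifulSoup(html, "html.parser")
    for para in soup.find_all("p"):
        depth = 0                       # current parenthesis nesting depth
        for node in para.descendants:
            if isinstance(node, str):   # text node: update nesting depth
                depth += node.count("(") - node.count(")")
            elif node.name == "a" and node.get("href"):
                if depth > 0:
                    continue            # invalid: inside parentheses
                if node.find_parent(["i", "em"]):
                    continue            # invalid: italicized
                return node["href"]
    return None                         # dead end: no valid candidate

html = ('<p>A <i><a href="/skip1">cat</a></i> (see <a href="/skip2">x</a>) '
        'is an <a href="/wiki/Animal">animal</a>.</p>')
print(first_valid_link(html))           # /wiki/Animal
```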
