
web-crawler

A web crawler that can crawl web pages, downloading and parsing data from them. It is currently specialized for Wikipedia.

The crawler starts from a random Wikipedia article and follows links until the path reaches the PHILOSOPHY page, terminating early if it encounters a loop or a dead end. This process is repeated n times (10 by default), and the mean and median of the resulting path lengths are reported.
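
A minimal, self-contained sketch of that termination logic, using a toy in-memory link graph in place of live Wikipedia pages (the names here are illustrative, not taken from this repository):

```python
# Sketch of the crawl loop: follow links until PHILOSOPHY, a loop,
# or a dead end; repeat and report mean/median path lengths.
from statistics import mean, median

TARGET = "Philosophy"

def path_length(start, next_link, max_hops=100):
    """Follow links from start; return hop count to TARGET,
    or None on a loop or a dead end."""
    visited, page = set(), start
    for hops in range(max_hops):
        if page == TARGET:
            return hops
        if page in visited:
            return None            # loop: page already seen on this path
        visited.add(page)
        page = next_link(page)
        if page is None:
            return None            # dead end: no valid outgoing link
    return None

# Toy graph standing in for "first valid link on each page".
graph = {"Cat": "Animal", "Animal": "Biology", "Biology": "Science",
         "Science": "Knowledge", "Knowledge": "Philosophy",
         "A": "B", "B": "A"}      # A <-> B is a loop

lengths = [path_length(p, graph.get) for p in ("Cat", "Biology", "A")]
valid = [n for n in lengths if n is not None]
print(f"mean={mean(valid):.1f} median={median(valid)} failures={lengths.count(None)}")
```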

Each URL is represented as a Hyperlink object, which stores the path length from its page to PHILOSOPHY. This lets the path-length computation for a new random page reuse the lengths already computed for previously visited pages.
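
A rough sketch of this caching idea; the Hyperlink field names and the record_path helper are illustrative assumptions, not the repository's actual API:

```python
# When a walk finishes, every page on its path gets a known distance
# to PHILOSOPHY; later walks can stop as soon as they hit a cached page.
from dataclasses import dataclass

@dataclass
class Hyperlink:
    url: str
    length_to_philosophy: int   # hops from this page to PHILOSOPHY

cache: dict[str, Hyperlink] = {}

def record_path(path: list[str], tail_length: int) -> None:
    """Cache lengths for every page on a finished path.

    tail_length is the known distance of the last page in `path`:
    0 if it is PHILOSOPHY itself, or a cached value if the walk
    stopped early on an already-visited page."""
    for i, url in enumerate(path):
        dist = tail_length + (len(path) - 1 - i)
        cache[url] = Hyperlink(url, dist)

# Example: a walk that reached PHILOSOPHY directly.
record_path(["Cat", "Animal", "Philosophy"], tail_length=0)
# A later walk hits "Animal", which is cached, so it stops early:
record_path(["Dog", "Animal"], tail_length=cache["Animal"].length_to_philosophy)
print(cache["Dog"].length_to_philosophy)   # 2
```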

Candidate hyperlinks are validated before being followed: links that are italicized or appear inside parentheses are invalid. Invalid candidates are discarded and the next candidate is evaluated.
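
One common way to implement this validation is to track parenthesis nesting depth in the paragraph text and skip links that have italic ancestors. The sketch below assumes that approach and uses beautifulsoup4; it is not the repository's exact code:

```python
# Validate candidate links: skip any link inside parentheses or
# inside an <i>/<em> element; return the first link that passes.
from bs4 import BeautifulSoup

def first_valid_link(html: str):
    """Return the href of the first valid link, or None on a dead end."""
    soup = BeautifulSoup(html, "html.parser")
    for para in soup.find_all("p"):
        depth = 0                       # current parenthesis nesting depth
        for node in para.descendants:
            if isinstance(node, str):   # text node: update nesting depth
                depth += node.count("(") - node.count(")")
            elif node.name == "a" and node.get("href"):
                if depth > 0:
                    continue            # invalid: inside parentheses
                if node.find_parent(["i", "em"]):
                    continue            # invalid: italicized
                return node["href"]
    return None                         # dead end: no valid candidate

html = ('<p>A <i><a href="/skip1">cat</a></i> (see <a href="/skip2">x</a>) '
        'is an <a href="/wiki/Animal">animal</a>.</p>')
print(first_valid_link(html))           # /wiki/Animal
```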
