Simple web crawler
Problem Statement :- Design a simple web crawler. The crawler should be limited to one domain. Given a starting URL – say http://www.prudential.co.uk/ – it should visit all pages within that domain, but not follow links to external sites such as Google or Twitter.
The output should be a simple structured site map (this does not need to be a traditional XML sitemap - just some output reflecting what the crawler has discovered), showing, for each page, links to other pages under the same domain, links to external URLs, and links to static content such as images.
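To make the expected behaviour concrete, here is a minimal sketch of such a crawl loop, assuming jsoup for fetching and parsing; the class and variable names are illustrative and do not necessarily match the code in this repository:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative crawl loop; not the actual repository code.
    public class CrawlSketch {
        public static void main(String[] args) throws Exception {
            String start = args[0];
            String domain = new URI(start).getHost();

            Set<String> visited = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);

            while (!queue.isEmpty()) {
                String url = queue.poll();
                if (!visited.add(url)) continue; // already crawled

                Document doc;
                try {
                    doc = Jsoup.connect(url).get();
                } catch (Exception e) {
                    continue; // unreachable or non-HTML resource, skip
                }
                System.out.println("Page: " + url);

                // Static content on the page (images, in this sketch).
                for (Element img : doc.select("img[src]")) {
                    System.out.println("  static:   " + img.absUrl("src"));
                }

                // Links: print all of them, but only follow same-domain ones.
                for (Element link : doc.select("a[href]")) {
                    String href = link.absUrl("href");
                    if (href.isEmpty()) continue;
                    try {
                        if (domain.equals(new URI(href).getHost())) {
                            System.out.println("  internal: " + href);
                            queue.add(href);
                        } else {
                            System.out.println("  external: " + href);
                        }
                    } catch (Exception e) {
                        // malformed href, ignore
                    }
                }
            }
        }
    }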
How to run :-
- Clone this repository to your local machine (git clone ...)
- Build with Maven: mvn clean install
- java -jar target/crawler-1.0-SNAPSHOT.jar "http://www.prudential.co.uk/"

Alternatively, if you don't want to use the command line, open the project in any IDE, run the Crawler class, and provide the argument "http://www.prudential.co.uk/" in the run configuration.
Improvement Areas:-
- Add a maximum crawl depth per link; without one the crawler can keep crawling indefinitely (see the frontier sketch after this list).
- Some pages generate links dynamically, and in some cases a link redirects back to the same page, causing a loop that never finishes; normalising URLs before the visited check helps here.
- Add retry logic for resources that are temporarily unavailable (see the retry sketch after this list).
- Deduplicate URLs that have already been processed.
- Scale out by packaging the crawler as a Docker image and deploying it in a cluster.
- Tighten the domain check so out-of-domain resources are never fetched.
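For the depth limit, loop protection, and deduplication points, one possible frontier structure is sketched below; the names and limits are assumptions (Java 16+ for the record syntax), not repository code:

    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative depth-limited frontier; names and limits are assumptions.
    public class DepthLimitedFrontier {

        // Pairs a URL with its distance from the start page.
        public record Task(String url, int depth) {}

        private static final int MAX_DEPTH = 5; // assumed cap, tune as needed

        private final Set<String> visited = new HashSet<>();
        private final Deque<Task> queue = new ArrayDeque<>();

        // Normalising before the visited-set check collapses variants of the
        // same page (fragments, query strings such as session ids), which is
        // what breaks the redirect loops described above. Note this also
        // drops meaningful query parameters, a deliberate trade-off here.
        private static String normalize(String url) throws Exception {
            URI u = new URI(url);
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            return new URI(u.getScheme(), u.getHost(), path, null).toString();
        }

        public void offer(String url, int depth) throws Exception {
            if (depth > MAX_DEPTH) return;      // depth cap: crawl terminates
            if (visited.add(normalize(url))) {  // dedup: each page queued once
                queue.add(new Task(url, depth));
            }
        }

        public Task next() {
            return queue.poll(); // null once the crawl is finished
        }
    }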
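For the retry point, a sketch of retrying with exponential back-off; the helper name, signature, and back-off values are assumptions:

    import java.io.IOException;

    // Illustrative retry helper; the name, signature, and back-off values
    // are assumptions, not part of this repository.
    public final class Retry {

        @FunctionalInterface
        public interface FetchAction<T> {
            T run() throws IOException;
        }

        public static <T> T withRetries(FetchAction<T> action, int maxAttempts)
                throws IOException, InterruptedException {
            IOException last = null;
            long backoffMillis = 500; // initial wait before the first retry
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return action.run();
                } catch (IOException e) {
                    last = e;
                    if (attempt < maxAttempts) {   // transient failure: wait, retry
                        Thread.sleep(backoffMillis);
                        backoffMillis *= 2;        // exponential back-off
                    }
                }
            }
            throw last; // still unavailable after all attempts
        }
    }

In the crawl loop it would wrap the fetch, for example:

    Document doc = Retry.withRetries(() -> Jsoup.connect(url).get(), 3);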