
This repo contains the Java code for a multi-threaded web crawler that stores the scraped URLs in an SQLite database. It runs multiple spiders to crawl multiple URLs concurrently and is my submission for the Final Project in CS-GY-9053: Introduction to Java.


MahatiMadhira/MultiThreadedWebCrawler

Multi-Threaded Web Crawler: Final Project for CS-GY-9053, Introduction to Java

This repo contains the Java code for a multi-threaded web crawler that stores the scraped URLs in an SQLite database. It runs multiple spiders to crawl multiple URLs concurrently and uses Jsoup to parse the scraped pages.
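
To illustrate the idea of several spiders working at once, here is a minimal sketch using only the Java standard library. It is not the code in this repo: the class name `ConcurrentCrawlSketch`, the `crawl` method, and the `fetchLinks` function are hypothetical, and the sketch assumes that each discovered URL is recorded together with the page it was found on. Link extraction is passed in as a function so the example runs without network access or extra jars.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ConcurrentCrawlSketch {

    // Crawls level by level from the seed. Every URL in the current level is
    // fetched on a worker thread; results are merged on the calling thread.
    // Returns a map of discovered URL -> URL of the page it was first found on
    // (the seed maps to an empty string).
    static Map<String, String> crawl(String seed,
                                     Function<String, List<String>> fetchLinks,
                                     int threads,
                                     int maxPages) throws Exception {
        Map<String, String> parentOf = new LinkedHashMap<>();
        parentOf.put(seed, "");
        List<String> frontier = List.of(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            while (!frontier.isEmpty() && parentOf.size() < maxPages) {
                // One task ("spider") per URL in the current level.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetchLinks.apply(url));
                }
                // invokeAll blocks until all tasks finish and keeps task order.
                List<Future<List<String>>> results = pool.invokeAll(tasks);
                List<String> next = new ArrayList<>();
                for (int i = 0; i < frontier.size(); i++) {
                    String parent = frontier.get(i);
                    for (String link : results.get(i).get()) {
                        if (parentOf.size() >= maxPages) {
                            break;
                        }
                        // Only URLs seen for the first time go to the next level.
                        if (parentOf.putIfAbsent(link, parent) == null) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return parentOf;
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for the web: page -> links found on that page.
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"));
        Map<String, String> result =
                crawl("a", url -> graph.getOrDefault(url, List.of()), 4, 100);
        System.out.println(result); // prints {a=, b=a, c=a, d=b}
    }
}
```

In a real crawl, `fetchLinks` would presumably wrap a Jsoup call, for example `Jsoup.connect(url).get()` followed by `select("a[href]")` and `absUrl("href")` on each element. Merging results on a single thread keeps the visited set free of races without extra locking.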

How to Execute the code

  1. Download the code as a .zip or .tar file and extract it into your IDE workspace (I used Eclipse).
  2. Add the Jsoup and SQLite .jar files to your project classpath so you can import the corresponding libraries.
  3. Open webCrawler/src/webCrawler/MainTest.java and run it to start the web crawling process.
  4. If you don't have SQLite installed, follow the installation guide for your OS. Then open the crawler.db file to access the database. The scraped URLs are stored in the crawler table, which has 3 columns (id, URL, PURL) that describe the data being scraped.

Future Work

  1. I am working on an application that takes a user-supplied seed URL and the maximum depth to crawl. Once the crawler bots finish, the results will be available as a downloadable CSV file so users can use the dataset for their own purposes.
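
As a rough sketch of how that planned feature could look (it is not implemented in this repo), the example below does a breadth-first crawl limited by depth and writes the rows as CSV text. The names `DepthLimitedCsvSketch`, `crawl`, `toCsv`, and `fetchLinks` are hypothetical. The CSV header reuses the id, URL, PURL column names from the crawler table, and the sketch assumes PURL holds the page a URL was found on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class DepthLimitedCsvSketch {

    // One queued page: its URL, the page it was found on, and its depth.
    static final class Node {
        final String url;
        final String parent;
        final int depth;

        Node(String url, String parent, int depth) {
            this.url = url;
            this.parent = parent;
            this.depth = depth;
        }
    }

    // Breadth-first crawl from the seed (depth 0). Links on a page are only
    // followed while that page's depth is below maxDepth.
    // Each returned row is {id, url, parentUrl}.
    static List<String[]> crawl(String seed, int maxDepth,
                                Function<String, List<String>> fetchLinks) {
        List<String[]> rows = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Node> queue = new ArrayDeque<>();
        seen.add(seed);
        queue.add(new Node(seed, "", 0));
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            rows.add(new String[] {
                    String.valueOf(rows.size() + 1), node.url, node.parent});
            if (node.depth < maxDepth) {
                for (String link : fetchLinks.apply(node.url)) {
                    if (seen.add(link)) { // true only the first time a URL is seen
                        queue.add(new Node(link, node.url, node.depth + 1));
                    }
                }
            }
        }
        return rows;
    }

    // Quotes a field when it contains a comma, quote, or newline.
    static String csvField(String s) {
        if (s.contains(",") || s.contains("\"") || s.contains("\n")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    static String toCsv(List<String[]> rows) {
        StringBuilder sb = new StringBuilder("id,URL,PURL\n");
        for (String[] r : rows) {
            sb.append(csvField(r[0])).append(',')
              .append(csvField(r[1])).append(',')
              .append(csvField(r[2])).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In-memory stand-in for the web: page -> links found on that page.
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "d", List.of("e"));
        List<String[]> rows =
                crawl("a", 1, url -> graph.getOrDefault(url, List.of()));
        System.out.print(toCsv(rows));
    }
}
```

With a maximum depth of 1, only the seed and the pages it links to directly are recorded, so the output has three data rows for this toy graph.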

Results

Screen Shot 2022-12-18 at 3 12 35 AM

Screen Shot 2022-12-18 at 3 12 41 AM
