This project is a simple web scraping application built with playwright-python and plain Python. It extracts data from multiple web pages, given their URLs, and saves the extracted content to plain text files.
- Extract data from multiple webpages
- Loop over all links in batches, scraping each batch in parallel
- Save the extracted content in plain text files with custom file names
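The batched, parallel loop listed above might look roughly like the following. This is a minimal sketch and not the project's actual code: fetch_text is a hypothetical stand-in for a Playwright page visit, and batch_size and scrape_all are assumed names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


async def fetch_text(url: str) -> str:
    """Hypothetical placeholder: a real scraper would load the page with Playwright here."""
    await asyncio.sleep(0)  # stand-in for the network wait
    return f"content of {url}"


async def scrape_all(
    urls: list[str],
    batch_size: int,
    fetch: Callable[[str], Awaitable[str]] = fetch_text,
) -> dict[str, str]:
    """Scrape URLs batch by batch; URLs inside one batch run concurrently."""
    results: dict[str, str] = {}
    for batch in batched(urls, batch_size):
        # gather() awaits the whole batch concurrently and keeps input order
        texts = await asyncio.gather(*(fetch(url) for url in batch))
        results.update(zip(batch, texts))
    return results


if __name__ == "__main__":
    demo_urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    pages = asyncio.run(scrape_all(demo_urls, batch_size=2))
    for url, text in pages.items():
        print(url, "->", text)
```

Processing one batch at a time caps how many pages are open at once, which keeps memory use and load on the target site bounded. The batch size is the knob to tune.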
To get started, make sure you have a Python 3.12.x virtual environment set up on your system.
Then, follow these steps:
- Clone this repository to your local machine using the following command:
git clone https://github.com/schartz/scraperdesu.git
- Navigate to the project directory:
cd scraperdesu
- Install required dependencies by running:
pip install -r requirements.txt
- Adjust your environment settings: copy env.sample to a new file named .env in the root of the project directory, then edit its values as needed.
- Run the script from the root of the project directory by executing:
python main.py
- View the output to see the scraped data and the paths of the saved files.
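For orientation, saving scraped text under a custom file name and reporting the resulting path could be sketched as below. The helper names (safe_file_name, save_text) and the output directory are hypothetical and are not taken from this repository.

```python
import re
from pathlib import Path


def safe_file_name(name: str) -> str:
    """Replace runs of characters that are awkward in file names with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return cleaned or "page"  # fall back to a default when nothing usable remains


def save_text(content: str, name: str, out_dir: Path) -> Path:
    """Write `content` to `<out_dir>/<name>.txt` and return the file's path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{safe_file_name(name)}.txt"
    path.write_text(content, encoding="utf-8")
    return path


if __name__ == "__main__":
    # "output" is an assumed directory name used only for this demo
    saved = save_text("Example page text", "my page/1", Path("output"))
    print(f"Saved to {saved}")
```

Returning the path from save_text lets the caller print or log exactly where each page ended up.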