This script scrapes web pages and generates a report on the scraped content in HTML, PDF, or CSV format.
- Scrapes web pages using asynchronous requests for better performance
- Generates reports in HTML, PDF, or CSV format
- Extracts internal and external links, images, and HTML tag counts
- Draws user agent headers from a configurable list (see -hf below)
- Supports a configurable number of concurrent requests (see the sketch after this list)
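To make the scraping behavior above concrete, here is a minimal sketch of a concurrent fetch-and-parse loop, assuming aiohttp and beautifulsoup4. The names fetch_page, analyze, scrape_all, and USER_AGENTS are illustrative, not taken from the script itself:

```python
import asyncio
import random
from collections import Counter
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup

# In the real script these would be loaded from the user agents file (-hf).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

async def fetch_page(session, semaphore, url):
    # The semaphore caps the number of in-flight requests (the -c flag).
    async with semaphore:
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        async with session.get(url, headers=headers) as response:
            return await response.text()

def analyze(base_url, html):
    # Extract internal/external links, image sources, and HTML tag counts.
    soup = BeautifulSoup(html, "html.parser")
    host = urlparse(base_url).netloc
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return {
        "internal_links": [link for link in links if urlparse(link).netloc == host],
        "external_links": [link for link in links if urlparse(link).netloc != host],
        "images": [img.get("src") for img in soup.find_all("img")],
        "tag_counts": Counter(tag.name for tag in soup.find_all(True)),
    }

async def scrape_all(urls, concurrency=10):
    # Fetch all URLs concurrently, then analyze each downloaded page.
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(
            *(fetch_page(session, semaphore, url) for url in urls)
        )
    return {url: analyze(url, page) for url, page in zip(urls, pages)}
```

Something like asyncio.run(scrape_all(urls, concurrency=10)) would then drive the whole run, with the semaphore enforcing the concurrency cap rather than firing every request at once.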
Requirements:
- Python 3.7+
- aiohttp
- aiofiles
- beautifulsoup4
- WeasyPrint
- argparse (part of the Python standard library; no installation needed)
To install the required libraries, run:
pip install aiohttp aiofiles beautifulsoup4 WeasyPrint
Usage:
python webscraper.py -u <urls> [-hf <headers_file>] [-c <concurrent_requests>] [-o <output_dir>] [-rf <report_format>]
- -u, --urls: List of URLs to scrape (required)
- -hf, --headers_file: File containing user agent headers (default: user_agents.txt)
- -c, --concurrent_requests: Number of concurrent requests (default: 10)
- -o, --output_dir: Directory to save scraped pages and reports (default: scraped_pages)
- -rf, --report_format: Format of the report (default: csv). Accepted formats: html, pdf, csv
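The flag set above maps naturally onto argparse. The following is a sketch of the option definitions implied by this list, not necessarily the script's exact code:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Scrape web pages and generate reports."
)
parser.add_argument("-u", "--urls", nargs="+", required=True,
                    help="List of URLs to scrape")
parser.add_argument("-hf", "--headers_file", default="user_agents.txt",
                    help="File containing user agent headers")
parser.add_argument("-c", "--concurrent_requests", type=int, default=10,
                    help="Number of concurrent requests")
parser.add_argument("-o", "--output_dir", default="scraped_pages",
                    help="Directory to save scraped pages and reports")
parser.add_argument("-rf", "--report_format", default="csv",
                    choices=["html", "pdf", "csv"],
                    help="Format of the report")
args = parser.parse_args()
```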
Example:
python webscraper.py -u https://example.com https://example.org -hf user_agents.txt -c 10 -o scraped_pages -rf html
This command scrapes https://example.com and https://example.org, saves the scraped pages to the scraped_pages directory, and generates an HTML report for each page.
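As a rough illustration of the report side, the three formats could be produced along these lines. The write_report helper and the shape of its data dict are assumptions made for this sketch; the csv, os, and weasyprint calls themselves are real APIs:

```python
import csv
import os

from weasyprint import HTML

def write_report(data, output_dir, fmt, name="report"):
    # Write one page's extracted data as a CSV, HTML, or PDF report.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"{name}.{fmt}")
    if fmt == "csv":
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["metric", "value"])
            for key, value in data.items():
                writer.writerow([key, value])
    else:
        # Build one HTML table and reuse it for both the HTML and PDF formats.
        rows = "".join(
            f"<tr><td>{key}</td><td>{value}</td></tr>" for key, value in data.items()
        )
        document = f"<html><body><h1>Scrape report</h1><table>{rows}</table></body></html>"
        if fmt == "html":
            with open(path, "w", encoding="utf-8") as f:
                f.write(document)
        else:  # fmt == "pdf": render the same markup with WeasyPrint
            HTML(string=document).write_pdf(path)
    return path
```

Generating the HTML and PDF reports from the same markup keeps the two formats consistent; WeasyPrint simply renders that markup to PDF.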