# Web-Scraping

## Overview

This project is a comprehensive web scraping solution designed to extract news articles from the Business Recorder e-paper website. The scraping operation covers a period from January 2018 to March 2024, collecting and processing article data into a structured dataset for further analysis.

## Features

- **Optimized HTTP Requests:** Uses a `requests.Session` so connections are reused across requests instead of being re-established for every article.
- **Concurrent Scraping:** Uses `concurrent.futures.ThreadPoolExecutor` to fetch multiple articles in parallel, significantly speeding up data collection (see the sketch after this list).
- **Data Extraction:** Parses each page's HTML with BeautifulSoup to extract article titles, dates, URLs, and content.
- **Robust Error Handling:** Catches failed requests so that a single bad page does not stop the run.
- **Data Storage:** Loads the collected data into a pandas DataFrame and saves it as a CSV file for easy access and analysis.
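A minimal sketch of how these pieces fit together. The selectors, helper names, and worker count below are illustrative assumptions, not the repository's actual code:

```python
# Illustrative sketch only -- real tag names/classes depend on the site's markup.
import concurrent.futures

import pandas as pd
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # reuse one connection pool across all requests

def scrape_article(url):
    """Fetch one article page and pull out its title, date, and content."""
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None  # skip failed requests so one bad page doesn't stop the run
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("h1")                            # hypothetical selector
    date = soup.find("time")                           # hypothetical selector
    body = soup.find("div", class_="article-content")  # hypothetical selector
    return {
        "title": title.get_text(strip=True) if title else None,
        "date": date.get_text(strip=True) if date else None,
        "url": url,
        "content": body.get_text(" ", strip=True) if body else None,
    }

def scrape_all(urls, max_workers=10):
    """Scrape many article URLs concurrently and keep the successful results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(scrape_article, urls)
    return pd.DataFrame([r for r in results if r is not None])

# df = scrape_all(article_urls)
# df.to_csv("News.csv", index=False)
```

Threads are a good fit here because the workload is I/O-bound: most time is spent waiting on network responses, not parsing.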

## Installation

1. Clone the repository: `git clone https://github.com/Iqra9999/Web-Scraping`
2. Navigate to the project directory: `cd Web-Scraping`
3. Install the required libraries: `pip install -r requirements.txt`

## Usage

1. Configure the `base_url`, `start_date`, `end_date`, and `max_pages` parameters in the script (an illustrative configuration follows this list).
2. Run the script: `python scrape_news.py`
3. The script scrapes the news articles and saves the data to a CSV file named `News.csv`.
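For illustration, the configurable block at the top of the script might look like the following; the values are placeholders, not the project's real defaults:

```python
# Placeholder values -- substitute your own before running.
base_url = "https://epaper.brecorder.com"  # assumed e-paper base URL
start_date = "2018-01-01"                  # scrape window begins here
end_date = "2024-03-31"                    # ...and ends here
max_pages = 20                             # cap on listing pages crawled per day
```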

## Note

Ensure that the article URLs are unique and change with each article; this code only works if the URLs follow a predictable pattern. You may also need to adapt the `extract_article_url` function to the URL structure of your target website (an illustrative sketch follows).
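A hedged sketch of what an adapted `extract_article_url` could look like, assuming article links share a numeric-ID pattern like `/news/<digits>`; both the selector and the regex are assumptions to adjust for your site:

```python
import re
from urllib.parse import urljoin

def extract_article_url(page_soup, base_url):
    """Collect unique article links from a listing page (illustrative).

    Assumes article URLs contain a numeric ID segment such as /news/40312345;
    change the pattern below to match your site's URL structure.
    """
    urls = set()
    for anchor in page_soup.find_all("a", href=True):
        if re.search(r"/news/\d+", anchor["href"]):  # hypothetical pattern
            urls.add(urljoin(base_url, anchor["href"]))
    return sorted(urls)
```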

## Contributions

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## License

This project is licensed under the MIT License. See the LICENSE file for details.