Skip to content

Iqra9999/Web-Scraping

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Web-Scraping

Overview

This project is a comprehensive web scraping solution designed to extract news articles from the Business Recorder e-paper website. The scraping operation covers a period from January 2018 to March 2024, collecting and processing article data into a structured dataset for further analysis.

Features

Optimized HTTP Requests:

Utilizes requests.Session to manage and optimize HTTP requests.

Concurrent Scraping:

Implements multithreading using concurrent.futures.ThreadPoolExecutor to scrape multiple articles concurrently, significantly speeding up the data collection process.

Data Extraction:

Uses BeautifulSoup to parse HTML and extract key information, including article titles, dates, URLs, and content.

Robust Error Handling:

Includes mechanisms to handle failed requests and ensure continuous scraping.

Data Storage:

Converts the collected data into a pandas DataFrame and saves it as a CSV file for easy access and analysis.

Installation

Clone the repository: git clone https://github.com/Iqra9999/Web-Scraping

Navigate to the project directory: cd Web-Scraping

Install the required libraries: pip install -r requirements.txt

Usage

Configure the base_url, start_date, end_date, and max_pages parameters in the script.

Run the script: python scrape_news.py

The script will scrape the news articles and save the data to a CSV file named News.csv.

Note:

Ensure that the article URLs are unique and change with each article. This code will only work if the URLs follow a predictable pattern. You might also need to change the extract_article_url function in code according to your specific url structure of website.

Contributions

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published