Web-Scraping

Overview

This project is a comprehensive web scraping solution designed to extract news articles from the Business Recorder e-paper website. The scraping operation covers a period from January 2018 to March 2024, collecting and processing article data into a structured dataset for further analysis.

Features

Optimized HTTP Requests:

Utilizes requests.Session to manage and optimize HTTP requests.

Concurrent Scraping:

Implements multithreading using concurrent.futures.ThreadPoolExecutor to scrape multiple articles concurrently, significantly speeding up the data collection process.

Data Extraction:

Uses BeautifulSoup to parse HTML and extract key information, including article titles, dates, URLs, and content.

Robust Error Handling:

Includes mechanisms to handle failed requests and ensure continuous scraping.

Data Storage:

Converts the collected data into a pandas DataFrame and saves it as a CSV file for easy access and analysis.

Installation

Clone the repository: git clone https://github.com/Iqra9999/Web-Scraping

Navigate to the project directory: cd Web-Scraping

Install the required libraries: pip install -r requirements.txt

Usage

Configure the base_url, start_date, end_date, and max_pages parameters in the script.

Run the script: python scrape_news.py

The script will scrape the news articles and save the data to a CSV file named News.csv.

Note:

Ensure that the article URLs are unique and change with each article. This code will only work if the URLs follow a predictable pattern. You might also need to change the extract_article_url function in code according to your specific url structure of website.

Contributions

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Code.ipynb		Code.ipynb
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Web-Scraping

Overview

Features

Optimized HTTP Requests:

Concurrent Scraping:

Data Extraction:

Robust Error Handling:

Data Storage:

Installation

Usage

Note:

Contributions

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

Iqra9999/Web-Scraping

Folders and files

Latest commit

History

Repository files navigation

Web-Scraping

Overview

Features

Optimized HTTP Requests:

Concurrent Scraping:

Data Extraction:

Robust Error Handling:

Data Storage:

Installation

Usage

Note:

Contributions

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages