This project is a comprehensive web scraping solution designed to extract news articles from the Business Recorder e-paper website. The scraping operation covers a period from January 2018 to March 2024, collecting and processing article data into a structured dataset for further analysis.
Utilizes requests.Session to manage and optimize HTTP requests.
Implements multithreading using concurrent.futures.ThreadPoolExecutor to scrape multiple articles concurrently, significantly speeding up the data collection process.
Uses BeautifulSoup to parse HTML and extract key information, including article titles, dates, URLs, and content.
Includes retry and skip handling for failed requests, so a single bad response does not halt the run.
Converts the collected data into a pandas DataFrame and saves it as a CSV file for easy access and analysis. A condensed sketch of this pipeline is shown below the list.
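The core pipeline might look roughly like the following. This is a minimal sketch, not the actual scrape_news.py: the helper names (fetch, parse_article, scrape), the retry policy, and the CSS selectors are illustrative assumptions, since the real e-paper markup will differ.

```python
import concurrent.futures

import pandas as pd
import requests
from bs4 import BeautifulSoup

MAX_RETRIES = 3  # assumption: the real script may use a different retry policy


def fetch(session: requests.Session, url: str) -> str | None:
    """Fetch a page, retrying a few times so one failure does not halt the run."""
    for attempt in range(MAX_RETRIES):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == MAX_RETRIES - 1:
                return None  # give up on this URL, keep scraping the rest
    return None


def parse_article(html: str, url: str) -> dict:
    """Extract title, date, and body text; the tags used here are placeholders."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    date = soup.find("time")
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return {
        "title": title.get_text(strip=True) if title else None,
        "date": date.get_text(strip=True) if date else None,
        "url": url,
        "content": body,
    }


def scrape(urls: list[str]) -> pd.DataFrame:
    """Scrape many article URLs concurrently and collect the results."""
    records = []
    with requests.Session() as session:
        with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
            futures = {pool.submit(fetch, session, u): u for u in urls}
            for future in concurrent.futures.as_completed(futures):
                html = future.result()
                if html is not None:
                    records.append(parse_article(html, futures[future]))
    return pd.DataFrame(records)


if __name__ == "__main__":
    df = scrape(["https://epaper.brecorder.com/..."])  # placeholder URL
    df.to_csv("News.csv", index=False)
```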
Clone the repository: git clone https://github.com/Iqra9999/Web-Scraping
Navigate to the project directory: cd Web-Scraping
Install the required libraries: pip install -r requirements.txt
Configure the base_url, start_date, end_date, and max_pages parameters in the script (an example configuration follows these steps).
Run the script: python scrape_news.py
The script will scrape the news articles and save the data to a CSV file named News.csv.
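For reference, the configuration block at the top of the script might look like this. The variable names match the parameters listed above, but the values shown are only examples:

```python
# Example configuration (values are illustrative; adjust for your run)
base_url = "https://epaper.brecorder.com"  # assumed site root
start_date = "2018-01-01"   # first date to scrape
end_date = "2024-03-31"     # last date to scrape
max_pages = 40              # upper bound on pages to fetch per day
```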
Ensure that the article URLs are unique and change with each article; the script only works if the URLs follow a predictable pattern. You may also need to adapt the extract_article_url function to the specific URL structure of your target website, as in the sketch below.
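As an illustration, a site whose article links sit in anchors inside a listing container might use something like the following. The "div.story" selector is an assumption you would replace after inspecting the target site's actual HTML:

```python
from bs4 import BeautifulSoup


def extract_article_url(page_html: str, base_url: str) -> list[str]:
    """Collect article links from a listing page.

    The "div.story" selector below is a placeholder; inspect the target
    site's markup and substitute the real container and link pattern.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    links = soup.select("div.story a[href]")  # placeholder selector
    return [
        base_url + a["href"] if a["href"].startswith("/") else a["href"]
        for a in links
    ]
```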
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.