This project is a web scraper designed to extract newsletter data from the Bizztreat newsletter site using Scrapy and dlt. The data can be stored in both CSV and Parquet formats.
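As a rough illustration of the scraping side, a Scrapy spider for this kind of listing page could look like the sketch below. The start URL, CSS selectors, and spider name are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of a Scrapy spider; selectors and URL are assumed, not taken
# from the real project.
import scrapy


class NewsletterSpider(scrapy.Spider):
    name = "newsletters"
    start_urls = ["https://www.bizztreat.com/newsletter"]  # assumed listing URL

    def parse(self, response):
        # Each "card" selector is hypothetical; adjust to the real page markup.
        for card in response.css("article.newsletter-card"):
            yield {
                "title": card.css("h2::text").get(),
                "date": card.css("time::attr(datetime)").get(),
                "author": card.css(".author::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
                "image_url": card.css("img::attr(src)").get(),
            }
```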
- Scrapes newsletters (title, date, author, URL, image URL) from the Bizztreat website.
- Supports multiple output formats: CSV and Parquet.
- Utilizes dlt for data pipeline management and local file storage (see the sketch after this list).
- Includes a Makefile for easier management of tasks such as running the scraper, installing dependencies, running tests, linting, and formatting.
- Dockerized for easy deployment and isolated execution.
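As a sketch of the dlt side, scraped items could be loaded to local files roughly as follows. The bucket path, pipeline/dataset/table names, and the sample item are assumptions rather than the project's actual `scraping_pipeline.py`.

```python
# Minimal sketch, assuming dlt's filesystem destination writes to a local
# "_storage" folder; names and paths below are illustrative only.
import dlt

scraped_items = [
    {
        "title": "Example issue",
        "date": "2024-01-01",
        "author": "Jane Doe",
        "url": "https://example.com/newsletter/1",
        "image_url": "https://example.com/img.png",
    },
]

pipeline = dlt.pipeline(
    pipeline_name="newsletter_scraper",
    destination=dlt.destinations.filesystem(bucket_url="_storage"),  # assumed local path
    dataset_name="newsletters",
)

pipeline.run(
    scraped_items,
    table_name="newsletters",
    loader_file_format="parquet",  # "csv" should also be accepted on recent dlt versions
)
```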
- Clone the repository:
  `git clone https://github.com/srpwnd/newsletter-scraper.git`
  `cd newsletter-scraper`
- Set up a Python virtual environment: The project uses a venv (virtual environment) to manage dependencies. Run the following command to create and activate the virtual environment:
  `make venv`
- Install the dependencies: Once the virtual environment is active, install all the necessary dependencies by running:
  `make install`
To run the scraper, use the following command:
`make run`
This will execute the main scraper pipeline defined in `scraping_pipeline.py`.
You can also clean up data files and virtual environments using:
`make clean`
The Makefile provides a series of predefined commands to help with project management. Here is a list of the most common commands:
| Command | Description |
|---|---|
| `make run` | Runs the scraping script (`scraping_pipeline.py`) within the virtual environment. |
| `make venv` | Creates a Python virtual environment using `venv` and installs dependencies. |
| `make install` | Installs dependencies from the `requirements.txt` file into the virtual environment. |
| `make freeze` | Freezes the current set of installed dependencies into `requirements.txt`. |
| `make test` | Runs the unit and integration tests using `unittest`. |
| `make lint` | Lints the codebase using `ruff` to enforce coding standards and automatically fixes issues. |
| `make format` | Formats the codebase using `ruff` formatting rules. |
| `make clean` | Cleans up generated files such as `__pycache__`, virtual environments, and CSV/Parquet data files. |
| `make clean-data` | Cleans only the data files (CSV, Parquet) stored in `_storage` directories. |
| `make clean-venv` | Removes the virtual environment folder. |
Unit and integration tests are located in the `tests/` directory. To run all tests, use:
`make test`
This will automatically discover and run all the test cases using `unittest`.
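For orientation, a test module under `tests/` might look like the minimal sketch below; the file name, the checked field set, and the sample item are hypothetical, not the project's actual tests.

```python
# tests/test_items.py — illustrative only; the real suite and its helpers
# are not shown here.
import unittest


class NewsletterItemTest(unittest.TestCase):
    EXPECTED_FIELDS = {"title", "date", "author", "url", "image_url"}

    def test_item_has_expected_fields(self):
        # Stand-in item; in the real suite this would come from the spider.
        item = {
            "title": "Example issue",
            "date": "2024-01-01",
            "author": "Jane Doe",
            "url": "https://example.com/newsletter/1",
            "image_url": "https://example.com/img.png",
        }
        self.assertEqual(set(item), self.EXPECTED_FIELDS)


if __name__ == "__main__":
    unittest.main()
```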
The project is Dockerized to simplify running in isolated environments. You can use Docker and Docker Compose to run the scraper without needing to install the dependencies manually.
- Build the Docker image:
  `docker build -t newsletter-scraper .`
- Run the scraper using Docker:
  `docker run newsletter-scraper`
- Run the scraper with Docker Compose:
  `docker-compose up --build`
This will build the image (if necessary) and run the scraper automatically inside a container.
The output data is stored in the `_storage` directory, which contains subdirectories for both CSV and Parquet formats, depending on the pipeline you use.
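For a quick look at the scraped data, the Parquet output can be inspected with pandas; the glob below assumes the default `_storage` layout and column names, so it may need adjusting to the actual output.

```python
# Inspect whatever Parquet files the pipeline produced under _storage
# (requires pandas with pyarrow installed).
from pathlib import Path

import pandas as pd

parquet_files = sorted(Path("_storage").rglob("*.parquet"))
if parquet_files:
    df = pd.concat(pd.read_parquet(p) for p in parquet_files)
    print(df.head())  # column names depend on how the pipeline stores items
```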