Data Pipeline Project - Scraping Reddit Posts

Overview:

  • The goal of this project was to build a small, reusable framework for scraping text data from Reddit.

  • It scrapes new posts from Reddit with no preference for topic or subreddit.

  • I wanted to get more familiar with using multiple Docker containers and volumes to share data, so the project has three main pieces: a container that scrapes the data (and logs results), a shared data volume for storage (SQLite/CSV), and a container running a Streamlit app to visualize and explore the data a bit.

Summary:

  • I learned that a number of bots (which are not always obviously named) post to various subreddits. Some are sanctioned (automoderators) and some likely are not (spamming crypto subreddits, financial advice, other advertising). The most popular subreddits (by subscriber count) do not necessarily get the most posts, because of the volume of (likely) bot activity.
Architecture:

%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
graph TD
    subgraph Docker Compose
    subgraph Webscraper container
    B{Scraper} --> |get data as json|C[Pandas]
    Z[Pytest] --> |run tests|B
    end
    subgraph Shared Volume
    C --> |Save post content|E[CSV]
    C --> |Data quality statistics|D[(SQLite Logs)]
    end
    subgraph Streamlit Container
    P[Streamlit]
    E --> P
    end
    end
    

Project Structure:

  • Scrape new Reddit posts programmatically, simulating "batch" data ingestion (see the sketch after this list)

  • Parse the scraped, nested JSON data

  • Clean out excess metadata

  • Use pytest for testing

  • Track data quality metrics

  • Store the metrics in a SQLite database

  • Write chunks of data to a CSV

  • Clean the CSV data

  • Use Docker/Docker Compose for modularity

  • Use a containerized Streamlit app for a dashboard/displaying outputs
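To make the list above concrete, here is a minimal sketch of one scrape-and-store iteration. It is not the repo's actual code: the endpoint, file paths, kept columns, and metric choices are assumptions for illustration. requests fetches Reddit's public JSON listing, pandas flattens it, and the results land on the shared volume as a CSV plus a SQLite quality log.

```python
import sqlite3
import time
from pathlib import Path

import pandas as pd
import requests

# Hypothetical paths on the shared volume -- the repo's real names may differ.
CSV_PATH = Path("data/posts.csv")
DB_PATH = Path("data/quality_logs.db")
URL = "https://www.reddit.com/r/all/new.json"  # public "new posts" listing


def scrape_batch(limit: int = 100) -> pd.DataFrame:
    """Fetch one batch of new posts and flatten the nested JSON."""
    resp = requests.get(
        URL,
        params={"limit": limit},
        headers={"User-Agent": "pipeline-demo/0.1"},  # Reddit rejects default UAs
        timeout=10,
    )
    resp.raise_for_status()
    posts = [child["data"] for child in resp.json()["data"]["children"]]
    df = pd.json_normalize(posts)  # flatten the nested structure
    # Drop excess metadata, keeping only a few columns of interest (assumed set).
    return df[["subreddit", "author", "title", "selftext", "created_utc"]]


def log_quality(df: pd.DataFrame) -> None:
    """Store simple data-quality metrics for this batch in SQLite."""
    metrics = pd.DataFrame({
        "scraped_at": [time.time()],
        "rows": [len(df)],
        "empty_selftext": [int((df["selftext"] == "").sum())],
        "duplicate_titles": [int(df["title"].duplicated().sum())],
    })
    with sqlite3.connect(DB_PATH) as conn:
        metrics.to_sql("quality_log", conn, if_exists="append", index=False)


def append_chunk(df: pd.DataFrame) -> None:
    """Append this batch to the shared CSV, writing the header only once."""
    df.to_csv(CSV_PATH, mode="a", header=not CSV_PATH.exists(), index=False)


if __name__ == "__main__":
    CSV_PATH.parent.mkdir(parents=True, exist_ok=True)
    batch = scrape_batch()
    log_quality(batch)
    append_chunk(batch)
```

A pytest test for code like this would typically feed scrape_batch a canned JSON fixture rather than hitting Reddit on every run.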

Config settings:

In the app/config.yml file, you can change the sleep time between HTTP requests (don't make it too small or you'll get banned), the number of posts to fetch per request, and the total number of iterations/calls to make.
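A sketch of how those settings might be read and used, assuming PyYAML; the key names here are guesses, so check app/config.yml for the real ones:

```python
import time

import yaml  # PyYAML

with open("app/config.yml") as f:
    config = yaml.safe_load(f)

# Key names are assumptions -- match them to the actual file.
sleep_seconds = config["sleep_time"]
posts_per_request = config["posts_per_request"]
total_iterations = config["iterations"]

for _ in range(total_iterations):
    # scrape_batch(limit=posts_per_request)  # one request per iteration
    time.sleep(sleep_seconds)  # throttle to stay under Reddit's rate limits
```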

How to Run:

  1. Clone the repository

  2. Run the docker commands in the root directory (which contains the compose.yml file):

docker compose build

docker compose up

  3. To open the Streamlit app, open the Network URL printed in the terminal in a browser.

  4. After it finishes running, remove the containers:

docker compose down --volumes

  5. To clean up remaining containers/volumes (assuming you don't have others on your system you need to keep):

docker system prune -a

Note: while the scraper is still running, refreshing the Streamlit app with "Ctrl + R" updates the dashboard to the latest scraped data.
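That refresh behavior falls out of how Streamlit works: the script reruns top to bottom on every browser refresh, so a fresh read of the CSV picks up whatever the scraper has appended since. A minimal sketch of such a dashboard (the CSV path and chart choices are assumptions, not the repo's actual app):

```python
import pandas as pd
import streamlit as st

st.title("Reddit scrape explorer")

# Read from the shared volume on every rerun (i.e., every browser refresh).
df = pd.read_csv("data/posts.csv")

st.metric("Posts scraped so far", len(df))
st.bar_chart(df["subreddit"].value_counts().head(20))  # most active subreddits
st.dataframe(df.tail(50))  # most recently scraped posts
```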

Alternate options to run it:

In the app directory, there are instructions for running the scraping portion of the app without Docker Compose. It can be run either locally (in a virtual environment) or as a single Docker container.
