This project demonstrates the data modelling process for Apache Cassandra, a NoSQL database. The database here backs a music streaming website. Cassandra is optimized for fast writes; to get fast reads as well, tables must be fully denormalized and designed with specific queries in mind. In other words, each table in the Cassandra database is modelled around the single query it serves, with data redundancy (overlap between tables) permitted. In some sense this is an easier process than relational modelling; the real challenge lies in how data is partitioned and sorted across nodes.
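As a minimal sketch of query-first modelling, assume a hypothetical query: "fetch the artist and song title for a given session and item in session". The table below (all names are illustrative, not the project's actual schema) is shaped around exactly that query:

```python
# The partition key (session_id) determines which node stores the row;
# the clustering column (item_in_session) sorts rows within a partition.
# Together they let the one intended query be answered with a single
# partition lookup. All identifiers here are hypothetical.
create_songs_by_session = """
CREATE TABLE IF NOT EXISTS songs_by_session (
    session_id      int,
    item_in_session int,
    artist          text,
    song_title      text,
    PRIMARY KEY (session_id, item_in_session)
)
"""
```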
The project is a Jupyter notebook that follows these steps:
- Walk through the folder containing the event data.
- Read in the CSV files, each of which contains individual events.
- Extract the relevant columns (values) from each CSV and write them into a new CSV that houses all events (see the file-crawl sketch after this list).
- Initiate the Cassandra cluster and define a keyspace (see the connection sketch below).
- Create all tables within this keyspace.
- Validate the tables by running the queries they were designed for.
- Drop the tables and close the connection (creation, validation, and teardown are sketched below).
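The crawl-and-consolidate step might look like the following sketch; the `event_data` directory name, the output file name, and the column names are assumptions, since the source does not name them.

```python
import glob
import os

import pandas as pd

# Collect every CSV file path under the (assumed) event_data directory.
file_paths = glob.glob(os.path.join(os.getcwd(), "event_data", "**", "*.csv"),
                       recursive=True)

# Read each event file and concatenate them into one DataFrame.
frames = [pd.read_csv(path) for path in file_paths]
events = pd.concat(frames, ignore_index=True)

# Keep only the relevant columns (names are hypothetical), drop rows
# without an artist, then write the consolidated CSV the tables load from.
columns = ["artist", "song", "length", "sessionId", "itemInSession"]
events = events[columns].dropna(subset=["artist"])
events.to_csv("event_datafile_new.csv", index=False)
```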
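Connecting to the cluster and defining the keyspace could be sketched as follows; the contact address, keyspace name, and replication settings are assumptions (SimpleStrategy with a replication factor of 1 is typical for a single-node, local setup).

```python
from cassandra.cluster import Cluster

# Connect to a Cassandra instance running locally (assumed address).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Create a keyspace; SimpleStrategy with replication_factor 1 suits a
# single-node development cluster.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS music_streaming
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("music_streaming")
```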
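Table creation, loading, validation, and teardown might then look like this sketch. It continues from the connection above and reuses the hypothetical `create_songs_by_session` statement and consolidated CSV from the earlier sketches; the query key values are arbitrary examples.

```python
import csv

# Create the query-specific table inside the keyspace.
session.execute(create_songs_by_session)

# Load the consolidated events into the table.
insert_cql = """
    INSERT INTO songs_by_session (session_id, item_in_session, artist, song_title)
    VALUES (%s, %s, %s, %s)
"""
with open("event_datafile_new.csv", newline="", encoding="utf8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for artist, song, length, session_id, item_in_session in reader:
        session.execute(insert_cql,
                        (int(session_id), int(item_in_session), artist, song))

# Validate: run the exact query the table was designed for
# (the key values 338 and 4 are arbitrary examples).
rows = session.execute(
    "SELECT artist, song_title FROM songs_by_session "
    "WHERE session_id = %s AND item_in_session = %s",
    (338, 4),
)
for row in rows:
    print(row.artist, row.song_title)

# Teardown: drop the table and close the connection.
session.execute("DROP TABLE IF EXISTS songs_by_session")
session.shutdown()
cluster.shutdown()
```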
Tools used:
- The Python Cassandra driver, aptly named `cassandra` (installed as `cassandra-driver`)
- Pandas for data manipulation
- `os` and `glob` for crawling the file repository
A remaining to-do is to modularize the code.