CSCI620 - Introduction to Big Data
Project Repository, Spring 2023

Authors/Collaborators -

Athina Stewart
Archit Joshi
Chengzi Cao
Parijat Kawale

Overview

This repository is a collection of JS, Python and SQL scripts that load and analyze the MyAnimeList dataset from Kaggle for the CSCI620 course project. Data loading and analysis are performed in both PostgreSQL and MongoDB, and the two systems are compared.
Dataset - https://www.kaggle.com/datasets/azathoth42/myanimelist

The project is split into 3 phases.
Please refer to each phase report for details.
Each subsequent phase assumes the previous phase has been executed and the data is present in both PostgreSQL and MongoDB.
Refer to CSCI620 - Project Presentation.pptx for an overview of the project lifecycle.

Phase 1

This phase deals with selecting a viable dataset to load and analyze using a relational model.
Please refer to CSCI620 Project Report Phase 1.pdf for more information and script execution instructions.
The objectives of this phase include -
1. Select one or more datasets. The final dataset needs to be large (~50M tuples in a relational database), and interesting enough so you can perform meaningful queries and mine meaningful information from it.
2. Provide a description of the data and a meaningful relational model to faithfully represent the dataset.
3. Provide a program to load the dataset.
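
The loading objective can be illustrated with a minimal sketch of a batched CSV loader. This is not the repository's loader: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server, and the table name, column names, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice

# Hypothetical sample standing in for one of the dataset's CSV files.
SAMPLE_CSV = """anime_id,title,score
1,Example Title A,8.5
2,Example Title B,7.0
3,Example Title C,6.5
"""

def load_csv(conn, csv_file, batch_size=2):
    """Insert CSV rows into a hypothetical anime table in batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS anime ("
        "anime_id INTEGER PRIMARY KEY, title TEXT NOT NULL, score REAL)"
    )
    reader = csv.DictReader(csv_file)
    total = 0
    while True:
        # Pull at most batch_size rows so memory use stays bounded.
        batch = [
            (int(r["anime_id"]), r["title"], float(r["score"]))
            for r in islice(reader, batch_size)
        ]
        if not batch:
            break
        conn.executemany("INSERT INTO anime VALUES (?, ?, ?)", batch)
        total += len(batch)
    conn.commit()
    return total

conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, io.StringIO(SAMPLE_CSV))
print(loaded)
```

Batching matters at the scale the objective mentions (~50M tuples), since reading an entire file into memory before inserting is unlikely to be practical.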

Phase 2

This phase deals with proposing a document-oriented model for the dataset and loading the data into it.
Please refer to CSCI620 Project Report Phase 2.pdf for more information and script execution instructions.
The objectives of this phase include -
1. Propose a document-oriented model for the dataset and compare it with the relational model.
2. Provide code to load your data into this model.
3. Provide a program that issues at least five interesting SQL queries over the previous relational model and propose indexes to speed up query execution (report your timings).
4. Discover and explain functional dependencies and discuss normalization with respect to the relational model you provided in Phase 1.
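
To make the contrast between the two models concrete, the following sketch reshapes flat relational rows into nested documents of the kind a document store holds. The field names (`_id`, `location`, `ratings`) and the sample rows are assumptions for illustration; the model actually proposed is described in the Phase 2 report.

```python
from collections import defaultdict

# Hypothetical relational rows.
users = [("user_a", "Country X"), ("user_b", "Country Y")]  # (username, location)
ratings = [                                                   # (username, anime_id, score)
    ("user_a", 1, 9),
    ("user_a", 2, 7),
    ("user_b", 1, 8),
]

def to_documents(users, ratings):
    """Embed each user's rating rows inside a single user document."""
    by_user = defaultdict(list)
    for username, anime_id, score in ratings:
        by_user[username].append({"anime_id": anime_id, "score": score})
    return [
        {"_id": username, "location": location, "ratings": by_user[username]}
        for username, location in users
    ]

docs = to_documents(users, ratings)
print(docs[0])
```

In the relational model the same information needs a join between two tables at query time; in the embedded form one document read returns a user together with all of their ratings, at the cost of larger documents and harder updates when a rating list grows.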

Phase 3

This phase deals with data cleaning, integration and itemset mining.
Please refer to CSCI620 Project Report Phase 3.pdf for more information and script execution instructions.
The objectives of this phase include -
1. Provide a program that cleans and integrates your dataset.
2. Generate and discuss a few statistical observations from the dataset.
3. Provide a program that applies itemset mining to the dataset to discover association rules. We opted to use the Apriori algorithm for this.
4. Provide a comparison study of which model is the better fit for the dataset and the tasks performed on it.
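
The itemset mining objective above names the Apriori algorithm. A minimal pure-Python sketch of it follows; it is not the code in postgres_itemsetmining.py or monogodb_itemsetmining.py, and the transactions (a set of genres per user) and the support threshold are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from support counts."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Hypothetical transactions: the set of genres each user has watched.
transactions = [
    {"Action", "Comedy", "Drama"},
    {"Action", "Comedy"},
    {"Action", "Drama"},
    {"Comedy", "Drama"},
]
frequent = apriori(transactions, min_support=2)
rule_conf = confidence(frequent, {"Action"}, {"Comedy"})
print(len(frequent), rule_conf)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted against the data if all of its smaller subsets were already found frequent.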

Folders and files

.idea
.gitattributes
.gitignore
CSCI620 - Project Presentation.pptx
CSCI620 Project Report Phase 1.pdf
CSCI620 Project Report Phase 2.pdf
CSCI620 Project Report Phase 3.pdf
README.md
cleanData.sql
createIndex.sql
createMainTables.sql
createTables.sql
exportToMongo.sql
interestingQueries.sql
loadRawData.sql
mongodb_datainsert.py
monogodb_itemsetmining.py
postgres_itemsetmining.py
stats.py

JoshiArchit/MyAnimeList-Data-Analysis
