JoshiArchit/MyAnimeList-Data-Analysis

CSCI620 - Introduction to Big Data
Project Repository, Spring 2023

Authors/Collaborators -

  • Athina Stewart
  • Archit Joshi
  • Chengzi Cao
  • Parijat Kawale

Overview

This repository is a collection of JS, Python and SQL scripts that load and analyze the MyAnimeList dataset from Kaggle for the CSCI620 course project. Data loading and analysis are performed in both PostgreSQL and MongoDB, and the two systems are compared.
Dataset - https://www.kaggle.com/datasets/azathoth42/myanimelist
  • The project is split into 3 phases.
  • Please refer to each Phase report for information.
  • Each subsequent phase assumes the previous phase has been executed and the data is present in both PostgreSQL and MongoDB.
  • Refer to CSCI620 - Project Presentation for an overview of the project lifecycle.

Phase 1

  • This phase deals with selecting a viable dataset to load and analyze using a relational model.
  • Please refer to CSCI620 - Report Phase 1.pdf for more information and script execution instructions.
  • The Objectives of this phase include -
    1. Select one or more datasets. The final dataset needs to be large (~50M tuples in a relational database), and interesting enough so you can perform meaningful queries and mine meaningful information from it.
    2. Provide a description of the data and a meaningful relational model to faithfully represent the dataset.
    3. Provide a program to load the dataset.
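
The load step in objective 3 can be sketched roughly as follows. This is a minimal illustration, not the project's actual loader: SQLite (from the Python standard library) stands in for PostgreSQL so the example runs without a server, and the table name, column names and sample rows are hypothetical. The actual schema and execution instructions are in the Phase 1 report.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for one of the Kaggle CSV files;
# the column names are assumptions, not the project's actual schema.
SAMPLE_CSV = """anime_id,title,episodes
1,Cowboy Bebop,26
5,Cowboy Bebop: Tengoku no Tobira,1
6,Trigun,26
"""


def load_anime(conn, csv_file):
    """Create the table if needed and bulk-insert rows from a CSV file object."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS anime ("
        " anime_id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " episodes INTEGER)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(int(r["anime_id"]), r["title"], int(r["episodes"])) for r in reader]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO anime VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
loaded = load_anime(conn, io.StringIO(SAMPLE_CSV))
total = conn.execute("SELECT COUNT(*) FROM anime").fetchone()[0]
print(loaded, total)
```

Inserting all rows inside a single transaction keeps the load atomic, which matters more as the input grows toward the tuple counts the assignment asks for.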

Phase 2

  • This phase deals with proposing a document-oriented model for the dataset and loading the data into it.
  • Please refer to CSCI620 - Report Phase 2.pdf for more information and script execution instructions.
  • The Objectives of this phase include -
    1. Propose a document-oriented model for the dataset and compare it with the relational model.
    2. Provide code to load your data into this model.
    3. Provide a program that issues at least five interesting SQL queries over the previous relational model and propose indexes to speed up query execution (report your timings).
    4. Discover and explain functional dependencies and discuss normalization with respect to the relational model you provided in Phase 1.
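
To make the relational versus document-oriented comparison in objectives 1 and 2 concrete, here is a small sketch that folds rows from two relational tables into one nested document per anime. The field names and sample rows are hypothetical, and the embedding layout shown is only one possible design; the model the project actually proposes is described in the Phase 2 report.

```python
from collections import defaultdict

# Hypothetical relational rows; field names are illustrative assumptions.
anime_rows = [
    {"anime_id": 1, "title": "Cowboy Bebop"},
    {"anime_id": 6, "title": "Trigun"},
]
rating_rows = [
    {"username": "user_a", "anime_id": 1, "score": 9},
    {"username": "user_b", "anime_id": 1, "score": 8},
    {"username": "user_a", "anime_id": 6, "score": 7},
]


def to_documents(anime, ratings):
    """Fold child rows into their parent row, one document per anime."""
    by_anime = defaultdict(list)
    for r in ratings:
        by_anime[r["anime_id"]].append(
            {"username": r["username"], "score": r["score"]}
        )
    return [
        {"_id": a["anime_id"], "title": a["title"], "ratings": by_anime[a["anime_id"]]}
        for a in anime
    ]


docs = to_documents(anime_rows, rating_rows)
print(docs[0])
```

Embedding replaces a join with a single document read, but it suits child lists of bounded size. MongoDB caps a document at 16 MB, so a very large child set may be better modelled with references between collections.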

Phase 3

  • This phase deals with data cleaning, integration and item set mining.
  • Please refer to CSCI620 - Report Phase 3.pdf for more information and script execution instructions.
  • The Objectives of this phase include -
    1. Provide a program that cleans and integrates your dataset.
    2. Generate and discuss a few statistical observations from the dataset.
    3. Provide a program that applies itemset mining to the dataset to discover association rules. We have opted to use the Apriori algorithm for this.
    4. Provide a comparison study of which model is the better fit for the dataset and the tasks performed on it.
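
As a rough illustration of the itemset mining in objective 3, the sketch below implements a textbook Apriori pass in plain Python and derives association rules from the frequent itemsets. It is not the project's implementation: the function names, thresholds and sample transactions (sets of genres per user) are hypothetical, and the project's program is covered in the Phase 3 report.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # level-1 candidates
    frequent = {}
    k = 1
    while current:
        level = {}
        for c in current:
            s = support(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting the threshold."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]  # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    out.append((lhs, itemset - lhs, conf))
    return out


# Hypothetical transactions: the set of genres each user has watched.
transactions = [
    {"Action", "Comedy"},
    {"Action", "Comedy", "Drama"},
    {"Action", "Drama"},
    {"Comedy"},
]
frequent = apriori(transactions, min_support=0.5)
found = rules(frequent, min_confidence=0.9)
print(found)
```

With these sample transactions the only rule that clears a confidence of 0.9 is Drama implies Action, because every transaction containing Drama also contains Action.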
