- Athina Stewart
- Archit Joshi
- Chengzi Cao
- Parijat Kawale
Dataset - https://www.kaggle.com/datasets/azathoth42/myanimelist
- The project is split into three phases.
- Please refer to each phase report for details.
- Each subsequent phase assumes the previous phase has been executed and its data is present in PostgreSQL and MongoDB.
- Refer to CSCI620 - Project Presentation for an overview of the project lifecycle.
- Phase 1 deals with selecting a viable dataset, loading it into a relational model, and preparing it for analysis.
- Please refer to CSCI620 - Report Phase 1.pdf for more information and script execution instructions.
The objectives of this phase include:
- Select one or more datasets. The final dataset needs to be large (~50M tuples in a relational database) and interesting enough to support meaningful queries and to mine meaningful information from it.
- Provide a description of the data and a meaningful relational model to faithfully represent the dataset.
- Provide a program to load the dataset.
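
As a rough illustration of the loading objective above, the following minimal sketch reads CSV rows and bulk-inserts them into a relational table. It is not the project's loader: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server, and the `anime` table, its columns, and the sample rows are hypothetical placeholders rather than the dataset's real schema.

```python
import csv
import io
import sqlite3

# Placeholder rows standing in for one of the dataset's CSV files (not real data).
SAMPLE_CSV = """anime_id,title,episodes
1,Example A,26
2,Example B,1
3,Example C,12
"""


def load_anime(conn, csv_file):
    """Create a hypothetical `anime` table and bulk-insert rows from a CSV file object."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS anime ("
        "anime_id INTEGER PRIMARY KEY, title TEXT NOT NULL, episodes INTEGER)"
    )
    reader = csv.DictReader(csv_file)
    # Convert text fields to the column types before inserting.
    rows = [(int(r["anime_id"]), r["title"], int(r["episodes"])) for r in reader]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO anime (anime_id, title, episodes) VALUES (?, ?, ?)", rows
        )
    return len(rows)


# sqlite3 in-memory database used only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
loaded = load_anime(conn, io.StringIO(SAMPLE_CSV))
count = conn.execute("SELECT COUNT(*) FROM anime").fetchone()[0]
print(loaded, count)
```

For a dataset of the size targeted here, a bulk path such as PostgreSQL's `COPY` is usually much faster than row-by-row inserts; the sketch only shows the overall shape of a loader.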
- Phase 2 deals with proposing a document-oriented model for the dataset and loading the data into it.
- Please refer to CSCI620 - Report Phase 2.pdf for more information and script execution instructions.
The objectives of this phase include:
- Propose a document-oriented model for the dataset and compare it with the relational model.
- Provide code to load your data into this model.
- Provide a program that issues at least five interesting SQL queries over the previous relational model and propose indexes to speed up query execution (report your timings).
- Discover and explain functional dependencies and discuss normalization with respect to the relational model provided in Phase 1.
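
To make the functional dependency objective above concrete, here is a minimal sketch of how a candidate dependency `X -> Y` can be checked against a set of rows: the dependency holds if no two rows agree on `X` but differ on `Y`. The helper name `holds`, the column names, and the sample rows are assumptions made for illustration, not the project's schema or code.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are lists of column names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The first value seen for a key is remembered; any later mismatch violates the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical denormalized rows (one row per anime/genre pair), not real data.
rows = [
    {"anime_id": 1, "title": "Example A", "studio": "Studio X", "genre": "Action"},
    {"anime_id": 1, "title": "Example A", "studio": "Studio X", "genre": "Drama"},
    {"anime_id": 2, "title": "Example B", "studio": "Studio X", "genre": "Action"},
]

id_determines_title = holds(rows, ["anime_id"], ["title"])   # holds in this sample
studio_determines_title = holds(rows, ["studio"], ["title"])  # violated by Studio X
id_determines_genre = holds(rows, ["anime_id"], ["genre"])    # violated by anime 1
print(id_determines_title, studio_determines_title, id_determines_genre)
```

In this toy sample, `anime_id -> title` holds while `anime_id -> genre` does not, which is the kind of observation that motivates moving genres into a separate table during normalization. A check on sample data can only refute a dependency; whether it holds in general is a statement about the domain.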
- Phase 3 deals with data cleaning, integration, and itemset mining.
- Please refer to CSCI620 - Report Phase 3.pdf for more information and script execution instructions.
The objectives of this phase include:
- Provide a program that cleans and integrates your dataset.
- Generate and discuss a few statistical observations from the dataset.
- Provide a program that applies itemset mining to the dataset to discover association rules. We use the Apriori algorithm for this.
- Provide a comparison study of which model (relational or document-oriented) better fits the dataset and the tasks performed on it.
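
For the itemset mining objective above, the following is a minimal pure-Python sketch of Apriori followed by association rule generation. It is not the project's implementation: the function names `apriori` and `rules`, the thresholds, and the toy genre transactions are assumptions chosen only to show the join, prune, and count steps.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) tuples meeting min_confidence."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(itemset, r):
                ante = frozenset(ante)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                conf = supp / frequent[ante]
                if conf >= min_confidence:
                    found.append((ante, itemset - ante, conf))
    return found


# Toy transactions (hypothetical genre sets, not taken from the dataset).
transactions = [
    {"Action", "Adventure", "Comedy"},
    {"Action", "Adventure"},
    {"Action", "Comedy"},
    {"Drama", "Romance"},
    {"Action", "Adventure", "Drama"},
]
frequent = apriori(transactions, min_support=0.5)
found = rules(frequent, min_confidence=0.7)
for ante, cons, conf in found:
    print(sorted(ante), "->", sorted(cons), round(conf, 2))
```

The sketch rescans all transactions for every candidate, which is fine for a toy example but would need a more efficient counting strategy at the scale of the full dataset.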