In this lesson, we'll review all the guidelines and specifications for the final project for Module 5.
- Understand all required aspects of the Final Project for Module 5
- Understand all required deliverables
- Understand what constitutes a successful project
Congratulations! You've made it through another intense module, and now you're ready to show off your newfound Machine Learning skills!
All that remains for Module 5 is to complete the final project!
For this project, you're going to select a dataset of your choosing and create a classification model. You'll start by identifying a problem you can solve with classification, and then identify a dataset. You'll then use everything you've learned about Data Science and Machine Learning thus far to source a dataset, preprocess and explore it, and then build and interpret a classification model that answers your chosen question.
We encourage you to be very thoughtful when identifying your problem and selecting your data set--an overscoped project goal or a poor data set can quickly bring an otherwise promising project to a grinding halt.
To help you select an appropriate data set for this project, we've set some guidelines:
- Your dataset should work for classification. The task can be either binary or multiclass, as long as it's a classification problem.
- Your dataset needs to be of sufficient complexity. Avoid extremely small or overly simple datasets, as well as the most common ones such as Titanic, Iris, and MNIST. We want to see all the steps of the Data Science Process in this project--it's okay if the dataset is mostly clean, but we expect to see some preprocessing and exploration. See the following section, Data Set Constraints, for more information on this.
- On the other end of the spectrum, don't pick a problem that's too complex, either. Stick to problems where you have a clear idea of how machine learning can solve them. For now, we recommend you stay away from overly complex problems in the domains of Natural Language Processing or Computer Vision--although those domains make use of Supervised Learning, they come with a lot of other special requirements and techniques that you don't know yet (but you'll learn soon!). If your chosen problem feels overscoped, then it probably is. If you aren't sure whether your problem scope is appropriate, double check with your instructor!
- Serious Bonus Points if some or all of the data is data you have to source yourself through web scraping or interacting with a third-party API! Projects that show off your ability to source data effectively make you look that much more impressive when showing your work to potential employers!
When selecting a data set, be sure to take into consideration the following constraints:
- Your data set can't be one we've already worked with in any labs.
- Your data set should contain a minimum of 1000 rows.
- Your data set should contain a minimum of 10 predictor columns, before any one-hot encoding is performed.
- Your instructor must provide final approval on your data set.
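The two size constraints above are easy to check before you ask for approval. The following is a minimal sketch, not an official checker: the function name `check_dataset_constraints`, its parameters, and the example column names are all hypothetical, and it assumes you count predictor columns on the raw data, before any one-hot encoding.

```python
def check_dataset_constraints(n_rows, column_names, target,
                              min_rows=1000, min_predictors=10):
    """Return a list of problems; an empty list means both size checks pass."""
    # Every column except the target counts as a predictor (an assumption;
    # drop ID-like columns yourself before calling this).
    predictors = [name for name in column_names if name != target]
    problems = []
    if n_rows < min_rows:
        problems.append(f"only {n_rows} rows; need at least {min_rows}")
    if len(predictors) < min_predictors:
        problems.append(
            f"only {len(predictors)} predictor columns; "
            f"need at least {min_predictors}"
        )
    return problems


# Hypothetical example: 1,500 rows, 11 raw predictors plus a target column.
columns = [f"feature_{i}" for i in range(11)] + ["churned"]
print(check_dataset_constraints(1500, columns, target="churned"))  # prints []
```

If your data is already in a pandas DataFrame named `df`, you could pass `len(df)` and `list(df.columns)` as the first two arguments. Passing these checks does not replace instructor approval.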
There are two ways that you can go about getting started: Problem-First or Data-First.
Problem-First: Start with a problem that you want to solve with classification, and then try to find the data you need to solve it. If you can't find any data to solve your problem, then you should pick another problem.
Data-First: Take a look at some of the most popular internet repositories of cool data sets we've listed below. If you find a data set that's particularly interesting for you, then it's totally okay to build your problem around that data set.
There are plenty of amazing places that you can get your data from. We recommend you start looking at data sets in some of these resources first:
- UCI Machine Learning Datasets Repository
- Kaggle Datasets
- Awesome Datasets Repo on GitHub
- New York City Open Data Portal
- Inside Airbnb
For online students, your completed project should contain the following four deliverables:
- A Jupyter Notebook containing any code you've written for this project. This work will need to be pushed to your GitHub repository in order to submit your project.
- An organized README.md file in the GitHub repository that describes the contents of the repository. This file should be the source of information for navigating through the repository.
- A Blog Post.
- An "Executive Summary" PowerPoint Presentation that gives a brief overview of your problem/dataset, and each step of the OSEMN process.
Note: On-campus students may have different deliverables. Please speak with your instructor.
For this project, your Jupyter Notebook should meet the following specifications:
Organization/Code Cleanliness
- The notebook should be well organized and easy to follow, with code commented where appropriate.
- Level Up: The notebook contains well-formatted, professional-looking markdown cells explaining any substantial code. All functions have docstrings that act as professional-quality documentation.
- The notebook is written for a technical audience, giving readers a way to both understand your approach and reproduce your results. The target audience for this deliverable is other data scientists looking to validate your findings.
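As an illustration of the docstring expectation above, here is a short sketch of a documented helper. The function `class_balance` is a hypothetical example, and the NumPy-style section layout is just one common convention, not something the project requires.

```python
from collections import Counter


def class_balance(labels):
    """Compute the share of observations in each class.

    Parameters
    ----------
    labels : iterable of hashable
        Target values, one per observation.

    Returns
    -------
    dict
        Maps each class label to its proportion of all observations.

    Raises
    ------
    ValueError
        If `labels` is empty.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must contain at least one observation")
    return {label: count / total for label, count in counts.items()}


# Hypothetical target column with one "yes" and three "no" labels.
print(class_balance(["yes", "no", "no", "no"]))  # prints {'yes': 0.25, 'no': 0.75}
```

A docstring like this tells another data scientist what goes in, what comes out, and what can go wrong, without making them read the function body.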
Process, Methodology, and Findings
- Your notebook should contain a clear record of your process and methodology for exploring and preprocessing your data, building and tuning a model, and interpreting your results.
- We recommend you use the OSEMN process to help organize your thoughts and stay on track.
Refer back to the Blogging Guidelines for the technical requirements and blog ideas.
Online students can find a PDF of the grading rubric for the project here. Note: On-campus students may have different requirements. Please speak with your instructor.