Note: This page may be updated in the future with additional information.
- Course: CSC 405/605 - Data Science
- Schedule: Monday and Wednesday 5:30 pm - 6:45 pm
- Instructor: Dr. Kelechi M. Ikegwu
- Location: 219 Petty
Class Discussions: https://discord.gg/N2rhnw7h3J
- The #welcome-and-rules channel contains the rules for the channel
- Class announcements will be in #announcements
- #resources will contain interesting or useful articles about data science
- #class-discussions channel for class discussions
- #group-formation for forming groups
- Use the #off-topic channel for any topics not related to data science
- Office Hours: Thursday 5:00 pm - 6:00 pm via Zoom only (email for appointment and Zoom link)
- Email: [email protected]
In a world with ever-increasing amounts of data generated by humans and machines alike, the field of computer science has seen a transition from computation-intensive solutions to data-intensive ones. In such a scenario, solutions to real-world problems can often be derived or learned by analyzing disparate, complex, and messy datasets using Data Science methods and approaches.
This course is highly interactive, and will explore the theories, techniques, and the tools necessary to gain insights from such datasets. Using a problem-based learning philosophy, students are expected to make use of such technologies to design data solutions that can process and analyze real-world datasets for a variety of scientific, social, and environmental challenges.
The core topics addressed by the course will be:
- Programming with Data
- Data Mining, Munging, Wrangling
- Statistics, Analytics, Representation, Visualization
- Introduction to Applied Machine-Learning
CSC 339 (Programming Languages) OR Programming experience (Instructor Permission Required)
There is no required text for the course. Relevant articles and publicly available books will be available for download, as will the class slides.
This course is highly interactive and follows a problem-based learning philosophy; students are expected to use the technologies covered in class to design highly scalable systems that can process and analyze real-world datasets for a variety of scientific, social, and environmental challenges.
Introduction to Data Science: (Week 1)
- Class Syllabus and Introduction
- Class Project discussion and assignment
Startup Tools and Programming (Weeks 2-3)
- Programming
- Re/Introduction to Python
- IPython, IPython-Notebook
- Data Science Reproducibility
- Setting up your Repository – Data, Code, and Documentation
- Using Version Control with Git
- Final Project Discussions - Goals and Requirements
Data Munging, Wrangling, Cleaning (Weeks 4-5)
- Data Structures for Data Science
- Data Manipulation
- Selection - Indexing
- Handling Missing Data
- Aggregation
- Descriptive Statistics
- Merging / Join
- Working with Date-Time
- Project Review - Stage I
Data and Statistics (Weeks 6-9)
- Distributions
- Point Estimates
- Statistical Hypothesis Testing
- Correlation
- Distribution Estimators
- MoM, MLE, KDE
- Project Review - Stage II
Introduction to Applied Data Modeling: (Weeks 10-12)
- Applied Machine Learning
- Regression and Feature Selection
- Bias versus Variance
- Clustering and Dimensionality Reduction
- Validation and Model Performance
- Project Review - Stage III
Data Visualization (Weeks 13-14)
- Graph Generation
- Types of Graphs
- Customizing Plots
- Visualizing Errors
- Interactive / Dynamic Graphs
- Visualization Best Practices
- Project Review - Stage IV
- Project Presentations: (Week 15 – Finals Week)
Class slides and IPython notebooks will be available here.
Grade | Max% | to | Min% |
---|---|---|---|
A | 100% | to | 94% |
A- | < 94% | to | 90% |
B+ | < 90% | to | 87% |
B | < 87% | to | 84% |
B- | < 84% | to | 80% |
C+ | < 80% | to | 77% |
C | < 77% | to | 74% |
C- | < 74% | to | 70% |
D+ | < 70% | to | 67% |
D | < 67% | to | 64% |
D- | < 64% | to | 60% |
F | < 60% | to | 0% |
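The scale above can be read as a set of inclusive lower bounds. The following is a minimal sketch of that lookup, not an official grading tool. The function name `letter_grade` and the choice to apply no rounding before the lookup are assumptions made for illustration.

```python
# Inclusive lower bound of each letter grade, taken from the table above.
GRADE_FLOORS = [
    (94, "A"), (90, "A-"), (87, "B+"), (84, "B"), (80, "B-"),
    (77, "C+"), (74, "C"), (70, "C-"), (67, "D+"), (64, "D"), (60, "D-"),
]


def letter_grade(percent):
    """Return the letter grade for a course percentage in [0, 100]."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Assumption: the percentage is used as-is, with no rounding.
    for floor, letter in GRADE_FLOORS:
        if percent >= floor:
            return letter
    return "F"  # anything below 60%


print(letter_grade(94))    # A
print(letter_grade(93.9))  # A-
print(letter_grade(59.9))  # F
```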
Class / Homework Assignments (4): 30%
Four programming-based in-class homework assignments will be given, covering the use of the tools learned in class. Absolutely no collaboration on assignments. Students must upload their individual assignments (as notebooks) to GitHub. Listed below are the homework assignments for the class:
- TBD
- TBD
- TBD
- TBD
Final Project: 70%
The final project of the class will focus on the end-to-end development of an analytical model. The project will be split into four stages:
- Stage I Data/Project Understanding,
- Stage II Modeling,
- Stage III Basic Machine Learning, and
- Stage IV Visualization and Dashboard.
This will be a team-based effort: in the first week of the course, the students split into teams of 4-5. After completing each stage, each team must give a short presentation (3-5 minutes) and submit a one-page report on its progress. The projects will be open-source, and teams must use GitHub as their code repository. Upon completion of the project, each team will present its software and results in a 20-minute presentation.
- Each Stage of the Final Project has 17.5 points. They will be equally weighted for the project final score.
- Each stage has deliverables of:
- Report
- Code (Jupyter/IPython notebooks)
- Presentation
- To get the full points in each stage you need to finish all of the deliverables.
- Graduate Students Only: For Stage IV, 80 percent of your points come from your project and 20 percent from the project report. The report must be at least 5 pages for a single author, 8 for 2 authors, and 12 for 3 authors (figures and references included). Template:. Example: (Due: 04/20/2022)
Total: 100%
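As a rough illustration of how the weights above add up, here is a minimal sketch that combines fractional scores into a course percentage. The function names, the use of scores between 0.0 and 1.0, and the equal weighting of the four assignments are assumptions; the syllabus only states the 30% and 70% totals, the 17.5 points per stage, and the 80/20 split for graduate students in Stage IV.

```python
ASSIGNMENT_WEIGHT = 30.0  # all four assignments together: 30% of the grade
STAGE_POINTS = 17.5       # each of the four project stages: 17.5 points


def graduate_stage_iv(project_score, report_score):
    """Stage IV score for graduate students: 80% project, 20% report."""
    return 0.8 * project_score + 0.2 * report_score


def course_total(assignment_scores, stage_scores):
    """Combine fractional scores (0.0 to 1.0) into a course percentage."""
    if len(assignment_scores) != 4 or len(stage_scores) != 4:
        raise ValueError("expected four assignment scores and four stage scores")
    # Assumption: the four assignments are weighted equally.
    assignments = ASSIGNMENT_WEIGHT * sum(assignment_scores) / 4
    project = STAGE_POINTS * sum(stage_scores)
    return assignments + project


# Full marks everywhere gives the full 100%.
print(course_total([1, 1, 1, 1], [1, 1, 1, 1]))      # 100.0
# Half credit on two assignments and on Stage IV.
print(course_total([1, 1, 0.5, 0.5], [1, 1, 1, 0.5]))  # 83.75
```

For a graduate student, the value passed as the fourth stage score would be the result of `graduate_stage_iv(...)`.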
Note: All deadlines are at 11:59 PM.
Category | Sub-Category | Deadline |
---|---|---|
Assignment | GitHub Setup | 01/19/2022 |
Assignment | Assignment 1 | 01/30/2022 |
Assignment | Assignment 2 | 02/13/2022 |
Assignment | Assignment 3 | 03/06/2022 |
Assignment | Assignment 4 | 04/17/2022 |
Project | Groups Formed | 01/19/2022 |
Project | Stage I | 02/20/2022 |
Project | Stage II | 03/16/2022 |
Project | Stage III | 04/06/2022 |
Project | Stage IV | 04/27/2022 |
GitHub Setup:
- Create a private GitHub repository (under your own account) named CSC-405-605_Spring-2022_Assignment.
- Give me and our TA access to the repository:
- My username: ikegwukc
- Our TA is: (TBD)
- Create a folder within the repository /Assignment_1
- Create two sub-folders /src and /data
- Work on your assignment (under /src)
- IPython notebooks only (.ipynb); plain Python scripts (.py) will not be accepted.
- Comment your code appropriately in Markdown.
- Enter the link to your assignment solution in the assignment text entry (on Canvas) once you are done with your solution.
- Your notebook should contain the output of your cells. If no output is rendered, we will not grade it.
- No collaboration at all on assignments.
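The folder layout described above can also be created with a short script instead of by hand. This is a minimal sketch, assuming it is run against a local clone of your assignment repository. The function name `scaffold_assignment` is hypothetical, and the `.gitkeep` placeholder files are a common convention (not a Git feature) used here because Git does not track empty directories.

```python
from pathlib import Path


def scaffold_assignment(repo_root, number):
    """Create /Assignment_<number> with /src and /data sub-folders."""
    assignment = Path(repo_root) / f"Assignment_{number}"
    for sub in ("src", "data"):
        folder = assignment / sub
        # exist_ok=True makes the script safe to run more than once.
        folder.mkdir(parents=True, exist_ok=True)
        # Placeholder so the otherwise empty folder can be committed.
        (folder / ".gitkeep").touch()
    return assignment
```

Calling `scaffold_assignment(".", 1)` from the root of the clone would create `Assignment_1/src` and `Assignment_1/data`; the notebook itself still goes under `/src`.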
Project:
- Your code and documentation will reside in a project repository.
- The structure of the repository should be maintained as follows:
  - /src - code and notebooks
    - /team
      - /stage_X
    - /member
      - /{member_name}
        - /stage_X
  - /data - data folder for the repository
    - /stage_X
  - /utility - utility or scripts
  - /doc - documentation - project reports and presentations
  - README.md - Description of project, deliverables, team members (see Stage I for details)
  - All src files (notebooks) should use relative paths.
- Each project has separate deliverables; notebooks must be pushed to the repository for grading. We will grade the state of the repository at the time of the deadline.
- Each team makes a recorded presentation of their project stage and uploads it to Canvas. Top presentations will be discussed in class.
- No collaboration on member tasks.
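One way to satisfy the relative-path requirement is to build the path to the data folder from the notebook's location inside the repository. The sketch below is only an illustration: the helper name `stage_data_dir` is hypothetical, and it assumes that `stage_X` folders are named `stage_1`, `stage_2`, and so on, which the syllabus does not specify.

```python
from pathlib import Path


def stage_data_dir(notebook_dir, stage):
    """Relative path from a notebook folder to /data/stage_<stage>.

    notebook_dir is the notebook's folder relative to the repository
    root, e.g. "src/team/stage_1" (assumed naming).
    """
    # One ".." for every folder level between the notebook and the root.
    ups = [".."] * len(Path(notebook_dir).parts)
    return Path(*ups, "data", f"stage_{stage}")


print(stage_data_dir("src/team/stage_1", 1).as_posix())
# ../../../data/stage_1
print(stage_data_dir("src/member/alice/stage_2", 2).as_posix())
# ../../../../data/stage_2
```

Because the result never contains an absolute prefix such as a home directory, the same notebook should run unchanged on any teammate's clone, provided it is launched from its own folder.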
Discord channel for class discussions and team creation: https://discord.gg/N2rhnw7h3J. Use the channel for general questions about assignments and projects, and to find answers to questions that have already been answered. If a question has already been answered in the channel, I will not respond to emails about it. Email is a one-to-one conversation that takes a lot of time, so the channel exists to broadcast information and encourage community-oriented discussion. Do not share code or screenshots of code in the channel. Email should be the last resort and can be used for student-specific questions.
- You will be reviewed on the following criteria:
- Criterion 1 (C1): Organize/Create information/slides in a manner appropriate for the intended audience
- Criterion 2 (C2): Deliver information in a manner appropriate for the intended audience
- Criterion 3 (C3): Relate to the intended audience
- For each criterion, scoring is based on the following scale (higher is better):
- 4 - Exceeds Criteria: Excellent organization; information is well organized. Clear introduction; main points well stated and argued, with smooth transition to next point. Clear summary and conclusion.
- 3 - Meets Criteria: Satisfactory organization; clear introduction; main points are well stated; some transitions are somewhat sudden. Clear conclusion.
- 2 - Progressing to Criteria: Information is somewhat organized. Audience may have difficulty following presentation in areas.
- 1 - Below Expectations: Presentation is unorganized. Introduction unclear. Audience has difficulty following presentation. Presentation contains abrupt jumps; some of the main points and conclusion are unclear.
Note: Teams, along with team repositories, will be listed here.
The instructor will deal strictly with any violations of academic honesty and integrity in this course. See the Academic Integrity Policy (Link) for more details. Absolutely no discussion, collaboration, copying, or sharing on assignments. This includes copying from the internet. Any student who violates this policy will receive an “F” in the course. The instructor will report the case to the university.
Attendance is required for all class meetings. If you will be absent from any class, it is your responsibility to catch up on class materials.
Students with disabilities should have documentation from the Office of Accessibility Resources & Services (Link). This documentation should be provided to the instructor for review. In the case of major provisions such as a separate testing environment or test-readers, the student must make arrangements with the Office of Accessibility Resources & Services so that suitable accommodations can be provided.
As we return for spring 2022, all students, faculty, and staff are required to uphold UNCG’s culture of care by actively engaging in behaviors that limit the spread of COVID-19. These actions include, but are not limited to:
- Following face-covering guidelines
- Engaging in proper hand-washing hygiene
- Self-monitoring for symptoms of COVID-19
- Staying home when ill
- Complying with directions from health care providers or public health officials to quarantine or isolate if ill or exposed to someone who is ill
- Completing a self-report when experiencing COVID-19 symptoms, testing positive for COVID-19, or being identified as a close contact of someone who has tested positive
- Staying informed about the University's policies and announcements via the COVID-19 website
Instructors will have seating charts for their classes. These are important for facilitating contact tracing should there be a confirmed case of COVID-19. Students must sit in their assigned seats at every class meeting. Students may move their chairs in class to facilitate group work, as long as instructors keep seating chart records. Students should not eat or drink during class time.
A limited number of disposable masks will be available in classrooms for students who have forgotten theirs. Face coverings are also available for purchase in the UNCG Campus Bookstore. Students who do not follow masking requirements will be asked to put on a face covering or leave the classroom to retrieve one and only return when they follow the basic standards of safety and care for the UNCG community. Once students have a face covering, they are permitted to re-enter a class already in progress. Repeated issues may result in conduct action. The course policies regarding attendance and academics remain in effect for partial or full absence from class due to lack of adherence with face covering and other requirements.
For instances where the Office of Accessibility Resources and Services (OARS) has granted accommodations regarding wearing face coverings, students should contact their instructors to develop appropriate alternatives to class participation and/or activities as needed. Instructors or the student may also contact OARS (336.334.5440), which, in consultation with Student Health Services, will review requests for accommodations.
Will update as needed with useful links.
- How To Enhance Jupyter Notebooks for Data Science?
- 28 Jupyter Notebook Tips, Tricks, and Shortcuts
- Optimizing Jupyter Notebook: Tips, Tricks, and nbextensions
- Jupyter Notebook Shortcuts, Tips, and Tricks —Top nbextensions—Bring Order to your Notebooks
- A Gentle Introduction to Exploratory Data Analysis
- 7 Steps to Mastering Data Preparation with Python
- Speed Up Your Exploratory Data Analysis With Pandas-Profiling
- Exploratory Data Analysis (EDA) and Data Visualization with Python
- Machine Learning with Kaggle: Feature Engineering
- Data cleaning and feature engineering in Python
- Feature Engineering Data Science Handbook
- A Hands-On Guide to Automated Feature Engineering using Featuretools in Python
- Feature Engineering Cookbook for Machine Learning
- Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages
- Working with Missing Data in Pandas
- Data Cleaning with Python and Pandas: Detecting Missing Values
- How to Handle Missing Data with Python
- Why and How to Use Pandas with Large Data
- Using Pandas with Large Data Sets in Python
- Optimizing the size of a pandas dataframe for low memory environment
- Making DataFrame Smaller and Faster
- Reducing DataFrame memory size by ~65%
- Dask: Scalable analytics in Python
# | Group Number | Name |
---|---|---|
0 | 1 | Venkata sai phani raj kondapalli |
1 | 1 | Jaya Krishna mundru |
2 | 1 | Akhilesh Pathi |
3 | 1 | Harinath Sirigiri |
4 | 1 | Sardar Karan Singh |
5 | 2 | Kavya Manne |
6 | 2 | Rajitha Panchumarthi |
7 | 2 | Ramya panchumarthi |
8 | 2 | Soujanya Vemireddy |
9 | 3 | Suqoya Rhodes |
10 | 3 | Dillon Halbert |
11 | 3 | Japp Galang |
12 | 3 | Hayes, Priscilla M. |
13 | 3 | Zhu, Pengxu |
14 | 4 | Gunakar Reddy Panyala |
15 | 4 | Karthik Reddy Kanduri |
16 | 4 | Balram Krishna Kantipudi |
17 | 4 | Chakradhar Reddy Parne |
18 | 4 | Lakshmi Gayathri Kurri |
19 | 5 | Vishnu Vardhan Vankayalapati |
20 | 5 | Satya Sai Srimannarayana Sarma Bolloju |
21 | 5 | Rahul Sathya Gunti |
22 | 5 | Mahesh Krishna Reddy Vanga |
23 | 5 | Rahul Boga |
24 | 6 | Sri Lakshmi Jahnavi Mandalapu |
25 | 6 | Sai Manideep Chittiprolu |
26 | 6 | Sowmya Tella |
27 | 6 | Tejasai |
28 | 6 | Sai Venkatesh |
29 | 7 | PRANEETH ALURU |
30 | 7 | Akash Suresh |
31 | 7 | Nikhil Bolisetty |
32 | 7 | Apoorva Gnana Saraswathi Tangirala |
33 | 7 | Cheedu Venkat Narayan Reddy |
34 | 8 | Jagamoni, Sravya |
35 | 8 | Vijay Bodapati |
36 | 8 | Vineeth Reddy |
37 | 8 | Vadapally Ramyasree |
38 | 8 | Sai Nikhil kakkireni |