This repository holds the course materials for the Fall 2017 edition of Statistics 154: Modern Statistical Prediction and Machine Learning at UC Berkeley.
- Instructor: Gaston Sanchez, gaston.stat [at] gmail.com
- Class Time: MWF 1-2pm in 3108 Etcheverry
- Session Dates: 08/23/17 - 12/08/17
- Code #: 20978
- Units: 4
- Office Hours: MW 2:10-3:00pm in 309 Evans (or by appointment)
- Final: TBA
- GSI: Johnny Hong. Office hours: Tu 9-11am and Th 1-3pm in 428 Evans.
Lab | Day & Time | Room | GSI |
---|---|---|---|
101 | M 9am-11am | 330 Evans | Johnny Hong |
102 | M 11am-1pm | 330 Evans | Johnny Hong |
This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes linear regression, model assessment, model selection, regularization methods (PCR, PLSR, ridge, and lasso); logistic regression and discriminant analysis; cross-validation and the bootstrap; tree-based methods, random forests, and boosting; and support vector machines. Some unsupervised learning methods are also discussed: principal components and clustering (k-means and hierarchical).
In this course, we will explore the predictive modeling lifecycle, including question formulation, data preprocessing, exploratory data analysis and visualization, model building, model assessment/validation, model selection, and decision-making.
We will focus on quantitative critical thinking and the key principles needed to carry out this cycle: 1) foundational principles for building predictive models; 2) intuitive explanations of many commonly used predictive modeling techniques for both classification and regression problems; 3) principles and steps for validating a predictive model; and 4) writing and using computer code to perform the necessary foundational work to build and validate predictive models.
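To make the cycle concrete, here is a minimal sketch in R of the data-spending, model-building, and assessment steps. It uses the built-in `mtcars` data purely for illustration; the variables and the 70/30 split are arbitrary choices, not part of the course materials.

```r
set.seed(123)

# data spending: hold out roughly 30% of the rows as a test set
test_idx <- sample(nrow(mtcars), size = round(0.3 * nrow(mtcars)))
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# model building: a simple linear regression of mpg on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = train)

# model assessment: root mean squared error on the held-out test set
preds <- predict(fit, newdata = test)
sqrt(mean((test$mpg - preds)^2))
```

In practice you would compare several candidate models on the same held-out data (or via resampling) before selecting one; that model-selection step is one of the topics listed below.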
The course focuses on predictive models, and it covers the following topics (not necessarily in the order listed):
- Process of predictive model building
- Data Preprocessing
- Regression Models
    - Linear models
    - Non-linear models (time permitting)
    - Tree-based methods
- Classification Models
    - Linear models
    - Non-linear models
    - Tree-based methods
    - Support Vector Machines (time permitting)
- Unsupervised methods like PCA and Clustering
- Data spending: splitting and resampling methods
- Model Evaluation
- Model Selection
- Multivariate calculus or the equivalent, esp. partial derivatives; e.g. Math 53
- Linear algebra or the equivalent (matrices, vector spaces); e.g. Math 54
- Statistical inference or the equivalent; e.g. Stat 135
- Scripting experience in R (required); e.g. Stat 133
This course builds heavily on matrix algebra. In particular, you should be comfortable with notions such as vector spaces, inner products, norms, matrix products, transposes, ranks, determinants, inverses, and matrix decompositions.
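As a rough self-check, the following sketch exercises several of those notions in base R (the particular matrix is arbitrary, chosen only for illustration):

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a 2x2 symmetric matrix
b <- c(1, 2)

t(A)          # transpose
A %*% A       # matrix product
det(A)        # determinant
solve(A)      # inverse
solve(A, b)   # solve the linear system A x = b
qr(A)$rank    # rank, via the QR decomposition
svd(A)        # singular value decomposition
```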
You should also have some scripting experience---preferably in R---at the level of writing functions, conditionals (if-then-else structures), for loops, and while loops, as well as sampling, reading in data sets, and exporting results.
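For calibration, here is a short sketch at roughly that level; the coin-flip function and the file names are hypothetical, not course material:

```r
# a function with a conditional and a for loop: count heads in n coin flips
count_heads <- function(n_flips) {
  flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)
  total <- 0
  for (flip in flips) {
    if (flip == "H") {
      total <- total + 1
    }
  }
  total
}

count_heads(100)

# reading in a data set and exporting results (hypothetical file names)
# dat <- read.csv("my-data.csv")
# write.csv(dat, file = "results.csv", row.names = FALSE)
```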
Last but not least, it is helpful to know the basics of Rmd files and to have some familiarity with LaTeX, especially experience writing math symbols and equations.
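For instance, you may need to typeset an expression such as the least squares estimator, a standard formula shown here only to illustrate the expected level of LaTeX:

```latex
$$
\hat{\beta} = (X^\top X)^{-1} X^\top y
$$
```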
The primary text is An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani. Springer, 2013. It is freely available online in pdf format (courtesy of the authors) at http://www-bcf.usc.edu/~gareth/ISL/.
As companion material, especially for the labs, R code and projects, we will also be using Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Springer, 2013.
Other good (optional) references for the course are:
- The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Springer, 2009 (2nd Ed). This book is more mathematically and conceptually advanced than ISL. It is freely available online in pdf format (courtesy of the authors) at https://statweb.stanford.edu/~tibs/ElemStatLearn/. This text will not be used directly for this course and is simply a reference for more theoretical details.
- Data Mining and Statistics for Decision Making by Stephane Tuffery. Wiley, 2011. This book should be available in electronic format via the UCB Library Catalog. If the course slides are not self-explanatory enough, you can supplement them with this little-known yet excellent resource.
- Statistical Learning from a Regression Perspective by Richard Berk. Springer, 2008. You can find this book in electronic format via the UCB Library Catalog. This text will not be used directly for this course and is simply a reference for more theoretical details.
We expect that by the end of the course you will:
- Have a basic yet solid understanding of the predictive modeling process/lifecycle.
- Be able to read a well-described algorithm and write R code to implement it.
- Know the pros and cons of each predictive technique covered in the course.
- Be able to describe to non-professionals what a predictive technique is doing.
- We will be using a combination of materials such as slides, tutorials, reading assignments, and chalk-and-talk.
- The main computational tool will be the computing and programming environment R.
- The main workbench will be the IDE RStudio. You will also use a terminal emulator to work with the command line.
- Please read the course logistics and policies for more details about the structure of the course, DOs and DON'Ts, etc.
Unless otherwise noted, this work, by Gaston Sanchez, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Author: Gaston Sanchez