---
title: "4. Classification labs"
author: "Russ Conte"
date: "8/7/2021"
output: html_document
---

## 4.6 Lab: Logistic Regression, LDA, QDA, and KNN

## 4.6.1 The Stock Market Data Set

From the text:
This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question), and Direction (whether the market was Up or Down on this date).

```{r the Stock Market data set}
library(ISLR)
attach(Smarket)
summary(Smarket)
```

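A quick structural look at the data (a small addition, not part of the code above) confirms the variables described in the text:

```{r a quick look at the structure}
# Not part of the lab's original code: confirm the nine variables and
# 1,250 rows described in the text.
names(Smarket)
dim(Smarket)
```
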
## The cor() function

From the text:
The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set.

```{r all pairwise correlations}
cor(Smarket[, -9]) # drop column 9 (Direction), which is qualitative
plot(Volume)
```

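A small sketch (not part of the lab's code) to locate the largest off-diagonal correlation programmatically; it turns out to be between Year and Volume:

```{r locating the largest correlation}
# Find the largest off-diagonal entry of the correlation matrix.
cor.mat = cor(Smarket[, -9])
diag(cor.mat) = 0 # zero out the trivial self-correlations
which(abs(cor.mat) == max(abs(cor.mat)), arr.ind = TRUE) # Year and Volume
max(abs(cor.mat))
```
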
## 4.6.2 Logistic Regression

```{r glm using logistic regression}
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
summary(glm.fit)
```

## Find only the coefficients for the model

```{r}
coef(glm.fit)
```

## Access the coefficient table from the model summary

```{r coefficient table of the glm model}
summary(glm.fit)$coef # just the coefficients, standard errors, z-values, and p-values
```

## How to use the predict function

```{r using the predict function}
glm.probs = predict(object = glm.fit, type = "response") # predicted probabilities that the market goes up
glm.probs[1:10]
glm.pred = rep("Down", 1250)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction)
mean(glm.pred == Direction) # 0.5216, only slightly better than flipping a fair coin
```

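One point worth checking (a small aside, not in the code above): predict() with type = "response" returns P(Y = 1 | X), so contrasts() confirms which level R has coded as 1.

```{r checking the dummy coding}
# Confirms that R created a dummy variable with Up = 1, so glm.probs above
# are probabilities of the market going up.
contrasts(Direction)
```
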
From the text:
In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held-out data.

```{r setting up training and testing data subsets}
train = Year < 2005 # logical vector: TRUE for observations before 2005
Smarket.2005 = Smarket[!train, ]
dim(Smarket.2005)
Direction.2005 = Direction[!train]
```

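A quick check of the split sizes (a small addition, not part of the code above):

```{r checking the split sizes}
# Sanity check on the split, not in the lab's original code.
sum(train)  # 998 training days (2001-2004)
sum(!train) # 252 held-out days (2005)
```
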
## Now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005

```{r}
glm.fits = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial, subset = train)
glm.probs = predict(glm.fits, Smarket.2005, type = "response")
glm.pred = rep("Down", 252)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005) # 0.4801587 - ouch, worse than random guessing!
```

From the text:
Perhaps by removing the variables that appear not to be helpful in predicting Direction, we can obtain a more effective model.

```{r only using the most helpful variables}
glm.fits = glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial, subset = train)
glm.probs = predict(glm.fits, newdata = Smarket.2005, type = "response") # predict with the refitted model, not the earlier six-predictor one
glm.pred = rep("Down", 252)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005) # 0.5595238 - 56% of daily movements predicted correctly
```

## 4.6.3 Linear Discriminant Analysis

From the text:
Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library.

```{r}
library(MASS)
lda.fit = lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit
```

From the text:
The LDA output indicates that $\hat{\pi}_1 = 0.492$ and $\hat{\pi}_2 = 0.508$; in other words, 49.2% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of $\mu_k$.

The predict() function returns a list with three elements. The first element, class, contains LDA's predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from (4.10). Finally, x contains the linear discriminants, described earlier.

```{r LDA predict example}
lda.pred = predict(lda.fit, Smarket.2005)
lda.class = lda.pred$class
table(lda.class, Direction.2005)
mean(lda.class == Direction.2005) # 0.5595238 - the same performance as the two-predictor logistic regression
```

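To illustrate the three list elements described above, a small addition not in the lab's code as written here:

```{r inspecting the predict output}
# The three elements described in the text.
names(lda.pred) # "class" "posterior" "x"
# Column 1 of the posterior matrix corresponds to the Down class; applying a
# 50% threshold to it reproduces the predictions in lda.pred$class.
sum(lda.pred$posterior[, 1] >= 0.5) # 70 days classified as Down
sum(lda.pred$posterior[, 1] < 0.5)  # 182 days classified as Up
```
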
## 4.6.4 Quadratic Discriminant Analysis

From the text:
We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library. The syntax is identical to that of lda().

```{r}
qda.fit = qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
```

From the text:
The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.

```{r}
qda.class = predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)
mean(qda.class == Direction.2005) # 0.5992063 - QDA predicts almost 60% of the 2005 returns correctly
```

## 4.6.5 K-Nearest Neighbors

From the text:
We will now perform KNN using the knn() function, which is part of the class library. The function requires four inputs.

1. A matrix containing the predictors associated with the training data, labeled train.X below.
2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
3. A vector containing the class labels for the training observations, labeled train.Direction below.
4. A value for K, the number of nearest neighbors to be used by the classifier.

We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.

```{r KNN}
library(class)
train.X = cbind(Lag1, Lag2)[train, ]
test.X = cbind(Lag1, Lag2)[!train, ]
train.Direction = Direction[train]
set.seed(1) # knn() breaks ties between neighbors at random, so set a seed for reproducibility
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2005)
(43 + 83) / (43 + 83 + 68 + 58) # exactly 0.5 - no better than random guessing!
```

## Use KNN with the Caravan Insurance Data

```{r}
dim(Caravan)
attach(Caravan)
summary(Purchase) # only 348 of 5,822 customers (about 6%) purchased insurance
```

Because the KNN classifier identifies the observations nearest to a given test observation, the scale of the variables matters: variables on a large scale have a much bigger effect on the distance between observations than variables on a small scale.

From the text:
A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The scale() function does just this. In standardizing the data, we exclude column 86, because that is the qualitative Purchase variable.

```{r standardizing the data}
standardized.X = scale(Caravan[, -86])
var(Caravan[, 1])
var(Caravan[, 2])
var(standardized.X[, 1])
var(standardized.X[, 2])
# Now every column of standardized.X has a standard deviation of one and a mean of zero.
```

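A quick check of the claim about the means (a small addition, not part of the lab's code):

```{r checking the means}
# The columns of scale() output have mean zero up to floating-point error.
round(colMeans(standardized.X)[1:2], digits = 10)
```
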
From the text:
We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using K = 1, and evaluate its performance on the test data.

```{r}
test = 1:1000
train.X = standardized.X[-test, ]
test.X = standardized.X[test, ]
train.Y = Purchase[-test]
test.Y = Purchase[test]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != "No") # 0.059 - only about 6% of the test customers actually purchased insurance
```

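The overall error rate is worth comparing with that 6% base rate (a small addition; the value below is the one reported in the text):

```{r KNN test error rate}
# The KNN error rate on the 1,000 test observations. Since only 6% of
# customers buy insurance, always predicting "No" would do better on raw
# error - which is why the focus below is on the predicted-buyer success rate.
mean(test.Y != knn.pred) # 0.118
```
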
Using K = 3, the success rate among predicted buyers increases to 19.2%, and with K = 5 the rate is 26.7%. This is over four times the rate that results from random guessing. It appears that KNN is finding some real patterns in a difficult data set!

```{r}
table(knn.pred, test.Y)
9 / (68 + 9) # 11.7% of those predicted to buy actually do - much better than the 6% base rate
knn.pred = knn(train.X, test.X, train.Y, k = 3)
table(knn.pred, test.Y)
5 / (21 + 5) # 19.2% - even better than before!
knn.pred = knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, test.Y)
4 / (11 + 4) # 26.7% - very nice compared to before!
```

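A compact way to recompute those success rates (a sketch, not part of the lab's code; exact values can shift slightly because knn() breaks ties at random):

```{r success rate for several values of k}
# Recompute the success rate among predicted buyers for k = 1, 3, and 5.
# The numbers should be close to the 11.7%, 19.2%, and 26.7% seen above.
set.seed(1)
for (k in c(1, 3, 5)) {
  pred = knn(train.X, test.X, train.Y, k = k)
  tab = table(pred, test.Y)
  cat("k =", k, "success rate:", round(tab["Yes", "Yes"] / sum(tab["Yes", ]), 3), "\n")
}
```
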
## Use logistic regression to predict who will buy insurance

```{r}
glm.fits = glm(Purchase ~ ., data = Caravan, family = binomial, subset = -test)
glm.probs = predict(glm.fits, Caravan[test, ], type = "response")
glm.pred = rep("No", 1000)
glm.pred[glm.probs > 0.25] = "Yes" # use a 25% probability cut-off instead of the default 50%
table(glm.pred, test.Y)
11 / (11 + 22) # 33% success among predicted buyers - over five times better than random guessing!
```

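For contrast (discussed in the text), the default 0.5 cut-off predicts very few buyers, and none of those predictions are correct:

```{r default cut-off for comparison}
# With the 0.5 cut-off only seven test customers are predicted to buy,
# and all seven of those predictions are wrong.
glm.pred = rep("No", 1000)
glm.pred[glm.probs > 0.5] = "Yes"
table(glm.pred, test.Y)
```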