forked from rafalab/dsbook
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathknn.Rmd
76 lines (55 loc) · 2.46 KB
/
knn.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
## k-nearest neighbors
```{r, echo=FALSE, warning=FALSE, message=FALSE}
set.seed(2008)
library(tidyverse)
library(dslabs)
data("mnist_27")
# We use this function to plot the estimated conditional probabilities
plot_cond_prob <- function(p_hat=NULL){
tmp <- mnist_27$true_p
if(!is.null(p_hat)){
tmp <- mutate(tmp, p=p_hat)
}
tmp %>% ggplot(aes(x_1, x_2, z=p, fill=p)) +
geom_raster(show.legend = FALSE) +
scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
stat_contour(breaks=c(0.5),color="black")
}
```
We introduced the kNN algorithm in Section \@ref(knn-cv-intro) and demonstrated how we use cross validation to pick $k$ in Section \@ref(caret-cv). Here we quickly review how we fit a kNN model using the __caret__ package. In Section \@ref(caret-cv) we introduced the following code to fit a kNN model:
```{r}
train_knn <- train(y ~ ., method = "knn",
data = mnist_27$train,
tuneGrid = data.frame(k = seq(9, 71, 2)))
```
We saw that the parameter that maximized the estimated accuracy was:
```{r}
train_knn$bestTune
```
This model improves the accuracy over regression and logistic regression:
```{r}
confusionMatrix(predict(train_knn, mnist_27$test, type = "raw"),
mnist_27$test$y)$overall["Accuracy"]
```
A plot of the estimated conditional probability shows that the kNN estimate is flexible enough and does indeed capture the shape of the true conditional probability.
```{r best-knn-fit, echo=FALSE, out.width="100%"}
p1 <- plot_cond_prob() + ggtitle("True conditional probability")
p2 <- plot_cond_prob(predict(train_knn, newdata = mnist_27$true_p, type = "prob")[,2]) +
ggtitle("kNN")
grid.arrange(p2, p1, nrow=1)
```
## Exercises
1\. Earlier we used logistic regression to predict sex from height. Use kNN to do the same. Use the code described in this chapter to select the $F_1$ measure and plot it against $k$. Compare to the $F_1$ of about 0.6 we obtained with regression.
2\. Load the following dataset:
```{r, eval=FALSE}
data("tissue_gene_expression")
```
This dataset includes a matrix `x`:
```{r, eval=FALSE}
dim(tissue_gene_expression$x)
```
with the gene expression measured on 500 genes for 189 biological samples representing seven different tissues. The tissue type is stored in `y`:
```{r, eval=FALSE}
table(tissue_gene_expression$y)
```
Split the data in training and test sets, then use kNN to predict tissue type and see what accuracy you obtain. Try it for $k = 1, 3, \dots, 11$.