---
title: "ISLR Chapter 7 - conceptual"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(ggplot2)
```
## Exercise 1
**It was mentioned in the chapter that a cubic regression spline with one knot at $\xi$ can be obtained using a basis of the form $x$, $x^2$, $x^3$, $(x-\xi)^3_+$, where $(x-\xi)^3_+=(x-\xi)^3$ if $x>\xi$ and equals $0$ otherwise. We will now show that a function of the form
$$
f(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4(x-\xi)^3_+
$$
is indeed a cubic regression spline, regardless of the values of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$.**
**(a) Find a cubic polynomial
$$
f_1(x)=a_1+b_1x+c_1x^2+d_1x^3
$$
such that $f(x)=f_1(x)$ for all $x\leq\xi$. Express $a_1$, $b_1$, $c_1$, $d_1$ in terms of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$.**
Since $(x-\xi)^3_+=0$ for all $x\leq\xi$, a cubic polynomial that satisfies these requirements is
$$
f_1(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3
$$
so that $a_1=\beta_0$, $b_1=\beta_1$, $c_1=\beta_2$ and $d_1=\beta_3$.
**(b) Find a cubic polynomial
$$
f_2(x)=a_2+b_2x+c_2x^2+d_2x^3
$$
such that $f(x)=f_2(x)$ for all $x>\xi$. Express $a_2$, $b_2$, $c_2$, $d_2$ in terms of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$.**
For $x>\xi$ we have $(x-\xi)^3_+=(x-\xi)^3$, so we can expand $f(x)$ into a cubic polynomial that satisfies these requirements:
$$
f_2(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4(x-\xi)^3\\
=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4(x-\xi)(x^2-2x\xi+\xi^2)\\
=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4(x^3-3x^2\xi+3x\xi^2-\xi^3)\\
=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4x^3-3\beta_4x^2\xi+3\beta_4x\xi^2-\beta_4\xi^3\\
=(\beta_0-\beta_4\xi^3)+(\beta_1+3\beta_4\xi^2)x+(\beta_2-3\beta_4\xi)x^2+(\beta_3+\beta_4)x^3\\
$$
Therefore we can express $a_2$, $b_2$, $c_2$, $d_2$ in terms of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$ as follows:
$$
a_2=\beta_0-\beta_4\xi^3\\
b_2=\beta_1+3\beta_4\xi^2\\
c_2=\beta_2-3\beta_4\xi\\
d_2=\beta_3+\beta_4\\
$$
**(c) Show that $f_1(\xi)=f_2(\xi)$. That is, $f(x)$ is continuous at $\xi$.**
We can show that $f_2(\xi)$ is equal to $f_1(\xi)$:
$$
f_2(\xi)=(\beta_0-\beta_4\xi^3)+(\beta_1+3\beta_4\xi^2)\xi+(\beta_2-3\beta_4\xi)\xi^2+(\beta_3+\beta_4)\xi^3\\
=\beta_0-\beta_4\xi^3+\beta_1\xi+3\beta_4\xi^3+\beta_2\xi^2-3\beta_4\xi^3+\beta_3\xi^3+\beta_4\xi^3\\
=\beta_0+\beta_1\xi+\beta_2\xi^2+\beta_3\xi^3\\
=f_1(\xi)
$$
**(d) Show that $f'_1(\xi)=f'_2(\xi)$. That is, $f'(x)$ is continuous at $\xi$.**
First, differentiate both $f_1(x)$ and $f_2(x)$:
$$
f'_1(x)=\beta_1+2\beta_2x+3\beta_3x^2\\
f'_2(x)=\beta_1+3\beta_4\xi^2+2x(\beta_2-3\beta_4\xi)+3x^2(\beta_3+\beta_4)
$$
We can then show that $f'_2(\xi)$ is equal to $f'_1(\xi)$:
$$
f'_2(\xi)=\beta_1+3\beta_4\xi^2+2\xi(\beta_2-3\beta_4\xi)+3\xi^2(\beta_3+\beta_4)\\
=\beta_1+3\beta_4\xi^2+2\beta_2\xi-6\beta_4\xi^2+3\beta_3\xi^2+3\beta_4\xi^2\\
=\beta_1+2\beta_2\xi+3\beta_3\xi^2\\
=f'_1(\xi)
$$
**(e) Show that $f''_1(\xi)=f''_2(\xi)$. That is, $f''(x)$ is continuous at $\xi$.**
The second derivatives of $f_1(x)$ and $f_2(x)$ are
$$
f''_1(x)=2\beta_2+6\beta_3x\\
f''_2(x)=2(\beta_2-3\beta_4\xi)+6x(\beta_3+\beta_4)
$$
We can show that $f''_2(\xi)$ is equal to $f''_1(\xi)$:
$$
f''_2(\xi)=2(\beta_2-3\beta_4\xi)+6\xi(\beta_3+\beta_4)\\
=2\beta_2-6\beta_4\xi+6\beta_3\xi+6\beta_4\xi\\
=2\beta_2+6\beta_3\xi\\
=f''_1(\xi)
$$
**Since $f(x)$ is piecewise cubic and is continuous at $\xi$ with continuous first and second derivatives, it is indeed a cubic spline.**
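As a quick numerical sanity check of the algebra above (a minimal sketch with arbitrarily chosen $\beta$ values and knot $\xi$): $f$ should agree with $f_2$ beyond the knot, and $f_1$ and $f_2$ should match up to the second derivative at $\xi$.
```{r}
# Sanity check with arbitrarily chosen beta values and knot xi
beta <- c(1, -2, 0.5, 3, -1.5)  # beta_0, ..., beta_4 (arbitrary)
xi <- 2
f <- function(x) beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3 +
  beta[5] * pmax(x - xi, 0)^3                 # pmax(., 0)^3 is (x - xi)^3_+
a2 <- beta[1] - beta[5] * xi^3                # coefficients derived in (b)
b2 <- beta[2] + 3 * beta[5] * xi^2
c2 <- beta[3] - 3 * beta[5] * xi
d2 <- beta[4] + beta[5]
f1 <- function(x) beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3
f2 <- function(x) a2 + b2 * x + c2 * x^2 + d2 * x^3
all.equal(f(3), f2(3))                        # (b): f equals f2 for x > xi
c(value  = f1(xi) - f2(xi),                   # (c): continuity at the knot
  first  = (beta[2] + 2 * beta[3] * xi + 3 * beta[4] * xi^2) -
           (b2 + 2 * c2 * xi + 3 * d2 * xi^2),          # (d): f' continuous
  second = (2 * beta[3] + 6 * beta[4] * xi) -
           (2 * c2 + 6 * d2 * xi))                      # (e): f'' continuous
```
All three differences are zero, confirming the derivation.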
## Exercise 2
**Suppose that a curve $\hat{g}$ is computed to smoothly fit a set of $n$ points using the following formula:
$$
\hat{g}=\arg\min_g{\left(\sum\limits_{i=1}^n (y_i-g(x_i))^2+\lambda\int\left[g^{(m)}(x)\right]^2 dx\right)}
$$
where $g^{(m)}$ represents the $m$th derivative of $g$ (and $g^{(0)}=g$). Provide example sketches of $\hat{g}$ in each of the following scenarios.**
**(a) $\lambda=\infty,m=0$
(b) $\lambda=\infty,m=1$
(c) $\lambda=\infty,m=2$
(d) $\lambda=\infty,m=3$
(e) $\lambda=0,m=3$**
The penalty term is $\lambda$ times the integral of the squared $m$th derivative of $g(x)$. Because the integrand is squared, the smallest value the penalty can take is zero. When $\lambda=\infty$ the penalty dominates the loss term, so the criterion is minimised by functions whose $m$th derivative is identically zero; among those, the loss term selects the one closest to the data.
In scenario (a), the penalty is zero only where $g(x)=0$, so $\hat{g}$ is the zero function. In scenario (b), the penalty is zero where $g(x)$ is any constant, because the derivative of a constant is zero; the loss term then selects the sample mean $\bar{y}$. In scenario (c), the penalty is zero where $g(x)$ is linear, because the second derivative of a linear function is zero, giving the least squares line. In scenario (d), $g(x)$ can be quadratic, because the third derivative of a quadratic is zero, giving the least squares quadratic.
```{r}
# Example sketches for scenarios (a)-(d): each facet shows a form of g(x)
# for which the penalty term is zero.
x <- rep(0:10, 4)
g <- c(rep(0, 11),                 # (a) the zero function
       rep(5, 11),                 # (b) a constant
       0:10 + 3,                   # (c) a linear function
       seq(0, 3, by = 0.3)^2 + 3)  # (d) a quadratic
z <- rep(c("a", "b", "c", "d"), each = 11)
dat <- data.frame(x = x, y = g, z = z)
ggplot(dat) +
  geom_line(aes(x = x, y = y)) +
  facet_wrap(vars(z)) +
  theme_minimal() +
  theme(axis.text = element_blank())
```
In scenario (e) the penalty term is multiplied by $\lambda=0$ and so vanishes entirely, leaving only the sum of squared residuals; $\hat{g}$ is therefore a completely unconstrained curve that interpolates the training points exactly, driving the training RSS to zero, as sketched below.
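This is only an illustrative sketch: `splinefun` from the stats package is used here simply to draw *one* curve through every point; any interpolant would achieve zero training RSS.
```{r}
# With lambda = 0 the penalty vanishes, so the RSS is driven to zero by a
# curve passing through every training point exactly.
set.seed(1)
pts <- data.frame(x = 1:8, y = rnorm(8))
interp <- splinefun(pts$x, pts$y)  # an interpolating cubic spline
grid <- data.frame(x = seq(1, 8, by = 0.01))
grid$y <- interp(grid$x)
ggplot() +
  geom_path(data = grid, aes(x = x, y = y)) +
  geom_point(data = pts, aes(x = x, y = y)) +
  theme_minimal()
```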
## Exercise 3
**Suppose we fit a curve with basis functions $b_1(X)=X$, $b_2(X)=(X-1)^2I(X\geq 1)$. (Note that $I(X\geq 1)$ equals 1 for $X\geq1$ and $0$ otherwise.) We fit the linear regression model
$$
Y=\beta_0+\beta_1b_1(X)+\beta_2b_2(X)+\epsilon
$$
and obtain coefficient estimates $\hat{\beta}_0=1$, $\hat{\beta}_1=1$, $\hat{\beta}_2=-2$. Sketch the estimated curve between $X=-2$ and $X=2$. Note the intercepts, slopes and other relevant information.**
For $X<1$ the curve is the straight line $Y=1+X$, with slope $1$, $Y$ intercept $1$ and $X$ intercept $-1$. From $X=1$ onwards the quadratic basis function $b_2$, with its negative coefficient, bends the curve downwards: the slope becomes $1-4(X-1)$, so the curve peaks at $X=1.25$ (where $Y=2.125$) and then falls, reaching $Y=1$ at $X=2$.
```{r}
x <- seq(-2, 2, 0.001)
# y = 1 + x everywhere, plus the truncated quadratic term beyond the knot at x = 1
y <- 1 + x + ifelse(x >= 1, -2 * (x - 1)^2, 0)
dat <- data.frame(x = x, y = y)
ggplot(dat) +
  geom_path(aes(x = x, y = y)) +
  theme_minimal()
```
## Exercise 4
**Suppose we fit a curve with basis functions $b_1(X)=I(0\leq X\leq2)-(X-1)I(1\leq X\leq 2)$, $b_2(X)=(X-3)I(3\leq X\leq4)+I(4<X\leq5)$. We fit the linear regression model
$$
Y=\beta_0+\beta_1b_1(X)+\beta_2b_2(X)+\epsilon
$$
and obtain coefficient estimates $\hat{\beta}_0=1$, $\hat{\beta}_1=1$, $\hat{\beta}_2=3$. Sketch the estimated curve between $X=-2$ and $X=2$. Note the intercepts, slopes and other relevant information.**
Where $X<0$, $Y$ is $1$. Where $0\leq X\leq1$, $Y$ is $2$, because the indicator in the first term of $b_1$ is active. From $X=1$ to $X=2$ the curve is a downward-sloping straight line with slope $-1$ ($Y=3-X$), due to the second term of $b_1$. The basis function $b_2$ is zero everywhere on $[-2,2]$ and so does not affect the sketch.
```{r}
x <- seq(-2, 2, 0.001)
# y = beta_0 + beta_1 * b1(x) + beta_2 * b2(x) with beta = (1, 1, 3)
y <- 1 +
  ifelse(x >= 0 & x <= 2, 1, 0) -            # first term of b1
  ifelse(x >= 1 & x <= 2, x - 1, 0) +        # second term of b1
  3 * (ifelse(x >= 3 & x <= 4, x - 3, 0) +   # b2, which is zero on [-2, 2]
         ifelse(x > 4 & x <= 5, 1, 0))
dat <- data.frame(x = x, y = y)
ggplot(dat) +
  geom_path(aes(x = x, y = y)) +
  scale_y_continuous(labels = function(x) round(x),
                     limits = c(0, 3)) +
  theme_minimal()
```
## Exercise 5
**Consider two curves, $\hat{g}_1$ and $\hat{g}_2$, defined by
$$
\hat{g}_1=\arg\min_g{\left(\sum\limits_{i=1}^n (y_i-g(x_i))^2+\lambda\int\left[g^{(3)}(x)\right]^2 dx\right)},\\
\hat{g}_2=\arg\min_g{\left(\sum\limits_{i=1}^n (y_i-g(x_i))^2+\lambda\int\left[g^{(4)}(x)\right]^2 dx\right)}\\
$$
where $g^{(m)}$ represents the $m$th derivative of $g$.**
**(a) As $\lambda\rightarrow\infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training RSS?
(b) As $\lambda\rightarrow\infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller test RSS?
(c) For $\lambda=0$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training and test RSS?**
As $\lambda\rightarrow\infty$ the penalty dominates, forcing the penalised derivative to be zero everywhere: $\hat{g}_1$ is constrained to be at most a quadratic (since $g^{(3)}=0$), while $\hat{g}_2$ may be any cubic (since $g^{(4)}=0$). Because $\hat{g}_2$ is the less constrained of the two, it will have the smaller training RSS.
We do not have enough information to say with certainty which curve will have the smaller test RSS. $\hat{g}_1$ will often win, because the extra flexibility of $\hat{g}_2$ risks overfitting, but if the true relationship between the predictor and the response is strongly non-linear, the cubic limit of $\hat{g}_2$ may fit better and hence have the lower test RSS.
When $\lambda=0$ the penalty terms vanish, so both criteria reduce to the same residual sum of squares; the two fitted curves are identical and therefore have the same training and test RSS.
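As an illustrative check of (a), here is a minimal simulation on made-up data, using global polynomial fits to stand in for the $\lambda\rightarrow\infty$ limits (a quadratic for $\hat{g}_1$, a cubic for $\hat{g}_2$); since the quadratic is nested within the cubic, the cubic's training RSS can never be larger.
```{r}
# Simulated non-linear data
set.seed(42)
x <- runif(100, -2, 2)
y <- sin(2 * x) + rnorm(100, sd = 0.3)
g1_limit <- lm(y ~ poly(x, 2))  # lambda -> infinity: g^(3) = 0, i.e. quadratic
g2_limit <- lm(y ~ poly(x, 3))  # lambda -> infinity: g^(4) = 0, i.e. cubic
c(g1_train_rss = sum(resid(g1_limit)^2),
  g2_train_rss = sum(resid(g2_limit)^2))
```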