---
title: "ISLR Chapter 7 - conceptual"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(ggplot2)
```
## Exercise 1
**It was mentioned in the chapter that a cubic regression spline with one knot at $\xi$ can be obtained using a basis of the form $x$, $x^2$, $x^3$, $(x-\xi)^3_+$, where $(x-\xi)^3_+=(x-\xi)^3$ if $x>\xi$ and equals $0$ otherwise. We will now show that a function of the form
$$
f(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4(x-\xi)^3_+
$$
is indeed a cubic regression spline, regardless of the values of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$.**
**(a) Find a cubic polynomial
$$
f_1(x)=a_1+b_1x+c_1x^2+d_1x^3
$$
such that $f(x)=f_1(x)$ for all $x\leq\xi$. Express $a_1$, $b_1$, $c_1$, $d_1$ in terms of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$.**
Since $(x-\xi)^3_+=0$ for all $x\leq\xi$, a cubic polynomial that satisfies these requirements is
$$
f_1(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3
$$
so that $a_1=\beta_0$, $b_1=\beta_1$, $c_1=\beta_2$ and $d_1=\beta_3$.
**(b) Find a cubic polynomial
$$
f_2(x)=a_2+b_2x+c_2x^2+d_2x^3
$$
such that $f(x)=f_2(x)$ for all $x>\xi$. Express $a_2$, $b_2$, $c_2$, $d_2$ in terms of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$.**
For $x>\xi$ we have $(x-\xi)^3_+=(x-\xi)^3$, so we can expand $f(x)$ into a cubic polynomial that satisfies these requirements:
$$
f_2(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4(x-\xi)^3\\
=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4(x-\xi)(x^2-2x\xi+\xi^2)\\
=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4(x^3-3x^2\xi+3x\xi^2-\xi^3)\\
=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4x^3-3\beta_4x^2\xi+3\beta_4x\xi^2-\beta_4\xi^3\\
=(\beta_0-\beta_4\xi^3)+(\beta_1+3\beta_4\xi^2)x+(\beta_2-3\beta_4\xi)x^2+(\beta_3+\beta_4)x^3\\
$$
Therefore we can express $a_2$, $b_2$, $c_2$, $d_2$ in terms of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$ as follows:
$$
a_2=\beta_0-\beta_4\xi^3\\
b_2=\beta_1+3\beta_4\xi^2\\
c_2=\beta_2-3\beta_4\xi\\
d_2=\beta_3+\beta_4\\
$$
**(c) Show that $f_1(\xi)=f_2(\xi)$. That is, $f(x)$ is continuous at $\xi$.**
We can show that $f_2(\xi)$ is equal to $f_1(\xi)$:
$$
f_2(\xi)=(\beta_0-\beta_4\xi^3)+(\beta_1+3\beta_4\xi^2)\xi+(\beta_2-3\beta_4\xi)\xi^2+(\beta_3+\beta_4)\xi^3\\
=\beta_0-\beta_4\xi^3+\beta_1\xi+3\beta_4\xi^3+\beta_2\xi^2-3\beta_4\xi^3+\beta_3\xi^3+\beta_4\xi^3\\
=\beta_0+\beta_1\xi+\beta_2\xi^2+\beta_3\xi^3\\
=f_1(\xi)
$$
**(d) Show that $f'_1(\xi)=f'_2(\xi)$. That is, $f'(x)$ is continuous at $\xi$.**
First, differentiate both $f_1(x)$ and $f_2(x)$:
$$
f'_1(x)=\beta_1+2\beta_2x+3\beta_3x^2\\
f'_2(x)=\beta_1+3\beta_4\xi^2+2x(\beta_2-3\beta_4\xi)+3x^2(\beta_3+\beta_4)
$$
We can then show that $f'_2(\xi)$ is equal to $f'_1(\xi)$:
$$
f'_2(\xi)=\beta_1+3\beta_4\xi^2+2\xi(\beta_2-3\beta_4\xi)+3\xi^2(\beta_3+\beta_4)\\
=\beta_1+3\beta_4\xi^2+2\beta_2\xi-6\beta_4\xi^2+3\beta_3\xi^2+3\beta_4\xi^2\\
=\beta_1+2\beta_2\xi+3\beta_3\xi^2\\
=f'_1(\xi)
$$
**(e) Show that $f''_1(\xi)=f''_2(\xi)$. That is, $f''(x)$ is continuous at $\xi$.**
The second derivatives of $f_1(x)$ and $f_2(x)$ are
$$
f''_1(x)=2\beta_2+6\beta_3x\\
f''_2(x)=2(\beta_2-3\beta_4\xi)+6x(\beta_3+\beta_4)
$$
We can show that $f''_2(\xi)$ is equal to $f''_1(\xi)$:
$$
f''_2(\xi)=2(\beta_2-3\beta_4\xi)+6\xi(\beta_3+\beta_4)\\
=2\beta_2-6\beta_4\xi+6\beta_3\xi+6\beta_4\xi\\
=2\beta_2+6\beta_3\xi\\
=f''_1(\xi)
$$
**Since $f(x)$ is piecewise cubic and is continuous at $\xi$ with continuous first and second derivatives, it is indeed a cubic spline.**
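As a quick numerical sanity check of the algebra above (a minimal sketch with arbitrarily chosen $\beta$ values and knot $\xi$): $f$ should agree with $f_2$ beyond the knot, and $f_1$ and $f_2$ should match up to the second derivative at $\xi$.
```{r}
# Sanity check with arbitrarily chosen beta values and knot xi
beta <- c(1, -2, 0.5, 3, -1.5)  # beta_0, ..., beta_4 (arbitrary)
xi <- 2
f <- function(x) beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3 +
  beta[5] * pmax(x - xi, 0)^3                 # pmax(., 0)^3 is (x - xi)^3_+
a2 <- beta[1] - beta[5] * xi^3                # coefficients derived in (b)
b2 <- beta[2] + 3 * beta[5] * xi^2
c2 <- beta[3] - 3 * beta[5] * xi
d2 <- beta[4] + beta[5]
f1 <- function(x) beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3
f2 <- function(x) a2 + b2 * x + c2 * x^2 + d2 * x^3
all.equal(f(3), f2(3))                        # (b): f equals f2 for x > xi
c(value  = f1(xi) - f2(xi),                   # (c): continuity at the knot
  first  = (beta[2] + 2 * beta[3] * xi + 3 * beta[4] * xi^2) -
           (b2 + 2 * c2 * xi + 3 * d2 * xi^2),          # (d): f' continuous
  second = (2 * beta[3] + 6 * beta[4] * xi) -
           (2 * c2 + 6 * d2 * xi))                      # (e): f'' continuous
```
All three differences are zero, confirming the derivation.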
## Exercise 2
**Suppose that a curve $\hat{g}$ is computed to smoothly fit a set of $n$ points using the following formula:
$$
\hat{g}=\arg\min_g{\left(\sum\limits_{i=1}^n (y_i-g(x_i))^2+\lambda\int\left[g^{(m)}(x)\right]^2 dx\right)}
$$
where $g^{(m)}$ represents the $m$th derivative of $g$ (and $g^{(0)}=g$). Provide example sketches of $\hat{g}$ in each of the following scenarios.**
**(a) $\lambda=\infty,m=0$
(b) $\lambda=\infty,m=1$
(c) $\lambda=\infty,m=2$
(d) $\lambda=\infty,m=3$
(e) $\lambda=0,m=3$**
The penalty term is $\lambda$ times the integral of the squared $m$th derivative of $g(x)$. Because the integrand is squared, the smallest value the penalty can take is zero. When $\lambda=\infty$ the penalty dominates the loss term, so the criterion is minimised by functions whose $m$th derivative is identically zero; among those, the loss term selects the one closest to the data.
In scenario (a), the penalty is zero only where $g(x)=0$, so $\hat{g}$ is the zero function. In scenario (b), the penalty is zero where $g(x)$ is any constant, because the derivative of a constant is zero; the loss term then selects the sample mean $\bar{y}$. In scenario (c), the penalty is zero where $g(x)$ is linear, because the second derivative of a linear function is zero, giving the least squares line. In scenario (d), $g(x)$ can be quadratic, because the third derivative of a quadratic is zero, giving the least squares quadratic.
```{r}
# Example sketches for scenarios (a)-(d): each facet shows a form of g(x)
# for which the penalty term is zero.
x <- rep(0:10, 4)
g <- c(rep(0, 11),                 # (a) the zero function
       rep(5, 11),                 # (b) a constant
       0:10 + 3,                   # (c) a linear function
       seq(0, 3, by = 0.3)^2 + 3)  # (d) a quadratic
z <- rep(c("a", "b", "c", "d"), each = 11)
dat <- data.frame(x = x, y = g, z = z)
ggplot(dat) +
  geom_line(aes(x = x, y = y)) +
  facet_wrap(vars(z)) +
  theme_minimal() +
  theme(axis.text = element_blank())
```
In scenario (e) the penalty term is multiplied by $\lambda=0$ and so vanishes entirely, leaving only the sum of squared residuals; $\hat{g}$ is therefore a completely unconstrained curve that interpolates the training points exactly, driving the training RSS to zero, as sketched below.
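This is only an illustrative sketch: `splinefun` from the stats package is used here simply to draw *one* curve through every point; any interpolant would achieve zero training RSS.
```{r}
# With lambda = 0 the penalty vanishes, so the RSS is driven to zero by a
# curve passing through every training point exactly.
set.seed(1)
pts <- data.frame(x = 1:8, y = rnorm(8))
interp <- splinefun(pts$x, pts$y)  # an interpolating cubic spline
grid <- data.frame(x = seq(1, 8, by = 0.01))
grid$y <- interp(grid$x)
ggplot() +
  geom_path(data = grid, aes(x = x, y = y)) +
  geom_point(data = pts, aes(x = x, y = y)) +
  theme_minimal()
```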
## Exercise 3
**Suppose we fit a curve with basis functions $b_1(X)=X$, $b_2(X)=(X-1)^2I(X\geq 1)$. (Note that $I(X\geq 1)$ equals 1 for $X\geq1$ and $0$ otherwise.) We fit the linear regression model
$$
Y=\beta_0+\beta_1b_1(X)+\beta_2b_2(X)+\epsilon
$$
and obtain coefficient estimates $\hat{\beta}_0=1$, $\hat{\beta}_1=1$, $\hat{\beta}_2=-2$. Sketch the estimated curve between $X=-2$ and $X=2$. Note the intercepts, slopes and other relevant information.**
For $X<1$ the curve is the straight line $Y=1+X$, with slope $1$, $Y$ intercept $1$ and $X$ intercept $-1$. From $X=1$ onwards the quadratic basis function $b_2$, with its negative coefficient, bends the curve downwards: the slope becomes $1-4(X-1)$, so the curve peaks at $X=1.25$ (where $Y=2.125$) and then falls, reaching $Y=1$ at $X=2$.
```{r}
x <- seq(-2, 2, 0.001)
# y = 1 + x everywhere, plus the truncated quadratic term beyond the knot at x = 1
y <- 1 + x + ifelse(x >= 1, -2 * (x - 1)^2, 0)
dat <- data.frame(x = x, y = y)
ggplot(dat) +
  geom_path(aes(x = x, y = y)) +
  theme_minimal()
```
## Exercise 4
**Suppose we fit a curve with basis functions $b_1(X)=I(0\leq X\leq2)-(X-1)I(1\leq X\leq 2)$, $b_2(X)=(X-3)I(3\leq X\leq4)+I(4<X\leq5)$. We fit the linear regression model
$$
Y=\beta_0+\beta_1b_1(X)+\beta_2b_2(X)+\epsilon
$$
and obtain coefficient estimates $\hat{\beta}_0=1$, $\hat{\beta}_1=1$, $\hat{\beta}_2=3$. Sketch the estimated curve between $X=-2$ and $X=2$. Note the intercepts, slopes and other relevant information.**
Where $X<0$, $Y$ is $1$. Where $0\leq X\leq1$, $Y$ is $2$, because the indicator in the first term of $b_1$ is active. From $X=1$ to $X=2$ the curve is a downward-sloping straight line with slope $-1$ ($Y=3-X$), due to the second term of $b_1$. The basis function $b_2$ is zero everywhere on $[-2,2]$ and so does not affect the sketch.
```{r}
x <- seq(-2, 2, 0.001)
# y = beta_0 + beta_1 * b1(x) + beta_2 * b2(x) with beta = (1, 1, 3)
y <- 1 +
  ifelse(x >= 0 & x <= 2, 1, 0) -            # first term of b1
  ifelse(x >= 1 & x <= 2, x - 1, 0) +        # second term of b1
  3 * (ifelse(x >= 3 & x <= 4, x - 3, 0) +   # b2, which is zero on [-2, 2]
         ifelse(x > 4 & x <= 5, 1, 0))
dat <- data.frame(x = x, y = y)
ggplot(dat) +
  geom_path(aes(x = x, y = y)) +
  scale_y_continuous(labels = function(x) round(x),
                     limits = c(0, 3)) +
  theme_minimal()
```
## Exercise 5
**Consider two curves, $\hat{g}_1$ and $\hat{g}_2$, defined by
$$
\hat{g}_1=\arg\min_g{\left(\sum\limits_{i=1}^n (y_i-g(x_i))^2+\lambda\int\left[g^{(3)}(x)\right]^2 dx\right)},\\
\hat{g}_2=\arg\min_g{\left(\sum\limits_{i=1}^n (y_i-g(x_i))^2+\lambda\int\left[g^{(4)}(x)\right]^2 dx\right)}\\
$$
where $g^{(m)}$ represents the $m$th derivative of $g$.**
**(a) As $\lambda\rightarrow\infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training RSS?
(b) As $\lambda\rightarrow\infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller test RSS?
(c) For $\lambda=0$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training and test RSS?**
As $\lambda\rightarrow\infty$ the penalty dominates, forcing the penalised derivative to be zero everywhere: $\hat{g}_1$ is constrained to be at most a quadratic (since $g^{(3)}=0$), while $\hat{g}_2$ may be any cubic (since $g^{(4)}=0$). Because $\hat{g}_2$ is the less constrained of the two, it will have the smaller training RSS.
We do not have enough information to say with certainty which curve will have the smaller test RSS. $\hat{g}_1$ will often win, because the extra flexibility of $\hat{g}_2$ risks overfitting, but if the true relationship between the predictor and the response is strongly non-linear, the cubic limit of $\hat{g}_2$ may fit better and hence have the lower test RSS.
When $\lambda=0$ the penalty terms vanish, so both criteria reduce to the same residual sum of squares; the two fitted curves are identical and therefore have the same training and test RSS.
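As an illustrative check of (a), here is a minimal simulation on made-up data, using global polynomial fits to stand in for the $\lambda\rightarrow\infty$ limits (a quadratic for $\hat{g}_1$, a cubic for $\hat{g}_2$); since the quadratic is nested within the cubic, the cubic's training RSS can never be larger.
```{r}
# Simulated non-linear data
set.seed(42)
x <- runif(100, -2, 2)
y <- sin(2 * x) + rnorm(100, sd = 0.3)
g1_limit <- lm(y ~ poly(x, 2))  # lambda -> infinity: g^(3) = 0, i.e. quadratic
g2_limit <- lm(y ~ poly(x, 3))  # lambda -> infinity: g^(4) = 0, i.e. cubic
c(g1_train_rss = sum(resid(g1_limit)^2),
  g2_train_rss = sum(resid(g2_limit)^2))
```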