Ch1.Rmd

---
title: "R-4DS"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```

## 3.2.4 Exercises

Run ggplot(data = mpg) what do you see?
```{r}
ggplot(data = mpg)
```

How many rows are in mtcars? How many columns?
```{r}
dim(mpg)
```
What does the drv variable describe? Read the help for ?mpg to find out.

Whether the car is front wheel drive or not.
f = front-wheel drive, r = rear wheel drive, 4 = 4wd

Make a scatterplot of hwy vs cyl.
```{r}
ggplot(mpg) + geom_point(aes(hwy, cyl))
```
What happens if you make a scatterplot of class vs drv. Why is the plot not useful?
```{r}
ggplot(mpg) + geom_point(aes(class, drv))
```

Because both variables are categorical.

## 3.3.1 Exercises

What’s gone wrong with this code? Why are the points not blue?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```

```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```

Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

### Categorical
- Model
- cyl
- Manufacturer
- trans
- drv
- fl
- class

### Continuous
- displ
- year
- cty
- hwy

Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = cty))
# ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = cty)) This creates an error
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = cty))
```

What happens if you map the same variable to multiple aesthetics?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = cty, size = cty))
```

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

Stroke controls the width of the border of certain shapes. Those shapes which have borders are the only ones that stroke can alter.

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
```

ggplot turns displ < 5 into a boolean (or dummy) variable on the fly and maps that T or F to the colour argument.

## 3.5.1 Exercises

What happens if you facet on a continuous variable?

```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cty)
```

It plots it anyway

What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)
```

It means that there are combinations where there are no data points.

What plots does the following code make? What does . do?

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```

The dot controls whether the facetting will be done row or column wise. For example `facet_grid(drv ~ .)` will use drv as rows while `facet_grid(. ~ drv)` will use it as columns. `facet_grid(~ drv)` will do the same as the column wise facetting but `facet_grid(drv ~)` won't because a formula object needs to have something after the `~`.

Take the first faceted plot in this section:

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
```

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

I think facetting is better when you want to pay particular attention to particular facets alone (naturally) while using the color aesthetic is better to discriminate which points are located where. Colour is better to get a global overview of the relationship while facetting is better for paying attention to within group patterns. For example, fitting many trendlines for different groups is better done with faceting rather than all together.

Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables?

`nrow` controls the number of rows for the total number of facets whereas `ncol` controls the number of columns. Other options can control interesting parameters. For example, scales can control whether each plot has its own y axis with `scales = "free"`, as in allow the axes to be free. The function also has the labeller option to change the names of each facet and other options like `strip.position` for the position of the facets labels. Read `?facet_wrap` for more options.

**BONUS**

How do you change the names of the facets? Very easily

```{r}
# the `0` and `1` are the old names
new_names <- as_labeller(c(`4` = "name0", `5` = "other_name", `6` = "name1", `8` = "name2"))

ggplot(mpg, aes(displ, cty)) + facet_wrap(~ cyl, labeller = new_names)
```

Yay!
**BONUS**

`facet_grid` doesn't have the option to specify rows or columns because it calculate automatically the grid. So the multiplication of the number of distinct values in the variables in the formula.


When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Because otherwise the graph is going to be too long and you won't understand anything. This graph is a good example:
```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(~ model)
```

## 3.6.1 Exercises

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
```{r}
# Line chart
mpg %>%
  group_by(year) %>%
  summarise(m = mean(cty)) %>%
  ggplot(aes(year, m)) + 
  geom_line()

# Boxplot
ggplot(mpg, aes(class, hwy)) +
  geom_boxplot()

# Histogram
ggplot(mpg, aes(displ)) +
  geom_histogram(bins = 60)

# Area chart
huron <- data.frame(year = 1875:1972, level = as.vector(LakeHuron))
ggplot(huron, aes(year, level)) +
  geom_area()

```

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
```

What does show.legend = FALSE do? What happens if you remove it?
Why do you think I used it earlier in the chapter?

It removes the legend. It gives a cleaner plot when its clear that the grouping is done on a specific variable.

What does the se argument to geom_smooth() do?

It removes the confidence intervals from the smoothed lines

Will these two graphs look different? Why/why not?
```{r, echo = F}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```

They'll be exactly the same.

Recreate the R code necessary to generate the following graphs.

```{r}
# 1st.
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(se = F)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(aes(group = drv), se = F)

# 2nd.
ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_smooth(se = F) +
  geom_point()

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(se = F)

# 3rd.
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(aes(linetype = drv), se = F)

# You can do this one by choosing a shape which has a border and simply colour
# the border with `colour` and the insides with `fill` (which is matched to drv).
# Then make the whole point a bit bigger with size
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(fill = drv), shape = 21, stroke = 2, colour = "white", size = 3)
```

## 3.7.1 Exercises

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

```{r, echo = F}
# Previous plot
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )
```

`stat_summary` is associated with `geom_pointrange`.

```{r}
ggplot(diamonds) +
  geom_pointrange(aes(cut, depth, ymin = depth, ymax = depth))
```

What does geom_col() do? How is it different to geom_bar()?

`geom_col` leaves the data as it is. `geom_bar()` creates two variables (count and prop) and then graphs the count data on the y axis. With `geom_col` you can plot the values of any x variable against any y variable.

```{r}
# For example, plotting exactly x to y values.
aggregate.data.frame(diamonds$price, list(diamonds$cut), mean, na.rm = T) %>%
  print(.) %>%
  ggplot(aes(Group.1, x)) +
  geom_col()
```

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

What variables does stat_smooth() compute? What parameters control its behaviour?

`stat_smooth()` computes the y, the predicted value of y for each x value. Also, it computes
the se of that value predicted, together with the upper and lower bound of that point prediction.
It can compute different methods such as `lm`, `glm`, `lowess` among others. See method in `?stat_smooth`. The statistic can be controlled with the method argument.

You can see the values by wrapping any plot that has geom_smooth() with ggplot_build().

In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

Not sure about this one.

```{r}
# Each cut is treated as a searapte group that sums to 1.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))

# If you calculate it manually, it doesn't matter
m <- ggplot(data = diamonds)
m + geom_bar(aes(cut, ..count../sum(..count..)))

diamonds %>%
  count(cut) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(cut, prop)) + geom_bar(stat = "identity") # or geom_col()

ggplot(diamonds, aes(cut)) + geom_bar(aes(y = ..count../sum(..count..)))

# By specifying group = 1, you treat all cut groups as 1 group.
ggplot(diamonds, aes(cut)) + geom_bar(aes(y = ..prop.., group = 1))
# and thus all the proportions are done calculate as a single group
```

## 3.8.1 Exercises

What is the problem with this plot? How could you improve it?

```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()
```
Althought the two variables are continuous, the chance of being in a single point is very discrete and a lot of points overlap. We could fix it by adding jitter.

```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point() +
  geom_jitter()
```

What parameters to geom_jitter() control the amount of jittering?
`width` and `height`

```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point() +
  geom_jitter(width = 5, height = 10)
```

Compare and contrast geom_jitter() with geom_count().
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point() +
  geom_jitter()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point() +
  geom_count()
```
`geom_count()` is another variant of `geom_point()` and controls the size of each dot based on the frequency of observations in a specifiy coordinate. It can help to contrast with `geom_jitter()` in understanding the data.

What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

```{r}
ggplot(data = mpg, mapping = aes(x = class, y = displ)) + 
  geom_boxplot(aes(colour = drv))
```


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```

## 3.9.1 Exercises

Turn a stacked bar chart into a pie chart using coord_polar().

```{r}
ggplot(mpg, aes(factor(1), fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = 'y')
```

What does labs() do? Read the documentation.
`labs()` allows you to control all the labels in the plot. For example:
```{r}
ggplot(mpg, aes(cyl, fill = as.factor(cyl))) +
    geom_bar() +
    labs(title = "Hey, this is a title",
         subtitle = "This are the subs",
         x = "This is the X axis",
         y = "This is the Y axis",
         fill = "This is the fill",
         caption = "This is a caption")
```

What’s the difference between coord_quickmap() and coord_map()?
```{r}
nz <- map_data("nz")

nzmap <- ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

nzmap + coord_map()
nzmap + coord_quickmap()
```

`coord_quickmap()` is very similar to `coord_map()` but `coord_quickmap()` preserves straight lines in what should be a spherical plane. So, basically, the earth is shperical and `coord_map()` preserves that without plotting any straight lines. `coord_quickmap()` adds those lines adjusting to the spherical surface.

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()
```

There is a positive correlation between the two. `coord_fixed()` makes sure there is no visual discrepancies and 
> ensures that the ranges of axes are equal to the specified ratio by adjusting the plot aspect ratio - Documentation of `coord_fixed()`.

Finally, `geom_abline()` plots the estimated slope between the two variables.