---
title: "R-4DS"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```
## 3.2.4 Exercises
Run ggplot(data = mpg). What do you see?
```{r}
ggplot(data = mpg)
```
An empty plot: no layers have been added yet, so only the blank background is drawn.
How many rows are in mpg? How many columns?
```{r}
dim(mpg)
```
What does the drv variable describe? Read the help for ?mpg to find out.
The type of drive train:
f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive (4wd)
Make a scatterplot of hwy vs cyl.
```{r}
ggplot(mpg) + geom_point(aes(x = cyl, y = hwy))
```
What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
```{r}
ggplot(mpg) + geom_point(aes(class, drv))
```
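Because both variables are categorical, many cars share the same combination of `class` and `drv` and are drawn on top of each other. The plot shows which combinations exist, but not how many cars fall into each one.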
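To see what the scatterplot hides, you can count the cars behind each point. This is only a sketch: it assumes `dplyr` is available for `count()`, and `geom_count()` is just one possible alternative display, not necessarily the answer the book has in mind.

```{r}
library(ggplot2)
library(dplyr)

# Number of cars behind each (class, drv) point of the scatterplot
combos <- count(mpg, class, drv)
combos

# geom_count() maps that number to point size, so the overlap becomes visible
ggplot(mpg) + geom_count(aes(class, drv))
```

Every row of `combos` is one visible point in the original plot, and `n` is the number of cars stacked on it.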
## 3.3.1 Exercises
What’s gone wrong with this code? Why are the points not blue?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```
Because `color = "blue"` sits inside `aes()`, ggplot treats `"blue"` as a variable with a single value and assigns it a colour from the default palette. To set the colour manually, it has to go outside `aes()`:
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```
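A quick way to check the difference between mapping and setting a colour is to look at the colours that are actually drawn. The sketch below uses `layer_data()` from ggplot2; the object names `p_mapped` and `p_set` are made up for this example.

```{r}
library(ggplot2)

# colour mapped inside aes(): "blue" becomes a one-level variable
p_mapped <- ggplot(mpg) + geom_point(aes(displ, hwy, colour = "blue"))
# colour set outside aes(): every point is drawn in blue
p_set <- ggplot(mpg) + geom_point(aes(displ, hwy), colour = "blue")

# layer_data() returns the data of a layer as it is drawn, colours included
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
```

The first call should return a palette colour rather than `"blue"`, the second one `"blue"`.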
Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
When you run `mpg`, the tibble prints the type of each column under its name: `<chr>` for character, `<int>` for integer and `<dbl>` for double.
### Categorical
- model
- cyl
- manufacturer
- trans
- drv
- fl
- class
### Continuous
- displ
- year
- cty
- hwy
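The same information can be pulled out programmatically. This is a small sketch, with `col_types` as a made-up name; note that the storage type is only a hint, since an integer column such as `cyl` takes so few values that treating it as categorical (as in the list above) is a judgment call.

```{r}
library(ggplot2)

# Storage type of every column in mpg
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# Character columns are the obvious categorical candidates
names(col_types)[col_types == "character"]
```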
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
With a continuous variable, colour becomes a gradient and size grows with the value, whereas a categorical variable gets a distinct colour or size per level. Shape has no continuous version, so mapping a continuous variable to it fails.
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = cty))
# The next line is commented out because mapping a continuous variable to shape raises an error
# ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = cty))
```
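If you want to see the shape error without stopping the document from knitting, you can capture it. A minimal sketch, assuming the error is raised while the plot is built (which is how current ggplot2 versions behave); `p_shape` and `shape_result` are names invented here.

```{r}
library(ggplot2)

p_shape <- ggplot(mpg) + geom_point(aes(displ, hwy, shape = cty))

# Building the plot triggers the error; tryCatch() captures it so knitting continues
shape_result <- tryCatch(ggplot_build(p_shape), error = function(e) e)
inherits(shape_result, "error")

# Converting a variable to a factor makes it discrete, so shape works
ggplot(mpg) + geom_point(aes(displ, hwy, shape = factor(cyl)))
```

`factor(cyl)` works because it has only four levels; the default shape palette cannot handle variables with many levels.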
What happens if you map the same variable to multiple aesthetics?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = cty, size = cty))
```
Both aesthetics then encode the same information: cars with a higher `cty` are drawn both lighter and larger. It works, but it is redundant.
What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
`stroke` controls the width of the border of a point. Shapes that have a border (such as the filled shapes 21 to 25) are the ones that `stroke` can alter.
What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
```
ggplot evaluates `displ < 5` for every row, which creates a logical (or dummy) variable on the fly, and maps its `TRUE` and `FALSE` values to the colour aesthetic.
## 3.5.1 Exercises
What happens if you facet on a continuous variable?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ cty)
```
It plots it anyway: the continuous variable is treated as categorical and every distinct value gets its own panel.
What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)
```
The empty cells are combinations of `drv` and `cyl` for which there are no data points.
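You can confirm which combinations are empty with a cross-tabulation. This is a sketch in base R; `drv_cyl` is a name chosen for this example.

```{r}
library(ggplot2)

# Cross-tabulate the two faceting variables; zeros mark the empty panels
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_cyl

# Number of drv/cyl combinations without any cars
sum(drv_cyl == 0)
```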
What plots does the following code make? What does . do?
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
```
The dot is a placeholder for "no variable" in that dimension, so it controls whether the faceting is done row-wise or column-wise. For example, `facet_grid(drv ~ .)` uses `drv` for the rows, while `facet_grid(. ~ drv)` uses it for the columns. `facet_grid(~ drv)` does the same as the column-wise faceting, but `facet_grid(drv ~)` doesn't work because a formula needs to have something after the `~`.
Take the first faceted plot in this section:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
```
What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
I think faceting is better when you want to pay attention to particular groups on their own, while the colour aesthetic is better for seeing where the points of each group sit relative to one another. Colour gives a global overview of the relationship, while faceting is better for within-group patterns. For example, fitting trend lines for many different groups is easier to read with faceting than with everything in one panel. With a larger dataset the points in a single panel would probably overlap more, which tips the balance towards faceting.
Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables?
`nrow` sets the number of rows in which the facets are laid out, whereas `ncol` sets the number of columns. Other options control other aspects of the layout. For example, `scales` controls whether the panels share their axes: `scales = "free"` lets both axes vary between panels, while `"free_x"` and `"free_y"` free only one of them. The function also has the `labeller` option to change the names of each facet, and options such as `strip.position` for the position of the facet labels and `dir` for the direction in which the panels are filled. Read `?facet_wrap` for more options.
**BONUS**
How do you change the names of the facets? Very easily
```{r}
# the names on the left (`4`, `5`, `6`, `8`) are the old facet labels
new_names <- as_labeller(c(`4` = "name0", `5` = "other_name", `6` = "name1", `8` = "name2"))
ggplot(mpg, aes(displ, cty)) + geom_point() + facet_wrap(~ cyl, labeller = new_names)
```
Yay!
**BONUS**
`facet_grid` doesn't have `nrow` and `ncol` because it calculates the grid automatically: the number of rows and columns is the number of distinct values of the row and column variables in the formula, so the number of panels is the product of the two.
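For `facet_grid(drv ~ cyl)` the size of the grid can be worked out by hand. A small sketch, assuming `dplyr` is loaded for `n_distinct()`; `n_rows` and `n_cols` are names made up here.

```{r}
library(ggplot2)
library(dplyr)

# facet_grid(drv ~ cyl) draws one row per drv value and one column per cyl value
n_rows <- n_distinct(mpg$drv)
n_cols <- n_distinct(mpg$cyl)
c(rows = n_rows, cols = n_cols, panels = n_rows * n_cols)
```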
When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
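Because plots are usually wider than they are tall, so there is more room for many panels side by side. Otherwise the graph is going to be too long, each panel gets squashed and you won't understand anything. This graph, with `model` in the rows, is a good example:
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(model ~ .)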
```
## 3.6.1 Exercises
What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
```{r}
# Line chart
mpg %>%
group_by(year) %>%
summarise(m = mean(cty)) %>%
ggplot(aes(year, m)) +
geom_line()
# Boxplot
ggplot(mpg, aes(class, hwy)) +
geom_boxplot()
# Histogram
ggplot(mpg, aes(displ)) +
geom_histogram(bins = 60)
# Area chart
huron <- data.frame(year = 1875:1972, level = as.vector(LakeHuron))
ggplot(huron, aes(year, level)) +
geom_area()
```
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
```
What does show.legend = FALSE do? What happens if you remove it?
Why do you think I used it earlier in the chapter?
It removes the legend for that layer; if you remove it, the legend is drawn again. It gives a cleaner plot when it's clear that the grouping is done on a specific variable.
What does the se argument to geom_smooth() do?
It controls whether the confidence interval is drawn around the smoothed line. The interval is shown by default (`se = TRUE`); `se = FALSE` removes it.
Will these two graphs look different? Why/why not?
```{r, echo = F}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```
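They'll be exactly the same. The first plot sets the data and the mappings once in `ggplot()`, where both layers inherit them; the second repeats the same data and mappings in each layer.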
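One way to check this beyond eyeballing the plots is to compare the data that each layer draws. A sketch using `layer_data()` from ggplot2; `p_global` and `p_local` are names invented for this example.

```{r}
library(ggplot2)

p_global <- ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth()
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(displ, hwy)) +
  geom_smooth(data = mpg, mapping = aes(displ, hwy))

# Layer 1 is the points, layer 2 is the smooth
all.equal(layer_data(p_global, 1), layer_data(p_local, 1))
all.equal(layer_data(p_global, 2), layer_data(p_local, 2))
```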
Recreate the R code necessary to generate the following graphs.
```{r}
# 1st.
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(aes(group = drv), se = FALSE)
# 2nd.
ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_smooth(se = FALSE) +
  geom_point()
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(se = FALSE)
# 3rd.
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)
# You can do this one by choosing a shape which has a border and simply colour
# the border with `colour` and the insides with `fill` (which is mapped to drv).
# Then make the whole point a bit bigger with size
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(fill = drv), shape = 21, stroke = 2, colour = "white", size = 3)
```
## 3.7.1 Exercises
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
```{r, echo = F}
# Previous plot
# fun.min, fun.max and fun replace the deprecated fun.ymin, fun.ymax and fun.y
# (ggplot2 >= 3.3.0)
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
```
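The default geom of `stat_summary()` is `geom_pointrange()`. To get the same plot from the geom function, set `stat = "summary"` and pass the same summary functions (named `fun.min`, `fun.max` and `fun` in ggplot2 >= 3.3.0):
```{r}
ggplot(diamonds) +
  geom_pointrange(
    aes(cut, depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )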
```
What does geom_col() do? How is it different to geom_bar()?
`geom_col()` leaves the data as it is: the height of each bar is the value of the `y` variable you supply. `geom_bar()` uses `stat_count()`, which computes two variables (`count` and `prop`), and then graphs the count on the y axis. With `geom_col()` you can plot the values of any x variable against any y variable.
```{r}
# For example, plotting exactly x to y values.
aggregate.data.frame(diamonds$price, list(diamonds$cut), mean, na.rm = TRUE) %>%
  print(.) %>%
  ggplot(aes(Group.1, x)) +
  geom_col()
```
Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
What variables does stat_smooth() compute? What parameters control its behaviour?
`stat_smooth()` computes `y`, the predicted value of y for each x value. It also computes `se`, the standard error of that prediction, together with `ymin` and `ymax`, the lower and upper bounds of the confidence interval around it.
The fitting method is chosen with the `method` argument, for example `lm`, `glm`, `loess` or `gam`; see `?stat_smooth`. Other parameters that control its behaviour include `formula`, `se`, `level`, `span` and `n`.
You can see the computed values by wrapping any plot that has `geom_smooth()` with `ggplot_build()`.
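As a sketch of how to look at those values: `layer_data()` is a ggplot2 helper that builds the plot and returns the data of one layer, so it saves digging through the result of `ggplot_build()` by hand. The names `p_smooth` and `smooth_data` are made up here, and `method = "lm"` is just one choice.

```{r}
library(ggplot2)

p_smooth <- ggplot(mpg, aes(displ, hwy)) + geom_smooth(method = "lm")

# Build the plot and extract the data computed by stat_smooth()
smooth_data <- layer_data(p_smooth)
head(smooth_data[, c("x", "y", "ymin", "ymax", "se")])
```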
In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
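Without `group = 1`, `geom_bar()` treats every level of `cut` as its own group, so each proportion is computed within that level and every bar has a height of 1. Setting `group = 1` puts all the rows in a single group, so the proportions are computed over the whole dataset.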
```{r}
# Each cut is treated as a separate group whose proportions sum to 1.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
# If you calculate it manually, it doesn't matter
m <- ggplot(data = diamonds)
m + geom_bar(aes(cut, ..count../sum(..count..)))
diamonds %>%
count(cut) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(cut, prop)) + geom_bar(stat = "identity") # or geom_col()
ggplot(diamonds, aes(cut)) + geom_bar(aes(y = ..count../sum(..count..)))
# By specifying group = 1, you treat all cut groups as 1 group.
ggplot(diamonds, aes(cut)) + geom_bar(aes(y = ..prop.., group = 1))
# and thus all the proportions are calculated within a single group
```
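The effect of `group = 1` can be checked on the computed bar heights. This sketch uses `after_stat(prop)`, the newer spelling of `..prop..` (an assumption: it needs ggplot2 >= 3.3.0); `p_no_group` and `p_grouped` are names invented for this example.

```{r}
library(ggplot2)

# Without group = 1, each cut is its own group, so every proportion is 1
p_no_group <- ggplot(diamonds) + geom_bar(aes(cut, y = after_stat(prop)))
layer_data(p_no_group)$y

# With group = 1, proportions are taken over the whole dataset and sum to 1
p_grouped <- ggplot(diamonds) + geom_bar(aes(cut, y = after_stat(prop), group = 1))
layer_data(p_grouped)$y
```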
## 3.8.1 Exercises
What is the problem with this plot? How could you improve it?
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
```
Although the two variables are numeric, their values are rounded to whole numbers, so many cars share exactly the same coordinates and a lot of points overlap. We could fix it by adding jitter.
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
```
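To get a feeling for how much overlap there is in the unjittered plot, compare the number of cars with the number of distinct positions. A small sketch, assuming `dplyr` is loaded for `distinct()`; `n_cars` and `n_positions` are names made up here.

```{r}
library(ggplot2)
library(dplyr)

# Rows in mpg versus distinct (cty, hwy) positions that a scatterplot can show
n_cars <- nrow(mpg)
n_positions <- nrow(distinct(mpg, cty, hwy))
c(cars = n_cars, positions = n_positions)
```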
What parameters to geom_jitter() control the amount of jittering?
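`width` and `height`. They set the amount of horizontal and vertical jitter, which is added in both the positive and the negative direction.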
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 5, height = 10)
```
Compare and contrast geom_jitter() with geom_count().
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
```
`geom_count()` is another variant of `geom_point()`: it sizes each dot by the number of observations at that specific coordinate. `geom_jitter()` instead moves every point by a small random amount, so each observation stays visible but its position is no longer exact. Looking at both helps in understanding the data.
What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
The default position adjustment is `dodge2`, which places the boxplots for each `drv` side by side within a `class`:
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = displ)) +
geom_boxplot(aes(colour = drv))
```
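The default position can also be read off the layer itself. Take this as a rough sketch: it relies on the layer object exposing its position adjustment as `$position`, which is how ggplot2 layers are built internally but is not something the book covers.

```{r}
library(ggplot2)

# A layer stores its position adjustment as a ggproto object
class(geom_boxplot()$position)[1]
```

In recent ggplot2 versions this should report a dodge2 position; very old versions used plain `dodge`.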
## 3.9.1 Exercises
Turn a stacked bar chart into a pie chart using coord_polar().
```{r}
ggplot(mpg, aes(factor(1), fill = factor(cyl))) +
geom_bar(width = 1) +
coord_polar(theta = 'y')
```
What does labs() do? Read the documentation.
`labs()` allows you to control all the labels in the plot. For example:
```{r}
ggplot(mpg, aes(cyl, fill = as.factor(cyl))) +
geom_bar() +
labs(title = "Hey, this is a title",
subtitle = "This is the subtitle",
x = "This is the X axis",
y = "This is the Y axis",
fill = "This is the fill",
caption = "This is a caption")
```
What’s the difference between coord_quickmap() and coord_map()?
```{r}
nz <- map_data("nz")
nzmap <- ggplot(nz, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
nzmap + coord_map()
nzmap + coord_quickmap()
```
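`coord_quickmap()` is very similar to `coord_map()`, but the two handle the projection differently. `coord_map()` projects a portion of the earth, which is approximately spherical, onto a flat 2D plane using a map projection (Mercator by default). Map projections do not, in general, preserve straight lines, so this requires considerable computation. `coord_quickmap()` is a quick approximation that does preserve straight lines; it works best for smaller areas closer to the equator.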
What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
```
There is a positive correlation between the two. `coord_fixed()` makes sure there are no visual discrepancies between the two axes (by default, one unit on the x axis is as long as one unit on the y axis) and
> ensures that the ranges of axes are equal to the specified ratio by adjusting the plot aspect ratio - Documentation of `coord_fixed()`.
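Finally, `geom_abline()` with no arguments draws a reference line with intercept 0 and slope 1, that is, the line where `hwy` equals `cty`. It is not an estimated slope. Points above the line are cars whose highway mileage is higher than their city mileage.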
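To see that the reference line is not a fitted line, you can compare it with a linear regression. This is only a sketch: `lm()` is one way to estimate a slope and is not part of the exercise, and `fit` is a name chosen here.

```{r}
library(ggplot2)

# geom_abline() defaults to intercept = 0 and slope = 1 (the line hwy = cty).
# A fitted regression line is a different thing:
fit <- lm(hwy ~ cty, data = mpg)
coef(fit)

# Share of cars lying above the reference line
mean(mpg$hwy > mpg$cty)
```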