# Communication {#sec-communication}
```{r}
#| echo: false
source("_common.R")
```
## Introduction
In @sec-exploratory-data-analysis, you learned how to use plots as tools for *exploration*.
When you make exploratory plots, you know---even before looking---which variables the plot will display.
You made each plot for a purpose, could quickly look at it, and then moved on to the next plot.
In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.
Now that you understand your data, you need to *communicate* your understanding to others.
Your audience will likely not share your background knowledge and will not be deeply invested in the data.
To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible.
In this chapter, you'll learn some of the tools that ggplot2 provides to do so.
This chapter focuses on the tools you need to create good graphics.
We assume that you know what you want, and just need to know how to do it.
For that reason, we highly recommend pairing this chapter with a good general visualization book.
We particularly like [The Truthful Art](https://www.amazon.com/gp/product/0321934075/), by Alberto Cairo.
It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.
### Prerequisites
In this chapter, we'll focus once again on ggplot2.
We'll also use a little dplyr for data manipulation, **scales** to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including **ggrepel** ([https://ggrepel.slowkow.com](https://ggrepel.slowkow.com/)) by Kamil Slowikowski and **patchwork** ([https://patchwork.data-imaginist.com](https://patchwork.data-imaginist.com/)) by Thomas Lin Pedersen.
Don't forget that you'll need to install those packages with `install.packages()` if you don't already have them.
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(scales)
library(ggrepel)
library(patchwork)
```
## Labels
The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels.
You add labels with the `labs()` function.
```{r}
#| message: false
#| fig-alt: |
#| Scatterplot of highway fuel efficiency versus engine size of cars, where
#| points are colored according to the car class. A smooth curve following
#| the trajectory of the relationship between highway fuel efficiency versus
#| engine size of cars is overlaid. The x-axis is labelled "Engine
#| displacement (L)" and the y-axis is labelled "Highway fuel economy (mpg)".
#| The legend is labelled "Car type". The plot is titled "Fuel efficiency
#| generally decreases with engine size". The subtitle is "Two seaters
#| (sports cars) are an exception because of their light weight" and the
#| caption is "Data from fueleconomy.gov".
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)",
color = "Car type",
title = "Fuel efficiency generally decreases with engine size",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
caption = "Data from fueleconomy.gov"
)
```
The purpose of a plot title is to summarize the main finding.
Avoid titles that just describe what the plot is, e.g., "A scatterplot of engine displacement vs. fuel economy".
If you need to add more text, there are two other useful labels: `subtitle` adds additional detail in a smaller font beneath the title and `caption` adds text at the bottom right of the plot, often used to describe the source of the data.
You can also use `labs()` to replace the axis and legend titles.
It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.
It's possible to use mathematical equations instead of text strings.
Just switch `""` out for `quote()` and read about the available options in `?plotmath`:
```{r}
#| fig-asp: 1
#| out-width: "50%"
#| fig-width: 3
#| fig-alt: |
#| Scatterplot with math text on the x and y axis labels. X-axis label
#| says x_i, y-axis label says sum of x_i squared, for i from 1 to n.
df <- tibble(
x = 1:10,
y = cumsum(x^2)
)
ggplot(df, aes(x, y)) +
geom_point() +
labs(
x = quote(x[i]),
y = quote(sum(x[i] ^ 2, i == 1, n))
)
```
### Exercises
1. Create one plot on the fuel economy data with customized `title`, `subtitle`, `caption`, `x`, `y`, and `color` labels.
2. Recreate the following plot using the fuel economy data.
Note that both the colors and shapes of points vary by type of drive train.
```{r}
#| echo: false
#| fig-alt: |
#| Scatterplot of highway versus city fuel efficiency. Shapes and
#| colors of points are determined by type of drive train.
ggplot(mpg, aes(x = cty, y = hwy, color = drv, shape = drv)) +
geom_point() +
labs(
x = "City MPG",
y = "Highway MPG",
shape = "Type of\ndrive train",
color = "Type of\ndrive train"
)
```
3. Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.
## Annotations
In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations.
The first tool you have at your disposal is `geom_text()`.
`geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`.
This makes it possible to add textual labels to your plots.
There are two possible sources of labels.
First, you might have a tibble that provides labels.
In the following plot we pull out the cars with the highest engine size in each drive type and save their information as a new data frame called `label_info`.
```{r}
label_info <- mpg |>
group_by(drv) |>
arrange(desc(displ)) |>
slice_head(n = 1) |>
mutate(
drive_type = case_when(
drv == "f" ~ "front-wheel drive",
drv == "r" ~ "rear-wheel drive",
drv == "4" ~ "4-wheel drive"
)
) |>
select(displ, hwy, drv, drive_type)
label_info
```
Then, we use this new data frame to label the three groups directly on the plot, which replaces the legend.
Using the `fontface` and `size` arguments we can customize the look of the text labels.
They're larger than the rest of the text on the plot and bolded.
(`theme(legend.position = "none")` turns all the legends off --- we'll talk about it more shortly.)
```{r}
#| fig-alt: |
#| Scatterplot of highway mileage versus engine size where points are colored
#| by drive type. Smooth curves for each drive type are overlaid.
#| Text labels identify the curves as front-wheel, rear-wheel, and 4-wheel.
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point(alpha = 0.3) +
geom_smooth(se = FALSE) +
geom_text(
data = label_info,
aes(x = displ, y = hwy, label = drive_type),
fontface = "bold", size = 5, hjust = "right", vjust = "bottom"
) +
theme(legend.position = "none")
```
Note the use of `hjust` (horizontal justification) and `vjust` (vertical justification) to control the alignment of the label.
However, the annotated plot we made above is hard to read because the labels overlap with each other, and with the points.
We can use the `geom_label_repel()` function from the ggrepel package to address both of these issues.
This useful package will automatically adjust labels so that they don't overlap:
```{r}
#| fig-alt: |
#| Scatterplot of highway mileage versus engine size where points are colored
#| by drive type. Smooth curves for each drive type are overlaid. Boxed labels
#| with a white background identify the curves as front-wheel, rear-wheel, and
#| 4-wheel drive, and are positioned so that they do not overlap.
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point(alpha = 0.3) +
geom_smooth(se = FALSE) +
geom_label_repel(
data = label_info,
aes(x = displ, y = hwy, label = drive_type),
fontface = "bold", size = 5, nudge_y = 2
) +
theme(legend.position = "none")
```
You can also use the same idea to highlight certain points on a plot with `geom_text_repel()` from the ggrepel package.
Note another handy technique used here: we added a second layer of large, hollow points to further highlight the labelled points.
```{r}
#| fig-alt: |
#| Scatterplot of highway fuel efficiency versus engine size of cars. Points
#| where highway mileage is above 40 as well as above 20 with engine size
#| above 5 are red, with a hollow red circle, and labelled with model name
#| of the car.
potential_outliers <- mpg |>
filter(hwy > 40 | (hwy > 20 & displ > 5))
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_text_repel(data = potential_outliers, aes(label = model)) +
geom_point(data = potential_outliers, color = "red") +
geom_point(
data = potential_outliers,
color = "red", size = 3, shape = "circle open"
)
```
Remember, in addition to `geom_text()` and `geom_label()`, you have many other geoms in ggplot2 available to help annotate your plot.
A couple ideas:
- Use `geom_hline()` and `geom_vline()` to add reference lines.
We often make them thick (`linewidth = 2`) and white (`color = "white"`), and draw them underneath the primary data layer.
That makes them easy to see, without drawing attention away from the data.
- Use `geom_rect()` to draw a rectangle around points of interest.
The boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`, `ymin`, `ymax`.
Alternatively, look into the [ggforce package](https://ggforce.data-imaginist.com/index.html), specifically [`geom_mark_hull()`](https://ggforce.data-imaginist.com/reference/geom_mark_hull.html), which allows you to annotate subsets of points with hulls.
- Use `geom_segment()` with the `arrow` argument to draw attention to a point with an arrow.
Use aesthetics `x` and `y` to define the starting location, and `xend` and `yend` to define the end location.
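
As a rough sketch of how the three ideas above can be combined, the following plot draws a white reference line at the mean highway mileage underneath the points, a rectangle around a region of interest, and an arrow pointing at that rectangle.
The data frames `box` and `pointer` and all of their coordinates are assumptions chosen by eye for illustration; they do not come from the examples above.

```{r}
#| fig-alt: |
#|   Scatterplot of highway fuel efficiency versus engine size of cars with a
#|   thick white horizontal reference line, a red rectangle around a group of
#|   points, and a red arrow pointing at the rectangle.

library(ggplot2)

# Hypothetical coordinates, chosen by eye for illustration only.
box <- data.frame(xmin = 5.1, xmax = 7.2, ymin = 22.5, ymax = 27)
pointer <- data.frame(x = 4.2, y = 36, xend = 5.2, yend = 27.5)

ggplot(mpg, aes(x = displ, y = hwy)) +
  # Added before geom_point() so the reference line sits underneath the data.
  geom_hline(yintercept = mean(mpg$hwy), linewidth = 2, color = "white") +
  geom_point() +
  # The rectangle is defined by xmin, xmax, ymin, and ymax.
  geom_rect(
    data = box,
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    inherit.aes = FALSE, fill = NA, color = "red"
  ) +
  # The segment runs from (x, y) to (xend, yend) and ends in an arrowhead.
  geom_segment(
    data = pointer,
    aes(x = x, y = y, xend = xend, yend = yend),
    inherit.aes = FALSE, color = "red",
    arrow = arrow(length = unit(0.1, "inches"))
  )
```

Setting `inherit.aes = FALSE` keeps the rectangle and segment layers from looking for `displ` and `hwy` in their own small data frames.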
Another handy function for adding annotations to plots is `annotate()`.
As a rule of thumb, geoms are generally useful for highlighting a subset of the data while `annotate()` is useful for adding one or a few annotation elements to a plot.
To demonstrate using `annotate()`, let's create some text to add to our plot.
The text is a bit long, so we'll use `stringr::str_wrap()` to automatically add line breaks to it given the number of characters you want per line:
```{r}
trend_text <- "Larger engine sizes tend to have lower fuel economy." |>
str_wrap(width = 30)
trend_text
```
Then, we add two layers of annotation: one with a label geom and the other with a segment geom.
The `x` and `y` aesthetics in both define where the annotation should start, and the `xend` and `yend` aesthetics in the segment annotation define the end location of the segment.
Note also that the segment is styled as an arrow.
```{r}
#| fig-alt: |
#| Scatterplot of highway fuel efficiency versus engine size of cars. A red
#| arrow pointing down follows the trend of the points and the annotation
#| placed next to the arrow reads "Larger engine sizes tend to have lower
#| fuel economy". The arrow and the annotation text are red.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
annotate(
geom = "label", x = 3.5, y = 38,
label = trend_text,
hjust = "left", color = "red"
) +
annotate(
geom = "segment",
x = 3, y = 35, xend = 5, yend = 25, color = "red",
arrow = arrow(type = "closed")
)
```
Annotation is a powerful tool for communicating main takeaways and interesting features of your visualizations.
The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!
### Exercises
1. Use `geom_text()` with infinite positions to place text at the four corners of the plot.
2. Use `annotate()` to add a point geom in the middle of your last plot without having to create a tibble.
Customize the shape, size, or color of the point.
3. How do labels with `geom_text()` interact with faceting?
How can you add a label to a single facet?
How can you put a different label in each facet?
(Hint: Think about the dataset that is being passed to `geom_text()`.)
4. What arguments to `geom_label()` control the appearance of the background box?
5. What are the four arguments to `arrow()`?
How do they work?
Create a series of plots that demonstrate the most important options.
## Scales
The third way you can make your plot better for communication is to adjust the scales.
Scales control how the aesthetic mappings manifest visually.
### Default scales
Normally, ggplot2 automatically adds scales for you.
For example, when you type:
```{r}
#| label: default-scales
#| fig-show: "hide"
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class))
```
ggplot2 automatically adds default scales behind the scenes:
```{r}
#| fig-show: "hide"
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_color_discrete()
```
Note the naming scheme for scales: `scale_` followed by the name of the aesthetic, then `_`, then the name of the scale.
The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date.
`scale_x_continuous()` puts the numeric values from `displ` on a continuous number line on the x-axis, `scale_color_discrete()` chooses a color for each `class` of car, etc.
There are lots of non-default scales which you'll learn about below.
The default scales have been carefully chosen to do a good job for a wide range of inputs.
Nevertheless, you might want to override the defaults for two reasons:
- You might want to tweak some of the parameters of the default scale.
This allows you to do things like change the breaks on the axes, or the key labels on the legend.
- You might want to replace the scale altogether, and use a completely different algorithm.
Often you can do better than the default because you know more about the data.
### Axis ticks and legend keys
Collectively axes and legends are called **guides**.
Axes are used for x and y aesthetics; legends are used for everything else.
There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: `breaks` and `labels`.
Breaks controls the position of the ticks, or the values associated with the keys.
Labels controls the text label associated with each tick/key.
The most common use of `breaks` is to override the default choice:
```{r}
#| fig-alt: |
#| Scatterplot of highway fuel efficiency versus engine size of cars,
#| colored by drive. The y-axis has breaks starting at 15 and ending at 40,
#| increasing by 5.
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
```
You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether.
This can be useful for maps, or for publishing plots where you can't share the absolute numbers.
You can also use `breaks` and `labels` to control the appearance of legends.
For discrete scales for categorical variables, `labels` can be a named vector that maps the existing level names to the desired labels for them.
```{r}
#| fig-alt: |
#| Scatterplot of highway fuel efficiency versus engine size of cars, colored
#| by drive. The x and y-axes do not have any labels at the axis ticks.
#| The legend has custom labels: 4-wheel, front, rear.
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL) +
scale_color_discrete(labels = c("4" = "4-wheel", "f" = "front", "r" = "rear"))
```
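
For completeness, here is a minimal sketch of the character-vector form of `labels` mentioned above, where each break gets its own explicit label.
The break values and the "mpg" suffix are illustrative choices rather than anything prescribed by ggplot2.

```{r}
#| fig-alt: |
#|   Scatterplot of highway fuel efficiency versus engine size of cars,
#|   colored by drive. The y-axis has three ticks labelled 20 mpg, 30 mpg,
#|   and 40 mpg.

library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_y_continuous(
    breaks = c(20, 30, 40),
    # One label per break, in the same order as `breaks`.
    labels = c("20 mpg", "30 mpg", "40 mpg")
  )
```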
The `labels` argument coupled with labelling functions from the scales package is also useful for formatting numbers as currency, percent, etc.
The plot on the left shows default labelling with `label_dollar()`, which adds a dollar sign as well as a thousand separator comma.
The plot on the right adds further customization by dividing dollar values by 1,000 and adding a suffix "K" (for "thousands") as well as adding custom breaks.
Note that `breaks` is in the original scale of the data.
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-alt: |
#| Two side-by-side box plots of price versus cut of diamonds. The outliers
#| are transparent. On both plots the x-axis labels are formatted as dollars.
#| The x-axis labels on the left plot start at $0 and go to $15,000, increasing
#| by $5,000. The x-axis labels on the right plot start at $1K and go to
#| $19K, increasing by $6K.
# Left
ggplot(diamonds, aes(x = price, y = cut)) +
geom_boxplot(alpha = 0.05) +
scale_x_continuous(labels = label_dollar())
# Right
ggplot(diamonds, aes(x = price, y = cut)) +
geom_boxplot(alpha = 0.05) +
scale_x_continuous(
labels = label_dollar(scale = 1/1000, suffix = "K"),
breaks = seq(1000, 19000, by = 6000)
)
```
Another handy label function is `label_percent()`:
```{r}
#| fig-alt: |
#| Segmented bar plots of cut, filled with levels of clarity. The y-axis
#| labels start at 0% and go to 100%, increasing by 25%. The y-axis label
#| name is "Percentage".
ggplot(diamonds, aes(x = cut, fill = clarity)) +
geom_bar(position = "fill") +
scale_y_continuous(name = "Percentage", labels = label_percent())
```
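
It can help to know that the `label_*()` functions from scales do not format numbers directly; each one returns a function, and ggplot2 calls that function on the break values.
The sketch below calls the returned functions by hand on a few made-up numbers to preview what would appear on an axis or legend.

```{r}
library(scales)

# label_dollar() returns a function; calling it formats a numeric vector.
label_dollar()(c(1000, 25000))

# Proportions are multiplied by 100 and given a percent sign.
label_percent()(c(0.25, 0.5))

# The customized labeller used for the right-hand diamonds plot above.
label_dollar(scale = 1/1000, suffix = "K")(c(1000, 7000, 13000))
```

With the default settings, the first call should give `"$1,000"` and `"$25,000"`, and the second `"25%"` and `"50%"`.
Previewing labels this way is often quicker than re-rendering a plot while you tune the arguments.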
Another use of `breaks` is when you have relatively few data points and want to highlight exactly where the observations occur.
For example, take this plot that shows when each US president started and ended their term.
```{r}
#| fig-alt: |
#| Line plot of id number of presidents versus the year they started their
#| presidency. Start year is marked with a point and a segment that starts
#| there and ends at the end of the presidency. The x-axis labels are
#| formatted as two digit years starting with an apostrophe, e.g., '53.
presidential |>
mutate(id = 33 + row_number()) |>
ggplot(aes(x = start, y = id)) +
geom_point() +
geom_segment(aes(xend = end, yend = id)) +
scale_x_date(name = NULL, breaks = presidential$start, date_labels = "'%y")
```
Note that for the `breaks` argument we pulled out the `start` variable as a vector with `presidential$start` because we can't do an aesthetic mapping for this argument.
Also note that the specification of breaks and labels for date and datetime scales is a little different:
- `date_labels` takes a format specification, in the same form as `parse_datetime()`.
- `date_breaks` (not shown here) takes a string like "2 days" or "1 month".
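
As a small sketch of `date_breaks`, the plot below redraws the presidential terms with evenly spaced breaks instead of one break per term.
The ten-year spacing and the `"%Y"` format are arbitrary choices for illustration.

```{r}
#| fig-alt: |
#|   Line plot of id number of presidents versus the year they started their
#|   presidency. The x-axis has evenly spaced breaks every ten years, labelled
#|   with four digit years.

library(tidyverse)

presidential |>
  mutate(id = 33 + row_number()) |>
  ggplot(aes(x = start, y = id)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = id)) +
  # Breaks every ten years, labelled with four-digit years.
  scale_x_date(name = NULL, date_breaks = "10 years", date_labels = "%Y")
```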
### Legend layout
You will most often use `breaks` and `labels` to tweak the axes.
While they both also work for legends, there are a few other techniques you are more likely to use.
To control the overall position of the legend, you need to use a `theme()` setting.
We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot.
The theme setting `legend.position` controls where the legend is drawn:
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-alt: |
#| Four scatterplots of highway fuel efficiency versus engine size of cars
#| where points are colored based on class of car. Clockwise, the legend
#| is placed on the right, left, top, and bottom of the plot.
base <- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class))
base + theme(legend.position = "right") # the default
base + theme(legend.position = "left")
base +
theme(legend.position = "top") +
guides(color = guide_legend(nrow = 3))
base +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 3))
```
If your plot is short and wide, place the legend at the top or bottom, and if it's tall and narrow, place the legend at the left or right.
You can also use `legend.position = "none"` to suppress the display of the legend altogether.
To control the display of individual legends, use `guides()` along with `guide_legend()` or `guide_colorbar()`.
The following example shows two important settings: controlling the number of rows the legend uses with `nrow`, and overriding one of the aesthetics to make the points bigger.
This is particularly useful if you have used a low `alpha` to display many points on a plot.
```{r}
#| fig-alt: |
#| Scatterplot of highway fuel efficiency versus engine size of cars
#| where points are colored based on class of car. Overlaid on the plot is a
#| smooth curve. The legend is at the bottom and classes are listed
#| horizontally in two rows. The points in the legend are larger than the points
#| in the plot.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 2, override.aes = list(size = 4)))
```
Note that the name of the argument in `guides()` matches the name of the aesthetic, just like in `labs()`.
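
`guide_colorbar()` is mentioned above but not demonstrated.
A minimal sketch, assuming a continuous variable (`cty`) is mapped to color; the title and the use of `reverse = TRUE` are illustrative choices only.

```{r}
#| fig-alt: |
#|   Scatterplot of highway fuel efficiency versus engine size of cars where
#|   points are colored by city fuel efficiency. A colorbar legend is placed
#|   at the bottom of the plot.

library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = cty)) +
  theme(legend.position = "bottom") +
  # `color` is mapped to a continuous variable, so its guide is a colorbar.
  guides(color = guide_colorbar(title = "City fuel economy (mpg)", reverse = TRUE))
```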
### Replacing a scale
Instead of just tweaking the details a little, you can replace the scale altogether.
There are two types of scales you're most likely to want to switch out: continuous position scales and color scales.
Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and color, you'll be able to quickly pick up other scale replacements.
It's very useful to plot transformations of your variable.
For example, it's easier to see the precise relationship between `carat` and `price` if we log transform them:
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 3
#| fig-alt: |
#| Two plots of price versus carat of diamonds. Data are binned and the color
#| of the rectangle representing each bin is based on the number of points
#| that fall into that bin. In the plot on the right, price and carat values
#| are logged and the axis labels show the logged values.
# Left
ggplot(diamonds, aes(x = carat, y = price)) +
geom_bin2d()
# Right
ggplot(diamonds, aes(x = log10(carat), y = log10(price))) +
geom_bin2d()
```
However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot.
Rather than doing the transformation in the aesthetic mapping, we can do it with the scale.
This is visually identical, except the axes are labelled on the original data scale.
```{r}
#| fig-alt: |
#| Plot of price versus carat of diamonds. Data binned and the color of
#| the rectangles representing each bin based on the number of points that
#| fall into that bin. The axis labels are on the original data scale.
ggplot(diamonds, aes(x = carat, y = price)) +
geom_bin2d() +
scale_x_log10() +
scale_y_log10()
```
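
Because the transformation now lives in the scale, it can be combined with the `breaks` and `labels` arguments covered earlier.
The following sketch formats the log-scaled price axis as dollars; the three break values are arbitrary choices for illustration.

```{r}
#| fig-alt: |
#|   Plot of price versus carat of diamonds on log scales. Data are binned.
#|   The y-axis has three breaks labelled as dollar amounts.

library(ggplot2)
library(scales)

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d() +
  scale_x_log10() +
  # Breaks are given on the original (dollar) scale, not the log scale.
  scale_y_log10(breaks = c(500, 2000, 8000), labels = label_dollar())
```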
Another scale that is frequently customized is color.
The default categorical scale picks colors that are evenly spaced around the color wheel.
Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of color blindness.
The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green color blindness.[^communication-1]
[^communication-1]: You can use a tool like [SimDaltonism](https://michelf.ca/projects/sim-daltonism/) to simulate color blindness to test these images.
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 3
#| fig-alt: |
#| Two scatterplots of highway mileage versus engine size where points are
#| colored by drive type. The plot on the left uses the default
#| ggplot2 color palette and the plot on the right uses a different color
#| palette.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv))
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
scale_color_brewer(palette = "Set1")
```
Don't forget simpler techniques for improving accessibility.
If there are just a few colors, you can add a redundant shape mapping.
This will also help ensure your plot is interpretable in black and white.
```{r}
#| fig-alt: |
#| Scatterplot of highway mileage versus engine size where both color
#| and shape of points are based on drive type. The color palette is not
#| the default ggplot2 palette.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv, shape = drv)) +
scale_color_brewer(palette = "Set1")
```
The ColorBrewer scales are documented online at <https://colorbrewer2.org/> and made available in R via the **RColorBrewer** package, by Erich Neuwirth.
@fig-brewer shows the complete list of all palettes.
The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle".
This often arises if you've used `cut()` to make a continuous variable into a categorical variable.
```{r}
#| label: fig-brewer
#| echo: false
#| fig-cap: All ColorBrewer scales.
#| fig-asp: 2.5
#| fig-alt: |
#| All ColorBrewer scales. One group goes from light to dark colors.
#| Another group is a set of non ordinal colors. And the last group has
#| diverging scales (from dark to light to dark again). Within each set
#| there are a number of palettes.
par(mar = c(0, 3, 0, 0))
RColorBrewer::display.brewer.all()
```
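
To illustrate the `cut()` case mentioned above, here is a small sketch that bins engine size into three ordered groups and colors them with a sequential palette.
The bin edges, the new column name `displ_bin`, and the choice of the "YlGnBu" palette are all assumptions made for this example.

```{r}
#| fig-alt: |
#|   Scatterplot of highway versus city fuel efficiency where points are
#|   colored by binned engine size using a sequential color palette.

library(tidyverse)

mpg |>
  # Hypothetical bin edges; `displ` in mpg runs from 1.6 to 7.
  mutate(displ_bin = cut(displ, breaks = c(1, 3, 5, 7))) |>
  ggplot(aes(x = cty, y = hwy, color = displ_bin)) +
  geom_point() +
  # A sequential palette conveys the ordering of the bins.
  scale_color_brewer(palette = "YlGnBu")
```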
When you have a predefined mapping between values and colors, use `scale_color_manual()`.
For example, if we map presidential party to color, we want to use the standard mapping of red for Republicans and blue for Democrats.
One approach for assigning these colors is using hex color codes:
```{r}
#| fig-alt: |
#| Line plot of id number of presidents versus the year they started their
#| presidency. Start year is marked with a point and a segment that starts
#| there and ends at the end of the presidency. Democratic presidents are
#| represented in blue and Republicans in red.
presidential |>
mutate(id = 33 + row_number()) |>
ggplot(aes(x = start, y = id, color = party)) +
geom_point() +
geom_segment(aes(xend = end, yend = id)) +
scale_color_manual(values = c(Republican = "#E81B23", Democratic = "#00AEF3"))
```
For continuous color, you can use the built-in `scale_color_gradient()` or `scale_fill_gradient()`.
If you have a diverging scale, you can use `scale_color_gradient2()`.
That allows you to give, for example, positive and negative values different colors.
That's sometimes also useful if you want to distinguish points above or below the mean.
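
A minimal sketch of the "above or below the mean" idea, assuming we center highway mileage on its mean and map the difference to color.
The column name `hwy_diff` and the three colors are illustrative choices.

```{r}
#| fig-alt: |
#|   Scatterplot of highway fuel efficiency versus engine size of cars where
#|   points above the mean highway mileage are red, points below are blue,
#|   and points near the mean are grey.

library(tidyverse)

mpg |>
  mutate(hwy_diff = hwy - mean(hwy)) |>
  ggplot(aes(x = displ, y = hwy, color = hwy_diff)) +
  geom_point() +
  # midpoint = 0 places the neutral color at the mean highway mileage.
  scale_color_gradient2(low = "blue", mid = "grey80", high = "red", midpoint = 0)
```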
Another option is to use the viridis color scales.
The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white.
These scales are available as continuous (`c`), discrete (`d`), and binned (`b`) palettes in ggplot2.
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 3
#| fig-asp: 0.75
#| fig-alt: |
#| Three hex plots where the color of the hexes shows the number of observations
#| that fall into that hex bin. The first plot uses the default, continuous
#| ggplot2 scale. The second plot uses the viridis, continuous scale, and the
#| third plot uses the viridis, binned scale.
df <- tibble(
x = rnorm(10000),
y = rnorm(10000)
)
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
labs(title = "Default, continuous", x = NULL, y = NULL)
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
scale_fill_viridis_c() +
labs(title = "Viridis, continuous", x = NULL, y = NULL)
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
scale_fill_viridis_b() +
labs(title = "Viridis, binned", x = NULL, y = NULL)
```
Note that all color scales come in two varieties: `scale_color_*()` and `scale_fill_*()` for the `color` and `fill` aesthetics respectively (the color scales are available in both UK and US spellings).
### Zooming
There are three ways to control the plot limits:
1. Adjusting what data are plotted.
2. Setting the limits in each scale.
3. Setting `xlim` and `ylim` in `coord_cartesian()`.
We'll demonstrate these options in a series of plots.
The plot on the left shows the relationship between engine size and fuel efficiency, colored by type of drive train.
The plot on the right shows the same variables, but subsets the data that are plotted.
Subsetting the data has affected the x and y scales as well as the smooth curve.
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| message: false
#| fig-alt: |
#| On the left, scatterplot of highway mileage vs. displacement, with points
#| colored by drive train. The smooth curve overlaid shows a decreasing, and then
#| increasing trend, like a hockey stick. On the right, same variables
#| are plotted with displacement ranging only from 5 to 6 and highway
#| mileage ranging only from 10 to 25. The smooth curve overlaid shows a
#| trend that's slightly increasing first and then decreasing.
# Left
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth()
# Right
mpg |>
filter(displ >= 5 & displ <= 6 & hwy >= 10 & hwy <= 25) |>
ggplot(aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth()
```
Let's compare these to the two plots below where the plot on the left sets the `limits` on individual scales and the plot on the right sets them in `coord_cartesian()`.
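We can see that reducing the limits on the scales is equivalent to subsetting the data: observations outside the limits are dropped before the smooth curve is fit.
Setting the limits in `coord_cartesian()`, in contrast, only changes the visible region, so the smooth curve is still fit to all of the data.
Therefore, to zoom in on a region of the plot, it's generally best to use `coord_cartesian()`.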
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| message: false
#| warning: false
#| fig-alt: |
#| On the left, scatterplot of highway mileage vs. displacement, with
#| displacement ranging from 5 to 6 and highway mileage ranging from
#| 10 to 25. The smooth curve overlaid shows a trend that's slightly
#| increasing first and then decreasing. On the right, same variables
#| are plotted with the same limits, however the smooth curve overlaid
#| shows a relatively flat trend with a slight increase at the end.
# Left
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth() +
scale_x_continuous(limits = c(5, 6)) +
scale_y_continuous(limits = c(10, 25))
# Right
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 6), ylim = c(10, 25))
```
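One way to check the difference is to inspect the data each plot actually builds.
The sketch below relies on `layer_data()`, which returns the data a layer uses after the scales have been applied; `count_complete()` and the other object names are hypothetical helpers introduced for this example.
With limits set on continuous scales, out-of-range values are replaced with `NA` by default, whereas `coord_cartesian()` keeps every observation.

```{r}
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Limits on the scales: out-of-range values become NA (the default behavior).
p_limited <- p_base +
  scale_x_continuous(limits = c(5, 6)) +
  scale_y_continuous(limits = c(10, 25))

# Limits on the coordinate system: the data are left untouched.
p_zoomed <- p_base +
  coord_cartesian(xlim = c(5, 6), ylim = c(10, 25))

# Hypothetical helper: count the points that still have both an x and a
# y value once the plot has been built.
count_complete <- function(plot) {
  built <- layer_data(plot)
  sum(!is.na(built$x) & !is.na(built$y))
}

n_limited <- count_complete(p_limited)
n_zoomed <- count_complete(p_zoomed)
c(limited = n_limited, zoomed = n_zoomed)
```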
On the other hand, setting the `limits` on individual scales is generally more useful if you want to *expand* the limits, e.g., to match scales across different plots.
For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-alt: |
#| On the left, a scatterplot of highway mileage vs. displacement of SUVs.
#| On the right, a scatterplot of the same variables for compact cars.
#| Points are colored by drive type for both plots. Among SUVs more of
#| the cars are 4-wheel drive and the others are rear-wheel drive, while
#| among compact cars more of the cars are front-wheel drive and the others
#| are 4-wheel drive. SUV plot shows a clear negative relationship
#| between highway mileage and displacement while in the compact cars plot
#| the relationship is much flatter.
suv <- mpg |> filter(class == "suv")
compact <- mpg |> filter(class == "compact")
# Left
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
geom_point()
# Right
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
geom_point()
```
One way to overcome this problem is to share scales across multiple plots, training the scales with the `limits` of the full data.
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-alt: |
#| On the left, a scatterplot of highway mileage vs. displacement of SUVs.
#| On the right, a scatterplot of the same variables for compact cars.
#| Points are colored by drive type for both plots. Both plots are plotted
#| on the same scale for highway mileage, displacement, and drive type,
#| resulting in the legend showing all three types (front, rear, and 4-wheel
#| drive) for both plots even though there are no front-wheel drive SUVs and
#| no rear-wheel drive compact cars. Since the x and y scales are the same,
#| and go well beyond minimum or maximum highway mileage and displacement,
#| the points do not take up the entire plotting area.
x_scale <- scale_x_continuous(limits = range(mpg$displ))
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
col_scale <- scale_color_discrete(limits = unique(mpg$drv))
# Left
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
# Right
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
```
In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.
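A minimal sketch of the faceting alternative: keep both classes in one data frame and let `facet_wrap()` share the x, y, and color scales across panels, which is its default behavior.
The `suv_compact` and `p_facet` names are just for illustration.

```{r}
#| fig-alt: |
#|   Two side-by-side panels, one for compact cars and one for SUVs, each a
#|   scatterplot of highway mileage vs. displacement with points colored by
#|   drive train. Both panels share the same axes and a single legend.
library(dplyr)
library(ggplot2)

# Keep the two classes together so a single set of scales is trained on both.
suv_compact <- mpg |>
  filter(class %in% c("suv", "compact"))

p_facet <- ggplot(suv_compact, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  facet_wrap(~class)
p_facet
```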
### Exercises
1. Why doesn't the following code override the default scale?
```{r}
#| fig-show: "hide"
df <- tibble(
x = rnorm(10000),
y = rnorm(10000)
)
ggplot(df, aes(x, y)) +
geom_hex() +
scale_color_gradient(low = "white", high = "red") +
coord_fixed()
```
2. What is the first argument to every scale?
How does it compare to `labs()`?
3. Change the display of the presidential terms by:
a. Combining the two variants that customize colors and x axis breaks.
b. Improving the display of the y axis.
c. Labelling each term with the name of the president.
d. Adding informative plot labels.
e. Placing breaks every 4 years (this is trickier than it seems!).
4. First, create the following plot.
Then, modify the code using `override.aes` to make the legend easier to see.
```{r}
#| fig-show: hide
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(aes(color = cut), alpha = 1/20)
```
## Themes {#sec-themes}
Finally, you can customize the non-data elements of your plot with a theme:
```{r}
#| message: false
#| fig-alt: |
#| Scatterplot of highway mileage vs. displacement of cars, colored by class
#| of car. The plot background is white, with gray grid lines.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
```
ggplot2 includes the eight themes shown in @fig-themes, with `theme_gray()` as the default.[^communication-2]
Many more are included in add-on packages like **ggthemes** (<https://jrnold.github.io/ggthemes>), by Jeffrey Arnold.
You can also create your own themes, if you are trying to match a particular corporate or journal style.
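A custom theme is often just a function that starts from a built-in theme and overrides a few elements.
The sketch below assumes a hypothetical house style: `theme_report()` is not part of ggplot2 or any package, and the specific element choices are only for illustration.

```{r}
#| fig-alt: |
#|   Scatterplot of highway mileage vs. displacement of cars, colored by class
#|   of car, drawn with a minimal theme, no minor grid lines, and the legend
#|   placed below the plot.
library(ggplot2)

# Hypothetical house style built on top of theme_minimal().
theme_report <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

p_report <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  theme_report()
p_report
```

Wrapping the settings in a function means every plot in a report can pick up the same style with a single call.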
[^communication-2]: Many people wonder why the default theme has a gray background.
This was a deliberate choice because it puts the data forward while still making the grid lines visible.
The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out.
The gray background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background.
Finally, the gray background creates a continuous field of color which ensures that the plot is perceived as a single visual entity.
```{r}
#| label: fig-themes
#| echo: false
#| fig-cap: The eight themes built into ggplot2.
#| fig-alt: |
#| Eight barplots created with ggplot2, each
#| with one of the eight built-in themes:
#| theme_bw() - White background with grid lines,
#| theme_light() - Light axes and grid lines,
#| theme_classic() - Classic theme, axes but no grid
#| lines, theme_linedraw() - Only black lines,
#| theme_dark() - Dark background for contrast,
#| theme_minimal() - Minimal theme, no background,
#| theme_gray() - Gray background (default theme),
#| theme_void() - Empty theme, only geoms are visible.
knitr::include_graphics("images/visualization-themes.png")
```
It's also possible to control individual components of each theme, like the size and color of the font used for the y axis.
We've already seen that `legend.position` controls where the legend is drawn.
There are many other aspects of the legend that can be customized with `theme()`.
For example, in the plot below we change the direction of the legend as well as put a black border around it.
Note that customization of the legend box and plot title elements of the theme is done with `element_*()` functions.
These functions specify the styling of non-data components, e.g., the title text is bolded via the `face` argument of `element_text()` and the legend border color is defined via the `color` argument of `element_rect()`.
The theme elements that control the position of the title and the caption are `plot.title.position` and `plot.caption.position`, respectively.
In the following plot these are set to `"plot"` to indicate these elements are aligned to the entire plot area, instead of the plot panel (the default).
A few other helpful `theme()` components are used to change the placement and format of the title and caption text.
```{r}
#| fig-alt: |
#| Scatterplot of highway fuel efficiency versus engine size of cars, colored
#| by drive. The plot is titled 'Larger engine sizes tend to have lower fuel
#| economy' with the caption pointing to the source of the data, fueleconomy.gov.
#| The caption and title are left justified, the legend is inside of the plot
#| with a black border.
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
labs(
title = "Larger engine sizes tend to have lower fuel economy",
caption = "Source: https://fueleconomy.gov."
) +
theme(
legend.position = c(0.6, 0.7),
legend.direction = "horizontal",
legend.box.background = element_rect(color = "black"),
plot.title = element_text(face = "bold"),
plot.title.position = "plot",
plot.caption.position = "plot",
plot.caption = element_text(hjust = 0)
)
```
For an overview of all `theme()` components, see help with `?theme`.
The [ggplot2 book](https://ggplot2-book.org/) is also a great place to go for the full details on theming.
### Exercises
1. Pick a theme offered by the ggthemes package and apply it to the last plot you made.
2. Make the axis labels of your plot blue and bolded.
## Layout
So far we've talked about how to create and modify a single plot.
What if you have multiple plots you want to lay out in a certain way?
The patchwork package allows you to combine separate plots into the same graphic.
We loaded this package earlier in the chapter.
To place two plots next to each other, you can simply add them to each other.
Note that you first need to create the plots and save them as objects (in the following example they're called `p1` and `p2`).
Then, you place them next to each other with `+`.
```{r}
#| fig-width: 6
#| fig-asp: 0.5
#| fig-alt: |
#| Two plots (a scatterplot of highway mileage versus engine size and
#| side-by-side boxplots of highway mileage versus drive train) placed next
#| to each other.
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
labs(title = "Plot 1")
p2 <- ggplot(mpg, aes(x = drv, y = hwy)) +
geom_boxplot() +
labs(title = "Plot 2")
p1 + p2
```
It's important to note that in the above code chunk we did not use a new function from the patchwork package.
Instead, the package added new functionality to the `+` operator.
You can also create complex plot layouts with patchwork.
In the following, `|` places `p1` and `p3` next to each other and `/` moves `p2` to the next line.
```{r}
#| fig-width: 6
#| fig-asp: 0.8
#| fig-alt: |
#| Three plots laid out such that the first and third plots are next to each
#| other and the second plot is stretched beneath them. The first plot is a
#| scatterplot of highway mileage versus engine size, the third plot is a
#| scatterplot of highway mileage versus city mileage, and the second plot is
#| side-by-side boxplots of highway mileage versus drive train.
p3 <- ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
labs(title = "Plot 3")
(p1 | p3) / p2
```
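The operators can be nested and combined with `plot_layout()`.
The following sketch, with illustrative object names, stacks one plot above two side-by-side plots and gives the top row twice the height of the bottom row.

```{r}
#| fig-width: 6
#| fig-asp: 0.8
#| fig-alt: |
#|   Three plots laid out such that a scatterplot of highway mileage versus
#|   engine size spans the top row, and a scatterplot of highway mileage
#|   versus city mileage and side-by-side boxplots of highway mileage versus
#|   drive train share a shorter bottom row.
library(ggplot2)
library(patchwork)

p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p_city <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

# `/` stacks, `|` places side by side; heights are relative, one per row.
p_stacked <- p_scatter / (p_city | p_box) +
  plot_layout(heights = c(2, 1))
p_stacked
```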
Additionally, patchwork allows you to collect legends from multiple plots into one common legend, customize the placement of the legend as well as dimensions of the plots, and add a common title, subtitle, caption, etc. to your plots.
Below we create 5 plots.
We have turned off the legends on the box plots and the scatterplot and collected the legends for the density plots at the top of the plot with `& theme(legend.position = "top")`.
Note the use of the `&` operator here instead of the usual `+`.
This is because we're modifying the theme for the patchwork plot as opposed to the individual ggplots.
The legend is placed on top, inside the `guide_area()`.
Finally, we have also customized the heights of the various components of our patchwork -- the guide has a height of 1, the box plots 3, density plots 2, and the faceted scatterplot 4.
Patchwork divides up the area you have allotted for your plot using this scale and places the components accordingly.
```{r}
#| fig-width: 8
#| fig-asp: 1
#| fig-alt: |
#| Five plots laid out such that first two plots are next to each other. Plots
#| three and four are underneath them. And the fifth plot stretches under them.
#| The patchworked plot is titled "City and highway mileage for cars with
#| different drive trains" and captioned "Source: https://fueleconomy.gov".
#| The first two plots are side-by-side box plots. Plots 3 and 4 are density