forked from ngreifer/cobalt
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcobalt_A4_love.plot.Rmd
169 lines (138 loc) · 10.4 KB
/
cobalt_A4_love.plot.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
title: 'Appendix 4: Using `love.plot` To Generate Love Plots'
author: "Noah Greifer"
date: "`r Sys.Date()`"
output:
html_vignette:
df_print: kable
toc: false
vignette: >
%\VignetteIndexEntry{Appendix 4: Using love.plot To Generate Love Plots}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(message = FALSE, fig.width=5)
```
This is a guide on how to use `love.plot` to its fullest potential, taking advantage of all of its many options and features. Other vignettes have described the basics of `love.plot`, but this guide goes more in depth. The help file for `love.plot` with `?love.plot` for the full documentation.
First, I'll describe the use of `love.plot` with binary and continuous treatments. Then, I'll go into faceted and aggregated plots in the context of multinomial treatments and clustered and multiply imputed data. I'll use the Lalonde data and `WeightIt` for all examples. `WeightIt` performs a variety of weighting methods, including propensity score weighting, which, for simplicity, will be the focus here.
First, let's load in the data set and estimate the weights.
```{r}
library(cobalt)
data("lalonde", package = "cobalt")
library(WeightIt)
w.out1 <- weightit(treat ~ age + educ + married + nodegree + race + re74 + re75,
data = lalonde, estimand = "ATE", method = "ps")
```
Next, because in this example we want to display standardized mean difference for all of the covariates, let's set the global `binary` option to `"std"` so we don't have to type it every time.
```{r}
set.cobalt.options(binary = "std")
```
The most basic way to use `love.plot` is simply to call it as you would `bal.tab` on the output of the preprocessing function (in this case, `weightit`).
```{r}
love.plot(w.out1)
```
We could also have supplied other arguments that would normally go in `bal.tab`[^1]:
```{r, eval = FALSE}
#This produces the same output as the prior block but with
#the additional covariates included in the formula.
love.plot(treat ~ age + educ + married + nodegree + race + re74 + re75 +
I(age^2) + I(educ^2), data = lalonde, weights = get.w(w.out1),
method = "weighting", estimand = "ATE")
```
Let's start with some basic customizations. First, we'll remove the propensity score from the balance display by setting `drop.distance = TRUE`. We'll change the order the covariates so they are displayed in descending order of their unadjusted mean differences by setting `var.order = "unadjusted"`. We'll also add some lines to make the change in balance clearer by setting `line = TRUE`. Finally, we'll add a threshold line at 0.1 by setting `threshold = .1`.
```{r}
love.plot(w.out1,
drop.distance = TRUE,
var.order = "unadjusted",
line = TRUE,
threshold = .1)
```
The plot is already looking much better and more informative, but let's change a few things to make it more professional. First, we'll change the names of the variables so they are easier to read. We can create a vector of new variable names and then supply that to the `var.names` argument in `love.plot()`. If it would be a burden to type out all the names, you can use the `var.names()` function to create a CSV file (i.e., spreadsheet) that can be customized and loaded back into R to be used with `love.plot()`. See `?var.names` for more information. Because we only have a few variable names, we'll just manually create a vector of names.
```{r}
new.names <- c(age = "Age (Years)",
educ = "Education (Years)",
married = "Married (Y/N)",
nodegree = "Degree Earned (Y/N)",
race_white = "Race: White",
race_black = "Race: Black",
race_hispan = "Race: Hispanic",
re74 = "Earnings in 1974 ($)",
re75 = "Earnings in 1975 ($)"
)
```
We'll change the colors of the points and lines with the `colors` argument so they aren't the `ggplot2` defaults. We'll also change the shape of the points to further clarify the different samples using the `shapes` argument.
```{r}
love.plot(w.out1,
drop.distance = TRUE,
var.order = "unadjusted",
line = TRUE,
threshold = .1,
var.names = new.names,
colors = c("red", "blue"),
shapes = c("triangle filled", "circle filled"))
```
Finally, let's makes some changed to the legend. First, we'll rename the samples to be "Unweighted" and "PS Weighted" using the `sample.names` argument. Second, we'll change the plot limits to give more padding on the right side using the `limits` argument. Third, we'll move the legend into the plot to save some space and give it a border. We can do this last step using `ggplot2` syntax because the `love.plot` output is a `ggplot` object. We need to load in `ggplot2` first to do this.
```{r}
library(ggplot2)
love.plot(w.out1,
drop.distance = TRUE,
var.order = "unadjusted",
line = TRUE,
threshold = .1,
var.names = new.names,
colors = c("red", "blue"),
shapes = c("triangle filled", "circle filled"),
sample.names = c("Unweighted", "PS Weighted"),
limits = c(0, .82)) +
theme(legend.position = c(.75, .2),
legend.key = element_rect(fill = "white"),
legend.box.background = element_rect(),
legend.box.margin = margin(6, 6, 6, 6))
```
This is starting to look like a publication-ready plot. There are still other options you can change, such as the title, subtitle, and axis names, some of which can be done using `love.plot` arguments and others which require `ggplot2` code. Sizing the plot and making sure eveything still looks good will be its own challenge, but that's true of all plots in R.
Perhaps we want to display balance for a second set of weights, maybe using a different method to estimate them, like the covariate balancign propensity score (CBPS) that is popular in political science. These can be easily added (but we'll have to use the formula interface to set multiple weights).
```{r}
w.out2 <- weightit(treat ~ age + educ + married + nodegree + race + re74 + re75,
data = lalonde, estimand = "ATE", method = "gbm",
stop.method = "ks.max")
love.plot(treat ~ age + educ + married + nodegree + race + re74 + re75,
data = lalonde,
weights = list(w1 = get.w(w.out1),
w2 = get.w(w.out2)),
var.order = "unadjusted",
line = TRUE,
threshold = .1,
var.names = new.names,
colors = c("red", "blue", "darkgreen"),
shapes = c("triangle filled", "circle filled", "square filled"),
sample.names = c("Unweighted", "PS Weighted", "CBPS Weighted"),
limits = c(0, .82)) +
theme(legend.position = c(.75, .2),
legend.key = element_rect(fill = "white"),
legend.box.background = element_rect(),
legend.box.margin = margin(6, 6, 6, 6))
```
We can see there is little benefit to using these weights over standard logistic regression weights, although significant imbalance remains for both sets of weights. (Try using entropy balancing by setting `method = "ebal"` in `weightit()` if you want to see the power of modern weighting methods.)
Perhaps balance on mean difference is not enough, and you want to display balance not only on mean differences but also on KS statistics. In general, this is good practice; mean differences don't tell the whole story. We can simply request `stat = "ks.statistics"` in `love.plot()`. We could also request `stat = "variance.ratios"` to get variance ratios, another potentially useful balance measure. Below we'll use similar formatting to request KS statistics:
```{r}
love.plot(treat ~ age + educ + married + nodegree + race + re74 + re75,
data = lalonde,
stat = c("mean.diffs", "ks.statistics"),
weights = list(w1 = get.w(w.out1),
w2 = get.w(w.out2)),
var.order = "unadjusted",
line = TRUE,
threshold = c(mean.diffs = .1),
var.names = new.names,
colors = c("red", "blue", "darkgreen"),
shapes = c("triangle filled", "circle filled", "square filled"),
sample.names = c("Unweighted", "PS Weighted", "CBPS Weighted"),
limits = c(0, .82)) +
theme(legend.position = c(.75, .2),
legend.key = element_rect(fill = "white"),
legend.box.background = element_rect(),
legend.box.margin = margin(6, 6, 6, 6))
```
Notice that KS statistics are displayed only for variables for which it is computed (i.e., continuous variables). The binary variables don't even appear on the plot. If we wanted to have them appear as well with empty space where the point would be, we can request `no.missing = FALSE`. Below, we'll put two plots side-by-side, one for mean differences and one for KS statistics. We'll put the variables in their original order rather than order by the unadjustment statistic to ensure teh same variable order is present in both plots. We'll remove the lines connecting the points. Finally, we'll
[^1]: Older versions of `cobalt` required the first argument to be a call to `bal.tab()`; this is no longer required and in fact is not recommended because `love.plot` is more flexible when handling its own inputs. That said, `love.plot` calls `bal.tab` internally, so problems that arise may be due to problems with `bal.tab` rather than with `love.plot`. Sometimes the `bal.tab` object can take some time to be produced, in which case it might be useful to just run `bal.tab` once and supply its output to different calls of `love.plot` as you fiddle with its options, as `b <- bal.tab(w.out); love.plot(b)`. Note that doing so removes some of `love.plot` flexibility. The most important aspect is that `love.plot` when called with the first syntax can modify the implicit call to `bal.tab` to get everything it needs. For example, normally variance ratios are not present in `bal.tab` output, but if they are requested with `stat = "variance.ratios"` in `love.plot`, they will be produced and displayed correctly without additional input. When using the latter syntax, an error will occur saying that variance ratios were not present in the original `bal.tab` call. This can be mitigated by setting `quick = FALSE` or `disp.v.ratio = TRUE` in the call to `bal.tab` before entering its argument into `love.plot`.