forked from microfan1/hdcate
-
Notifications
You must be signed in to change notification settings - Fork 0
/
user_manual.Rmd
448 lines (344 loc) · 18.7 KB
/
user_manual.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
---
title: "User Manual: High-Dimensional Conditional Average Treatment Effects Estimation (R Package)"
output:
rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{User Manual: High-Dimensional Conditional Average Treatment Effects Estimation (R Package)}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
warning = FALSE,
message = FALSE
)
# save built-in output hook
hook_output = knitr::knit_hooks$get("output")
# set a new output hook to cut long output texts
knitr::knit_hooks$set(output = function(x, options) {
if (!is.null(n <- options$out.lines)) {
x = xfun::split_lines(x)
if (length(x) > n) {
# cut
x = c(head(x, n), '....\n')
}
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```
This package uses a two-step procedure to estimate the conditional average treatment effects (CATE) with potentially high-dimensional covariate(s). In the first stage, the nuisance functions necessary for identifying CATE can be estimated by machine learning methods, allowing the number of covariates to be comparable to or larger than the sample size. The second stage consists of a low-dimensional local linear regression, reducing CATE to a function of the covariate(s) of interest.
The CATE estimator implemented in this package not only allows for high-dimensional data, but also has the “double robustness” property: either the model for the propensity score or the models for the conditional means of the potential outcomes are allowed to be misspecified (but not both).
The algorithm used in this package is described in the paper by Fan et al., "Estimation of Conditional Average Treatment Effects With High-Dimensional Data" (2022), Journal of Business & Economic Statistics. ([doi:10.1080/07350015.2020.1811102](https://doi.org/10.1080/07350015.2020.1811102))
## Installation
Install the package from github:
```{r installation, eval=FALSE}
install.packages('devtools')
devtools::install_github('microfan1/hdcate')
```
## Basic Usage
Load package:
```{r load package}
library(hdcate)
```
Define the empirical data (which is a usual `data.frame`), here we use a built-in simulated data:
```{r message=TRUE}
set.seed(1) # set an arbitrary random seed to ensure the reproducibility
data <- HDCATE.get_sim_data(n_obs=500, n_var=100, n_rel_var=4, intercept=10) # generate data
```
The simulated data contains 500 observations, each observation has 100 covariates, only the first 4 of those covariates (`X1`, `X2`, `X3` and `X4`) are actually present in the expectation function of potential outcome, and only the last 4 covariates (`X97`, `X98`, `X99` and `X100`) are present in the propensity score function.
Let's see what the `data` looks like:
```{r data, out.lines=6}
data
```
In the data, we know that the dependent variable is `Y`, treatment variable is `D`, and we use a model with the formula of independent variable as:
```{r x_formula}
x_formula <- "X1+X2+X3+X4+X5+X6"
x_formula
```
Note that the propensity score function is misspecified in this example. We shall see the "double robust" shortly.
Then, we can use the `data` to create a model:
```{r model}
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
```
Pass this `model` to an operating function `HDCATE.set_condition_var()` to set the required conditional variable, here we want to know the conditional average treatment effects (CATE) for `X2`, from `X2=-1` to `X2=1`, step by `0.01`:
```{r set conditional variable}
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
```
Then we are ready to fit the model, pass the `model` to the operating function `HDCATE.fit()` to fit the model:
```{r fit}
HDCATE.fit(model)
```
After the model fitting, we obtain the resulting average treatment effects (via accesing the `ate` attribute of `model`) and conditional average treatment effects (via accesing the `hdcate` attribute of `model`):
```{r get fitted ATE}
# ATE
model$ate
```
```{r get fitted CATE, out.lines=6}
# CATE
model$hdcate
```
After fitting, we may want to know the uniform confidence bands, to achieve this, pass the fitted `model` to the operating function `HDCATE.inference()`:
```{r inferences}
HDCATE.inference(model)
```
Then we obtain the uniform confidence bands for 2-sided test, left-sided test and right-sided test by accessing the `i_two_side`, `i_left_side` and `i_right_side` attributes of the `model`, respectively:
```{r two-sided test, out.lines=6}
# two-sided bands
model$i_two_side
```
```{r left-sided test, out.lines=6}
# left-sided bands
model$i_left_side
```
```{r right-sided test, out.lines=6}
# right-sided bands
model$i_right_side
```
Finally, we may want a plot for the estimated results above:
```{r plot, fig.show='hold' , fig.width=6, fig.height=6, fig.align='center', fig.cap='HDCATE Estimation (propensity function is misspecified)', dpi=200, out.width="80%"}
HDCATE.plot(model)
# Alternatively, if the model is not inferenced, use the code below:
# HDCATE.plot(model, include_band=FALSE)
# Alternatively, we may want to save the full plot as a high-definition PDF file:
# HDCATE.plot(model, output_pdf=TRUE, pdf_name='hdcate_plot.pdf',
# include_band=TRUE, test_side='both', y_axis_min='auto', y_axis_max='auto',
# display.hdcate='HDCATEF', display.ate='ATE', display.siglevel='sig_level'
# )
```
From the graph above, although the model we use is misspecified for the propensity score function (we used the formula `Y~X1+X2+X3+X4+X5+X6`), we still obtain a robust result, which is very close to the actual CATE function (`CATE(X2)=10+X2`). If we use a model that is misspecified for the expectation function of potential outcome, the results are also robust:
```{r change formula, fig.show='hold' , fig.width=6, fig.height=6, fig.align='center', fig.cap='HDCATE Estimation (expectation function is misspecified)', dpi=200, out.width="80%"}
x_formula = paste(paste0('X', c(90:100)), collapse ='+') # x_formula="X90+...+X100"
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
```
The results above show the "double robust" property of the estimating approach in this package, see [Fan et al. (2022)](https://doi.org/10.1080/07350015.2020.1811102) for more theoretical results.
## Advanced Usage
### Full-sample Estimator vs Cross-fitting Estimator
By default, when creating an HDCATE model, this package uses full-sample approach, but you are also very welcome to use the cross-fitting approach described in [Fan et al. (2022)](https://doi.org/10.1080/07350015.2020.1811102), by simply pass `model` into the operator `HDCATE.use_cross_fitting()` right after the HDCATE `model` has been created, that is:
```{r cross-fitting}
# create the model first, or use an existing `model` and skip this line
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
# suppose we want to use the 5-fold cross-fitting approach:
HDCATE.use_cross_fitting(model, k_fold=5)
```
You may also want to set the folds manually, in this case, pass a list of index vector to the third argument of the operator:
```{r manual fold}
# In this case, the param k_fold is auto detected, you can pass any value to it.
HDCATE.use_cross_fitting(model, k_fold=1, folds=list(c(1:250), c(251:500)))
```
If we wants to use full-sample approach again, then we can use anothor operator `HDCATE.use_full_sample()`.
```{r change to full sample}
HDCATE.use_full_sample(model)
```
### User-defined Machine Learning Approach
By default, the package uses LASSO on the first stage (i.e. the estimation of the treated/untreated expectation functions and propensity score functions), but it is highly (and easily) extendable.
Researchers are welcome to define their own preferred methods (such as other ML methods) to run the first-stage estimation.
To define your own approach, you should define several functions that describe how you fit and predict the input data, and then pass those functions (and the `model` object) to the operator `HDCATE.set_first_stage()`. Then the package will use your method on the first stage.
The template of those functions are given as follows:
```{r temp function}
my_method_to_fit_expectation <- function(df) {
# fit the input data.frame (df), where the first column is outcome value (Y)
# and the rest are covariates in x_formula
# ...
# return a fitted object
}
my_method_to_predict_expectation <- function(fitted_model, df) {
# use the fitted object (returned by your function above) to predict the conditional expectation
# fucntion on the input data.frame (df)
# ...
# return a vector of predicted values
}
my_method_to_fit_propensity <- function(df) {
# fit the input data.frame (df), where the first column is dummy treatment
# value (D) and the rest are covariates in x_formula
# ...
# return a fitted object
}
my_method_to_predict_propensity <- function(fitted_model, df) {
# use the fitted object (returned by your function above) to predict the propensity score
# fucntion on the input data.frame (df)
# ...
# return a vector of predicted values
}
```
**Example 1 (LASSO)**: the default ML method in `HDCATE.fit` is LASSO, users can manually define a ML method using the above template, and we again use LASSO as an example, similar practice applies for other ML methods:
> Note: In practice, LASSO is the built-in method in the package, if you just want to use LASSO, there's no need to do the following.
```{r lasso example}
x_formula <- "X1+X2+X3+X4+X5+X6" # propensity function is misspecified
# create a model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
# using the template above, we can write:
my_method_to_fit_expectation <- function(df) {
# fit the input data.frame (df), where the first column is outcome value (Y)
# and the rest are covariates in x_formula
hdm::rlasso(as.formula(paste0('Y', "~", x_formula)), df)
# return a fitted object
}
my_method_to_predict_expectation <- function(fitted_model, df) {
# use the fitted object (fitted_model) to predict the conditional expectation
# fucntion on the input data.frame (df)
predict(fitted_model, df)
# return a vector of predicted values
}
my_method_to_fit_propensity <- function(df) {
# fit the input data.frame (df), where the first column is dummy treatment
# variable (D) and the rest are covariates in x_formula
hdm::rlassologit(as.formula(paste0('D', "~", x_formula)), df)
# return a fitted object
}
my_method_to_predict_propensity <- function(fitted_model, df) {
# use the fitted object (fitted_model) to predict the propensity score
# fucntion on the input data.frame (df)
predict(fitted_model, df, type="response")
# return a vector of predicted values
}
```
After defining these functions, we can easily plug them into the package using operator `HDCATE.set_first_stage()`:
```{r reset1, include=FALSE}
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
```
```{r plug in}
HDCATE.set_first_stage(
model,
fit.treated=my_method_to_fit_expectation,
fit.untreated=my_method_to_fit_expectation,
fit.propensity=my_method_to_fit_propensity,
predict.treated=my_method_to_predict_expectation,
predict.untreated=my_method_to_predict_expectation,
predict.propensity=my_method_to_predict_propensity
)
```
And then we can use the user-defined first-stage ML methods to estimate CATE:
```{r re estimate 1, fig.show='hold' , fig.width=6, fig.height=6, fig.align='center', fig.cap='HDCATE Estimation (using user-defined LASSO on the first stage)', dpi=200, out.width="80%"}
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
```
We shall see the similar result, since the user-defined method in this example is the same as the built-in method (LASSO).
To clear the user-defined first-stage method and return to the default built-in method, simply call:
```{r recover}
HDCATE.unset_first_stage(model)
```
**Example 2 (Random Forest)**: Here we provide an additional example for another popular ML method: Random Forest, to further show how users can easily plug their own ML methods into this package:
```{r reset before rf, include=FALSE}
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
```
```{r ramdom forest example, fig.show='hold' , fig.width=6, fig.height=6, fig.align='center', fig.cap='HDCATE Estimation (using user-defined Random Forest model on the first stage)', dpi=200, out.width="80%"}
# propensity function is misspecified, for example
x_formula <- "X1+X2+X3+X4+X5+X6"
# create a model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
# =========== User-defined ML approach start ===========
# using the template above, we can write:
my_method_to_fit_expectation <- function(df) {
# fit the input data.frame (df), where the first column is outcome value (Y)
# and the rest are covariates in x_formula
randomForest::randomForest(as.formula(paste0('Y', "~", x_formula)), data = df)
# return a fitted object
}
my_method_to_predict_expectation <- function(fitted_model, df) {
# use the fitted object (fitted_model) to predict the conditional expectation
# fucntion on the input data.frame (df)
predict(fitted_model, df)
# return a vector of predicted values
}
my_method_to_fit_propensity <- function(df) {
# fit the input data.frame (df), where the first column is dummy treatment
# variable (D) and the rest are covariates in x_formula
randomForest::randomForest(as.formula(paste0('D', "~", x_formula)), data = df)
# return a fitted object
}
my_method_to_predict_propensity <- function(fitted_model, df) {
# use the fitted object (fitted_model) to predict the propensity score
# fucntion on the input data.frame (df)
predict(fitted_model, df)
# return a vector of predicted values
}
HDCATE.set_first_stage(
model,
fit.treated=my_method_to_fit_expectation,
fit.untreated=my_method_to_fit_expectation,
fit.propensity=my_method_to_fit_propensity,
predict.treated=my_method_to_predict_expectation,
predict.untreated=my_method_to_predict_expectation,
predict.propensity=my_method_to_predict_propensity
)
# =========== User-defined ML approach end ===========
# set conditional variable in CATE, same as above
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
# fit, inference and plot
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
```
As shown above, after applying user-defined Random Forest approach into this package, the CATE estimation result seems even better than that using LASSO in respect of confidence bands.
Futhermore, users may want to extract the fitted ML model for analysis, this can be achieved by accessing the `propensity_score_model_list`, `untreated_cond_exp_model_list` and `treated_cond_exp_model_list` attributes of the `model` object for the fitted propensity score model, fitted conditional expectation model (for untreated) and fitted conditional expectation model (for treated), respectively. The value of these attributes are "List of `K`", where `K` is the number of folds, `K=1` when using the default full-sample estimator.
For example, we may want to calculate the variable importance in fitting treated conditional expectations using the fitted random forest model:
```{r var importance in rf, fig.show='hold' , fig.width=6, fig.height=6, fig.align='center', fig.cap='Variable Importance Extracted from the User-defined Random Forest Model'}
# the list have only one element since we use full-sample estimator
rf_model <- model$treated_cond_exp_model_list[[1]]
# then, we can use the fitted object `rf_model` for futher analysis.
# the rest of codes are omitted since they are unconcerned for this package.
# ...
```
```{r plot rf, echo=FALSE, fig.align='center', fig.cap='Variable Importance Extracted from the User-defined Random Forest Model', fig.height=6, fig.width=6, dpi=200, fig.show='hold', out.width="80%"}
# the code below uses randomForest and ggplot2 to plot out variable importance
library('ggplot2') # visualization
library('ggthemes') # visualization
library('dplyr') # data manipulation
importance <- randomForest::importance(rf_model)
# varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,'MeanDecreaseGini'],2)) # for classification problem
varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,'IncNodePurity'],2)) # for regression problem
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(Importance))))
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
# geom_text(aes(x = Variables, y = 0.5, label = Rank),
# hjust=0, vjust=0.55, size = 4, colour = 'red') +
labs(x = 'Variables', y= 'Importance (measured in squared OOB residuals)') +
coord_flip() +
theme_few()
```
We can see that in this example, the powerful random forest model captures the actual relevant variable of conditional expectation functions of potential outcome (i.e. `X1`, `X2`, `X3` and `X4`) correctly.
Similar practice applies for other ML methods.
### User-defined Inference Options
Users can define their own inference procedure by passing the following arguments into `HDCATE.inference()`:
* `sig_level`: set significant level of the confidence bands, the default is 0.01, passing a vector of significant levels is also allowed.
* `n_rep_boot`: set the number of bootstrap replications, the default is 1000.
For example, we may want to construct uniform confidence bands with a larger number of bootstrap replications=3000, and use another significant level=5%:
```{r user-defined inf}
HDCATE.inference(model, sig_level=0.05, n_rep_boot=3000)
```
### User-defined Bandwidth
On the second stage, we use rule-of-thumb to decide the bandwidth in the local linear regression, the bandwidth can also be manually set before calling `HDCATE.fit()`:
```{r bw}
HDCATE.set_bw(model, 0.15) # for example, set bandwidth=0.15
```
To reset to default rule-of-thumb approach, call:
```{r recover bw}
HDCATE.set_bw(model)
```
## Other Details
Other details for using this package can be found on the detailed package documentation or the help files below:
```{r help, eval=FALSE}
help(HDCATE)
help(HDCATE.set_condition_var)
help(HDCATE.fit)
help(HDCATE.inference)
help(HDCATE.plot)
help(HDCATE.use_cross_fitting)
help(HDCATE.use_full_sample)
help(HDCATE.set_first_stage)
help(HDCATE.unset_first_stage)
help(HDCATE.set_bw)
```
## References
Qingliang Fan, Yu-Chin Hsu, Robert P. Lieli & Yichong Zhang (2022) Estimation of Conditional Average Treatment Effects With High-Dimensional Data, Journal of Business & Economic Statistics, 40:1, 313-327, DOI: [10.1080/07350015.2020.1811102](https://doi.org/10.1080/07350015.2020.1811102)