update bacon page
grantmcdermott committed Aug 20, 2023
1 parent 81c1119 commit 62b5b46
Showing 5 changed files with 239 additions and 247 deletions.
Binary file added assets/images/bacon_R/bacon1-1.png
Binary file added assets/images/bacon_R/bacon2-1.png
Binary file added assets/images/bacon_R/bacon3-1.png
215 changes: 215 additions & 0 deletions docs/code_r/06_bacon_r.md
@@ -0,0 +1,215 @@
---
title: bacon-decomp
layout: default
parent: R code
nav_order: 2
mathjax: true
image: "../../../assets/images/DiD.png"
---

# Goodman-Bacon decomposition

{: .no_toc }

## Table of contents

{: .no_toc .text-delta }

1. TOC {:toc}

------------------------------------------------------------------------

This section walks you through the basic logic of Andrew
Goodman-Bacon’s TWFE decomposition. It draws upon his 2021 *Journal of
Econometrics* paper,
[Difference-in-differences with variation in treatment
timing](https://www.sciencedirect.com/science/article/pii/S0304407621001445).

We’ll make use of the following R packages.

``` r
# install.packages(c("ggplot2", "fixest", "bacondecomp"))
library(ggplot2)
library(fixest)
library(bacondecomp)

# Optional: custom ggplot2 theme
theme_set(
  theme_linedraw() +
    theme(
      panel.grid.minor = element_line(linetype = 3, linewidth = 0.1),
      panel.grid.major = element_line(linetype = 3, linewidth = 0.1)
    )
)
```

## What is the Goodman-Bacon decomposition?

As discussed at the end of the TWFE section, the introduction of
differential treatment timing makes it hard to draw a bright line
between *pre* and *post* treatment periods. Let’s continue with the same
dataset that we were using in the final example from that section.

``` r
dat4 = data.frame(
  id = rep(1:3, times = 10),
  tt = rep(1:10, each = 3)
) |>
  within({
    D     = (id == 2 & tt >= 5) | (id == 3 & tt >= 8)
    btrue = ifelse(D & id == 3, 4, ifelse(D & id == 2, 2, 0))
    y     = id + 1 * tt + btrue * D
  })
```

In plot form:

``` r
ggplot(dat4, aes(x = tt, y = y, col = factor(id))) +
  geom_point() + geom_line() +
  geom_vline(xintercept = c(4.5, 7.5), lty = 2) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(x = "Time variable", y = "Outcome variable", col = "ID")
```

![](../../assets/images/bacon_R/bacon1-1.png)

Here we see that our simulation includes two distinct treatment periods.
The first treatment occurs in period 5, where id=2’s trendline jumps by
2 units. The second treatment occurs in period 8, where id=3’s trendline
jumps by 4 units. In contrast, id=1 remains untreated for the duration
of the experiment.
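
As a quick sanity check on the description above, the following base R sketch rebuilds the same simulated panel and recovers each unit's first treated period and jump size. The names `first_treat`, `jump` and `jump_size` are introduced here purely for illustration; they are not part of the original walkthrough.

``` r
# Rebuild the simulated panel (same definition as above)
dat4 = data.frame(
  id = rep(1:3, times = 10),
  tt = rep(1:10, each = 3)
) |>
  within({
    D     = (id == 2 & tt >= 5) | (id == 3 & tt >= 8)
    btrue = ifelse(D & id == 3, 4, ifelse(D & id == 2, 2, 0))
    y     = id + 1 * tt + btrue * D
  })

# First treated period for each ever-treated unit
first_treat = tapply(dat4$tt[dat4$D], dat4$id[dat4$D], min)

# Strip out the unit effect and the linear time trend that we built in;
# what remains is the treatment jump
dat4$jump = dat4$y - dat4$id - dat4$tt
jump_size = tapply(dat4$jump[dat4$D], dat4$id[dat4$D], unique)

first_treat  # units 2 and 3 start treatment in periods 5 and 8
jump_size    # with jumps of 2 and 4, respectively
```

The never-treated unit (id=1) has a `jump` of zero in every period, which is what makes it a clean control.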

Stepping back, it’s not immediately clear how to calculate the ATT. For
example, how should the late treated unit (id=3) regard the early
treated unit (id=2)? Can the latter be used as a control group for the
former? After all, they didn’t receive treatment at the same time… but,
on the other hand, id=2’s path was already altered by the initial
treatment wave.

To unravel this conundrum, let’s start by estimating a simple TWFE
model.

``` r
feols(y ~ D | id + tt, dat4)
```

    OLS estimation, Dep. Var.: y
    Observations: 30
    Fixed-effects: id: 3,  tt: 10
    Standard-errors: Clustered (id)
          Estimate Std. Error t value Pr(>|t|)
    DTRUE  2.90909   0.725719 4.00856 0.056967 .
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    RMSE: 0.35505     Adj. R2: 0.986455
                    Within R2: 0.831169
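
To see where this point estimate comes from without relying on **fixest**, here is a minimal base R sketch that reproduces it by two-way demeaning (the Frisch-Waugh-Lovell logic). The helper `demean2()` is a hypothetical name introduced for illustration, and the simple "subtract unit means, subtract time means, add back the grand mean" shortcut is only exact because this panel is balanced.

``` r
# Rebuild the simulated panel (same definition as above)
dat4 = data.frame(
  id = rep(1:3, times = 10),
  tt = rep(1:10, each = 3)
) |>
  within({
    D     = (id == 2 & tt >= 5) | (id == 3 & tt >= 8)
    btrue = ifelse(D & id == 3, 4, ifelse(D & id == 2, 2, 0))
    y     = id + 1 * tt + btrue * D
  })

# Hypothetical helper: sweep out unit and time fixed effects.
# Exact for balanced panels only.
demean2 = function(x, i, t) x - ave(x, i) - ave(x, t) + mean(x)

D_dm = demean2(as.numeric(dat4$D), dat4$id, dat4$tt)
y_dm = demean2(dat4$y, dat4$id, dat4$tt)

# Slope from regressing demeaned y on demeaned D
beta_twfe = sum(D_dm * y_dm) / sum(D_dm^2)
beta_twfe
```

This should match the `feols()` coefficient. Only the point estimate is reproduced here; the clustered standard errors are not.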

What does the resulting coefficient estimate of $\hat{\beta}=2.91$
represent? The short answer is that it comprises a *weighted average* of
four distinct 2x2 groups (or comparisons):

1.  **treated** vs **untreated**
    1)  *early treated ($T^e$)* vs *untreated ($U$)*
    2)  *late treated ($T^l$)* vs *untreated ($U$)*
2.  **differentially treated**
    1)  *early treated ($T^e$)* vs *late control ($C^l$)*
    2)  *late treated ($T^l$)* vs *early control ($C^e$)*

We can visualize these four comparison sets as follows:

``` r
rbind(
  dat4 |> subset(id %in% c(1,2)) |> transform(role = ifelse(id==2, "Treatment", "Control"), comp = "1.1. Early vs Untreated"),
  dat4 |> subset(id %in% c(1,3)) |> transform(role = ifelse(id==3, "Treatment", "Control"), comp = "1.2. Late vs Untreated"),
  dat4 |> subset(id %in% c(2,3) & tt<8) |> transform(role = ifelse(id==2, "Treatment", "Control"), comp = "2.1. Early vs Late control"),
  dat4 |> subset(id %in% c(2,3) & tt>4) |> transform(role = ifelse(id==3, "Treatment", "Control"), comp = "2.2. Late vs Early control")
) |>
  ggplot(aes(tt, y, group = id, col = factor(id), lty = role)) +
  geom_point() + geom_line() +
  facet_wrap(~comp) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_linetype_manual(values = c("Control" = 5, "Treatment" = 1)) +
  labs(x = "Time variable", y = "Outcome variable", col = "ID", lty = "Role")
```

![](../../assets/images/bacon_R/bacon2-1.png)

In other words, the panel IDs are split into different timing cohorts
based on when their first treatment takes place and where it lies in
relation to the treatment of other panel IDs. The more panel IDs and
distinct treatment timings there are, the more of these 2x2 comparisons
there are.

The Goodman-Bacon decomposition isolates each of these 2x2 comparisons
and assigns them a weight, based on their relative coverage in the data
(i.e., how long each comparison lasts relative to the overall timespan,
and how many units were involved).
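
Before turning to the package, it may help to compute the four 2x2 estimates by hand. The sketch below uses a simple difference in group means (post minus pre, treated minus control) on the relevant subsample. The helper `did_2x2()` and the vector `est_2x2` are hypothetical names introduced for illustration, and the sketch leaves the weights aside.

``` r
# Rebuild the simulated panel (same definition as above)
dat4 = data.frame(
  id = rep(1:3, times = 10),
  tt = rep(1:10, each = 3)
) |>
  within({
    D     = (id == 2 & tt >= 5) | (id == 3 & tt >= 8)
    btrue = ifelse(D & id == 3, 4, ifelse(D & id == 2, 2, 0))
    y     = id + 1 * tt + btrue * D
  })

# Hypothetical helper: a plain 2x2 DiD from pre/post means
did_2x2 = function(data, treat_id, ctrl_id, treat_start) {
  post    = data$tt >= treat_start
  is_trt  = data$id == treat_id
  is_ctrl = data$id == ctrl_id
  d_trt  = mean(data$y[is_trt & post])  - mean(data$y[is_trt & !post])
  d_ctrl = mean(data$y[is_ctrl & post]) - mean(data$y[is_ctrl & !post])
  d_trt - d_ctrl
}

est_2x2 = c(
  early_vs_untreated = did_2x2(dat4, treat_id = 2, ctrl_id = 1, treat_start = 5),
  late_vs_untreated  = did_2x2(dat4, treat_id = 3, ctrl_id = 1, treat_start = 8),
  # id=3 acts as a control only before its own treatment starts
  early_vs_late_ctrl = did_2x2(subset(dat4, tt < 8), treat_id = 2, ctrl_id = 3, treat_start = 5),
  # id=2 acts as a control only after its own treatment has started
  late_vs_early_ctrl = did_2x2(subset(dat4, tt > 4), treat_id = 3, ctrl_id = 2, treat_start = 8)
)
est_2x2
```

Note how the two "differentially treated" comparisons restrict the sample window so that the control unit's treatment status does not change within the window.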

To implement the Goodman-Bacon decomposition in R, we simply need to
call the `bacon()` function from the **bacondecomp** package. An
introductory vignette for the package is available
[here](https://cran.r-project.org/web/packages/bacondecomp/vignettes/bacon.html),
although the arguments are pretty self-explanatory. Let’s see what it
yields for our present problem:

``` r
(bgd = bacon(y ~ D, dat4, id_var = "id", time_var = "tt"))
```

      treated untreated estimate    weight                     type
    2       5       Inf        2 0.3636364     Treated vs Untreated
    3       8       Inf        4 0.3181818     Treated vs Untreated
    6       8         5        4 0.1363636 Later vs Earlier Treated
    8       5         8        2 0.1818182 Earlier vs Later Treated

Here we get the weight and the 2x2 $\beta$ estimate for each comparison.
The table tells us that the treated vs untreated comparisons ($T$ vs
$U$), summing over the early and late treated groups, carry the largest
combined weight (about 0.68), followed by early vs late treated (0.18),
and lastly, late vs early treated (0.14).
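
The ranking of comparison types can be checked by summing the weights within each type. To keep the sketch self-contained, it re-enters the values printed in the table above by hand rather than calling `bacon()` again; `bgd_tab` and `w_by_type` are illustrative names only.

``` r
# Values copied from the printed bacon() table above
bgd_tab = data.frame(
  estimate = c(2, 4, 4, 2),
  weight   = c(0.3636364, 0.3181818, 0.1363636, 0.1818182),
  type     = c("Treated vs Untreated", "Treated vs Untreated",
               "Later vs Earlier Treated", "Earlier vs Later Treated")
)

# Total weight carried by each comparison type, largest first
w_by_type = tapply(bgd_tab$weight, bgd_tab$type, sum)
sort(w_by_type, decreasing = TRUE)
```

With the actual decomposition object at hand, the same aggregation should work on `bgd` directly, since it prints with the same `weight` and `type` columns.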

Importantly, note that the weighted mean of these estimates is exactly
the same as our earlier (naive) TWFE coefficient estimate. Again, this
shouldn’t be surprising, since the whole point of the Goodman-Bacon
exercise is to decompose the makeup of that estimate and thus highlight
potential sources of bias.

``` r
(bgd_wm = weighted.mean(bgd$estimate, bgd$weight))
```

    [1] 2.909091
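
Written out, the calculation is just a weighted sum of the 2x2 estimates. In the notation below, $s_k$ is the weight and $\hat{\beta}_k^{2\times 2}$ the estimate for comparison $k$; these symbols are introduced here for illustration, and the weights are rounded from the table above.

$$
\hat{\beta}^{TWFE} = \sum_{k} s_k \, \hat{\beta}_k^{2\times 2}, \qquad \sum_{k} s_k = 1
$$

$$
\hat{\beta}^{TWFE} \approx (0.364)(2) + (0.318)(4) + (0.136)(4) + (0.182)(2) \approx 2.91
$$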

We can easily plot this result to visualize how the different components
are affecting the overall estimate.

``` r
ggplot(bgd, aes(x = weight, y = estimate, shape = type, col = type)) +
  geom_hline(yintercept = bgd_wm, lty = 2) +
  geom_point(size = 3) +
  labs(
    x = "Weight", y = "Estimate", shape = "Type", col = "Type",
    title = "Goodman-Bacon decomposition example",
    caption = "Note: The horizontal dashed line depicts the full TWFE estimate."
  )
```

![](../../assets/images/bacon_R/bacon3-1.png)

<!-- <img src="../../../assets/images/bacon1.png" height="300"> -->

The figure shows four points for the four groups in our example.

- *Earlier vs Later Treated* (red circle).
- *Later vs Earlier Treated* (green triangle).
- *Treated vs Untreated* (two blue squares; one for the earlier treated
group and another for the later treated group).

Finally, note that the estimate values of 2 and 4 coincide with the
treatment effects that were encoded into our simulation. Specifically,
unit id=2 increases by 2 and unit id=3 increases by 4 relative to the
untreated unit id=1.

## So where do TWFE regressions go wrong?

*TO BE COMPLETED*