12 - Synthetic Control (2/2)

Slides

Practical example

Augmented synthetic control

We revisit the data example from last week (slightly changed and trimmed) and apply a more advanced method for synthetic control estimation, known as the augmented synthetic control, proposed by Ben-Michael, Feller, and Rothstein, 2021. This doubly robust estimator combines elements from both the synthetic control estimator and an outcome model estimator. While it does not strictly prohibit negative weights, regularization is used to minimize their occurrence. The general idea is to assess the pre-treatment imbalance using an outcome model and then reweight the synthetic control with the new weights.

\[ \begin{align*} \hat{Y}_{1T}^{\text{aug}}(0) &= \overbrace{\sum_{i=2}^{N+1} \hat{w}_i^{\text{sc}} Y_{iT}}^{\text{SC estimate}} + \overbrace{\left( \hat{m}_{1T} - \sum_{i=2}^{N+1} \hat{w}_i^{\text{sc}} \hat{m}_{iT} \right)}^{\text{imbalance correction}} \\ &= \underbrace{\hat{m}_{1T}}_{\text{outcome model}} + \underbrace{\sum_{i=2}^{N+1} \hat{w}_i^{\text{sc}} (Y_{iT} - \hat{m}_{iT})}_{\text{"IPW-like" re-weights to balance residuals}} \end{align*} \] Let’s use the augsynth() method from the augsynth package for estimation. You have to install it from GitHub, so please run the following command.

devtools::install_github("ebenmichael/augsynth")
library(tidyverse)

ride <- readRDS("ride_12_single.rds")

print(head(ride))
# A tibble: 6 × 11
  city    density employment   gdp population treated post  cohort revenue period     treat_post
  <chr>     <int>      <dbl> <dbl>      <dbl> <lgl>   <lgl>  <dbl>   <dbl> <date>          <int>
1 Atlanta     333      0.601  5.78       4.91 FALSE   FALSE    Inf    23.3 2009-01-01          0
2 Atlanta     333      0.601  5.78       4.91 FALSE   FALSE    Inf    24.1 2009-02-01          0
3 Atlanta     333      0.601  5.78       4.91 FALSE   FALSE    Inf    23.1 2009-03-01          0
4 Atlanta     333      0.601  5.78       4.91 FALSE   FALSE    Inf    27.6 2009-04-01          0
5 Atlanta     333      0.601  5.78       4.91 FALSE   FALSE    Inf    26.3 2009-05-01          0
6 Atlanta     333      0.601  5.78       4.91 FALSE   FALSE    Inf    28.1 2009-06-01          0

You can specify whether you want to run an outcome model and choose its form. If you opt not to use any model, the basic synthetic control method estimates will be provided. First, we will run the analysis without an outcome model, and then we will use ridge regression. However, other methods are also available. By typing ?augsynth into your console, you can see the additional options. Apart from this, the arguments are similar to those you have seen in previous weeks.

Task 1: Fill out the code below and report the results. Do it once without and once without outcome model. What is the difference? Please note, that the inference takes some time.

library(augsynth)

att_sc <- augsynth(
  form = ...,
  # outcome ~ treatment | auxiliary covariates
  unit = ...,
  time = ...,
  data = ...,
  #progfunc = "Ridge",#  function to use to impute control outcomes (or "None")
  scm = TRUE
) # whether to use the SCM

sc_summ <- summary(att_sc)
plot(sc_summ)
Code
library(augsynth)

att_sc <- augsynth(
  revenue ~ treat_post | gdp + population + log(gdp),
  unit = city,
  time = period,
  data = ride,
  progfunc = "None",
  scm = TRUE
)
One outcome and one treatment time found. Running single_augsynth.
Code
sc_summ <- summary(att_sc)
sc_summ

Call:
single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), 
    t_int = t_int, data = data, progfunc = "None", scm = TRUE)

Average ATT Estimate (p Value for Joint Null):  0.895   ( 0.83 )
L2 Imbalance: 3.704
Percent improvement from uniform weights: 28%

Covariate L2 Imbalance: 4.467
Percent improvement from uniform weights: 18%

Avg Estimated Bias: NA

Inference type: Conformal inference

       Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value
 2010-01-01    0.931              -1.99               3.85   0.630
 2010-02-01    0.161              -2.76               3.08   1.000
 2010-03-01    0.472              -2.45               3.39   0.638
 2010-04-01    1.403              -1.52               4.32   0.316
 2010-05-01   -1.961              -4.88               0.96   0.162
 2010-06-01    0.939              -1.98               3.86   0.630
 2010-07-01    0.546              -2.38               3.47   0.622
 2010-08-01   -0.955              -3.88               1.97   0.564
 2010-09-01    0.410              -2.51               3.33   0.618
 2010-10-01   -0.225              -3.15               2.70   0.867
 2010-11-01   -0.668              -3.59               2.25   0.631
 2010-12-01    1.625              -1.30               4.55   0.391
 2011-01-01    1.769              -1.15               4.69   0.237
 2011-02-01    0.563              -2.36               3.48   0.626
 2011-03-01   -0.058              -2.98               2.86   1.000
 2011-04-01   -0.217              -3.14               2.70   0.919
 2011-05-01    1.045              -1.88               3.97   0.629
 2011-06-01   -0.577              -3.50               2.34   0.619
 2011-07-01   -0.232              -3.15               2.69   0.856
 2011-08-01    1.456              -1.46               4.38   0.313
 2011-09-01   -0.423              -3.34               2.50   0.694
 2011-10-01    3.774               0.00               6.70   0.148
 2011-11-01    0.914              -2.01               3.83   0.607
 2011-12-01    2.169              -0.75               5.09   0.141
 2012-01-01   -0.738              -3.66               2.18   0.614
 2012-02-01    1.759              -1.16               4.68   0.227
 2012-03-01   -0.352              -3.27               2.57   0.756
 2012-04-01    2.085              -0.84               5.00   0.228
 2012-05-01    2.125              -0.80               5.05   0.216
 2012-06-01    0.973              -1.95               3.89   0.608
 2012-07-01    1.894              -1.03               4.82   0.235
 2012-08-01   -0.391              -3.31               2.53   0.586
 2012-09-01    0.846              -2.07               3.77   0.609
 2012-10-01   -0.209              -3.13               2.71   0.916
 2012-11-01    0.483              -2.44               3.40   0.621
 2012-12-01    1.707              -1.21               4.63   0.224
 2013-01-01    2.274              -0.65               5.20   0.221
 2013-02-01    2.587              -0.33               5.51   0.230
 2013-03-01    0.896              -2.02               3.82   0.623
 2013-04-01    1.943              -0.98               4.86   0.221
 2013-05-01    1.727              -1.19               4.65   0.266
 2013-06-01    0.260              -2.66               3.18   0.914
 2013-07-01    1.970              -0.95               4.89   0.224
 2013-08-01    0.312              -2.61               3.23   0.764
 2013-09-01    2.841              -0.08               5.76   0.290
 2013-10-01    2.698              -0.22               5.62   0.089
 2013-11-01    1.322              -1.60               4.24   0.618
 2013-12-01    1.105              -1.82               4.03   0.391
Code
plot(sc_summ)

Code
att_augsc <- augsynth(
  revenue ~ treat_post | gdp + population + log(gdp),
  unit = city,
  time = period,
  data = ride,
  progfunc = "Ridge",
  scm = TRUE
)
One outcome and one treatment time found. Running single_augsynth.
Code
augsc_summ <- summary(att_augsc)
augsc_summ

Call:
single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), 
    t_int = t_int, data = data, progfunc = "Ridge", scm = TRUE)

Average ATT Estimate (p Value for Joint Null):  1   ( 0.83 )
L2 Imbalance: 3.629
Percent improvement from uniform weights: 29%

Covariate L2 Imbalance: 4.254
Percent improvement from uniform weights: 22%

Avg Estimated Bias: -0.082

Inference type: Conformal inference

       Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value
 2010-01-01    0.983             -2.086                4.1   0.596
 2010-02-01    0.217             -2.852                3.3   0.923
 2010-03-01    0.521             -2.548                3.6   0.601
 2010-04-01    1.497             -1.572                4.6   0.323
 2010-05-01   -1.844             -4.912                1.2   0.315
 2010-06-01    1.125             -1.943                4.2   0.591
 2010-07-01    0.681             -2.388                3.8   0.619
 2010-08-01   -0.843             -3.912                2.2   0.633
 2010-09-01    0.382             -2.687                3.5   0.735
 2010-10-01   -0.107             -3.176                3.0   0.938
 2010-11-01   -0.518             -3.586                2.6   0.848
 2010-12-01    1.715             -1.354                4.8   0.390
 2011-01-01    1.796             -1.273                4.9   0.245
 2011-02-01    0.706             -2.362                3.8   0.625
 2011-03-01    0.007             -3.062                3.1   1.000
 2011-04-01   -0.141             -3.210                2.9   0.920
 2011-05-01    1.105             -1.963                4.2   0.535
 2011-06-01   -0.360             -3.429                2.7   0.837
 2011-07-01   -0.196             -3.265                2.9   0.914
 2011-08-01    1.645             -1.424                4.7   0.326
 2011-09-01   -0.445             -3.514                2.6   0.849
 2011-10-01    3.894              0.000                7.0   0.151
 2011-11-01    1.040             -2.029                4.1   0.466
 2011-12-01    2.349             -0.720                5.4   0.086
 2012-01-01   -0.548             -3.617                2.5   0.678
 2012-02-01    1.986             -1.083                5.1   0.239
 2012-03-01   -0.235             -3.303                2.8   0.929
 2012-04-01    2.267             -0.801                5.3   0.245
 2012-05-01    2.266             -0.803                5.3   0.253
 2012-06-01    1.135             -1.934                4.2   0.399
 2012-07-01    1.932             -1.137                5.0   0.230
 2012-08-01   -0.281             -3.349                2.8   0.942
 2012-09-01    0.912             -2.157                4.0   0.544
 2012-10-01   -0.012             -3.081                3.1   0.907
 2012-11-01    0.521             -2.548                3.6   0.688
 2012-12-01    1.786             -1.283                4.9   0.236
 2013-01-01    2.423             -0.646                5.5   0.136
 2013-02-01    2.718             -0.351                5.8   0.238
 2013-03-01    0.941             -2.128                4.0   0.508
 2013-04-01    2.020             -1.048                5.1   0.242
 2013-05-01    1.948             -1.121                5.0   0.219
 2013-06-01    0.321             -2.748                3.4   0.848
 2013-07-01    1.950             -1.118                5.0   0.218
 2013-08-01    0.404             -2.665                3.5   0.934
 2013-09-01    3.013             -0.056                6.1   0.324
 2013-10-01    2.818             -0.251                5.9   0.083
 2013-11-01    1.422             -1.647                4.5   0.395
 2013-12-01    1.174             -1.895                4.2   0.387
Code
plot(augsc_summ)

Staggered timing

Historically, synthetic control methods have been applied to single treated units, while difference-in-differences estimators were typically used for multiple treated units and staggered treatments. However, if we are unwilling to assume parallel trends and prefer to use the synthetic control approach, we can utilize the augsynth package again.

But first, let’s get the data where the new feature for the ride sharing platform were rolled out at multiple cities at different points of time and explore the data. A package that helps with visualization of staggered rollouts is panelView.

library(tidyverse)

ride_multi <- readRDS("ride_12_multi.rds")

print(head(ride_multi))
# A tibble: 6 × 15
  city     year density employment   gdp population month treated  post cohort summer pre_summer post_summer revenue period    
  <chr>   <int>   <int>      <dbl> <dbl>      <dbl> <int> <lgl>   <dbl>  <dbl> <lgl>  <lgl>      <lgl>         <dbl> <date>    
1 Chicago  2009     498      0.585  5.97       9.43     1 TRUE        0   2010 FALSE  FALSE      FALSE          23.9 2009-01-01
2 Chicago  2009     498      0.585  5.97       9.43     2 TRUE        0   2010 FALSE  FALSE      FALSE          24.2 2009-02-01
3 Chicago  2009     498      0.585  5.97       9.43     3 TRUE        0   2010 FALSE  FALSE      FALSE          23.4 2009-03-01
4 Chicago  2009     498      0.585  5.97       9.43     4 TRUE        0   2010 FALSE  TRUE       FALSE          24.3 2009-04-01
5 Chicago  2009     498      0.585  5.97       9.43     5 TRUE        0   2010 FALSE  TRUE       FALSE          26.3 2009-05-01
6 Chicago  2009     498      0.585  5.97       9.43     6 TRUE        0   2010 TRUE   FALSE      FALSE          26.2 2009-06-01

Task 2: Read through the tutorial and reproduce the following plots. Describe what you can see.

Code
library("panelView")

panelview(revenue ~ post, 
          data = ride_multi,
          index = c("city","period"), 
          xlab = "Period", 
          ylab = "City",
          cex.axis = 3,
          by.timing = TRUE,
          pre.post = TRUE
          )

Code
panelview(
  revenue ~ post,
  data = ride_multi,
  index = c("city","period"),
  cex.axis = 3,
  type = "outcome",
  main = "Revenue",
  xlab = "Period",
  ylab = "Revenue"
)

Code
panelview(
  revenue ~ post,
  data = ride_multi,
  index = c("city","period"), 
  cex.axis = 3,
  type = "outcome",
  main = "Revenue",
  xlab = "Period",
  ylab = "Revenue", 
  id = c("Chicago", "Phoenix", "Las Vegas", "Seattle", "Houston", "Washington"),
  by.group = TRUE
)

However, the problem that arises is that aggregate multiple synthetic controls into a single parameter estimate. We need to find the balance between simply estimating a synthetic control for each treated unit and pooled synthetic control for all treated units. The answer is the partially-pooled synthetic control that minimizes a weighted average of the two imbalances and includes an intercept to account for level differences.

Equation highlights two equivalent interpretations: average of unit-specific SCM estimates or SCM estimate of the average treated unit.

\[ \begin{aligned} \hat{\tau}_k = \frac{1}{J} \sum_{j=1}^J \hat{\tau}_{jk} &= \frac{1}{J} \sum_{j=1}^J \left[ Y_{jT_j + k} - \sum_{i=1}^N \hat{w}_{ij} Y_{iT_j + k} \right] \\ &= \frac{1}{J} \sum_{j=1}^J Y_{jT_j + k} - \sum_{i=1}^N \frac{1}{J} \sum_{j=1}^J \hat{w}_{ij} Y_{iT_j + k} \end{aligned} \]

Unit-specific imbalance:

\[ q_X^{\text{sep}}(\mathbf{W}) = \sqrt{\frac{1}{J} \sum_{j=1}^J \left\| \mathbf{X}_j - \sum_{i=1}^N w_{ij} \mathbf{X}_i \right\|_2^2} \]

Pooled imbalance:

\[ q_X^{\text{pool}}(\mathbf{W}) = \left\| \frac{1}{J} \sum_{j=1}^J \mathbf{X}_j - \sum_{i=1}^N w_{ij} \mathbf{X}_i \right\|_2 \] Weighted average of imbalances:

\[ \mathbf{W}^* = \arg\min_{\mathbf{W}} \left( \nu \left( \tilde{q}^{\text{pool}}(\mathbf{W}) \right)^2 + (1 - \nu) \left( \tilde{q}^{\text{sep}}(\mathbf{W}) \right)^2 + \lambda \| \Gamma \|_F^2 \right) \]

# Synthetic control with multiple treatment periods
att_msc <- multisynth(revenue ~ post | gdp + population + log(gdp), 
                             unit = city, 
                             time = period, 
                             data = ride_multi, 
                             scm = T)

msc_summ <- summary(att_msc) 
plot(msc_summ, levels = "Average")

Synthetic Difference-in-Differences

Synthetic Difference in Differences (SDID) is a method that incorporates both time and unit weights, as well as unit fixed effects, to achieve parallel trends by weighting control observations. SDID combines features of Difference in Differences (DID) and Synthetic Control (SC). It reweights and matches pre-exposure trends, similar to SC, and is invariant to additive unit-level shifts, making it valid for large-panel inference, like DID. SDID provides consistent and asymptotically normal estimates, performing as well as or better than DID in traditional DID settings. Unlike DID, which only handles completely random treatment assignment, SDID can manage cases where treatment assignment correlates with some time or unit latent factors. Similarly, SDID is as effective as or better than SC in traditional SC settings. While uniformly random treatment assignment results in unbiased outcomes for all methods, SDID is more precise.

SDID’s double robustness is comparable to the augmented inverse probability weighting estimator proposed by Scharfstein, Rotnitzky, and Robins (1999) and is similar to the augmented SC estimator by Ben-Michael, Feller, and Rothstein (2021) and Arkhangelsky et al. (2021).

\[ \hat{\tau}^{sdid} = \arg \min_{\tau, \mu, \alpha, \beta} \{ \sum_{i = 1}^N \sum_{t = 1}^T (Y_{it} - \mu - \alpha_i - \beta_ t - D_{it} \tau)^2 \hat{w}_i^{sdid} \hat{\lambda}_t^{sdid} \} \] For now, let’s go back to the first setting, where we had only one treated unit. Let’s build the 4 blocks as we did last week.

# Split and extract 4 matrices
y0_pre <- ride |>
  filter(post == 0, treated == 0) |> 
  pivot_wider(id_cols = "period", names_from = "city", values_from = "revenue") |> 
  select(-period) |> 
  as.matrix()

y1_pre <- ride |>
  filter(post == 0, treated == 1) |> 
  pivot_wider(id_cols = "period", names_from = "city", values_from = "revenue") |> 
  select(-period) |> 
  as.matrix()

y0_post <- ride |>
  filter(post == 1, treated == 0) |> 
  pivot_wider(id_cols = "period", names_from = "city", values_from = "revenue") |> 
  select(-period) |> 
  as.matrix()

y1_post <- ride |>
  filter(post == 1, treated == 1) |> 
  pivot_wider(id_cols = "period", names_from = "city", values_from = "revenue") |> 
  select(-period) |> 
  as.matrix()

Step-by-step, we build a synthetic DiD estimator, which slightly differs from the one proposed by Arkhangelsky et al., but gives you the intuition of the inner workings.

The first step is just to get the unit weights as we did for the synthetic control method last week.

library(CVXR)

# Optimization procedure
w <- Variable(ncol(y0_pre))
objective <- Minimize(sum_squares(y0_pre %*% w - y1_pre))
constraints <- list(sum(w) == 1, w >= 0)
problem <- Problem(objective, constraints)
w_star <- as.vector(solve(problem)$getValue(w))

Now, instead of retrieving the ATT via matrix multiplication, we store the unit weights in our data frame. For the treated unit, the weights we store are just the average of the treatment share.

unit_w <- tibble(city = colnames(y0_pre), unit_weight = w_star)

ride_w <- ride |> 
  left_join(unit_w, by = "city") |> 
  mutate(unit_weight = replace_na(unit_weight, mean(ride$treated)))

With this data, we could compute an equivalent ATT by running a regression with time fixed effects that the ATT we got from matrix multiplication.

mod_w <- lm(revenue ~ treat_post + factor(period),
          data = ride_w |> filter(unit_weight > 1e-10), 
          weights = unit_weight)
mod_w$coefficients[2]
treat_post 
       1.2 
mean(y1_post - (y0_post %*% w_star))
[1] 1.2

However, we don’t want to have the estimate from the synthetic control method, but we want to additionally exploit information of the time dependence and get time weights. Therefore, we want to regress the average of the post-treatment outcome on the pre-treatment outcomes of the control group. Similar as before, this yields time weights for the pre-treatment period, i.e. the weight we should put on each pre-treatment period when constructing our estimator.

While we again restrict, the weights to be non-negative and sum to 1, we also allow for the inclusion of an intercept. Therefore, we’ll just add a column of ones to the matrix.

\[ \{\hat{\lambda}_0, \hat{\lambda}\} = \sum_{i = 2}^{N_c}(\lambda_0 + \sum_{t = 1}^{T_{pre}} \lambda_t Y_{it} - \frac{1}{T_{post}} \sum_{t = T_{pre} + 1}^T Y_{it})^2 \]

Task 3: Run the regression to obtain the time weights. (Hint: you might need t() and colMeans()).

Code
# Initialize weights and add intercept
t_y0_pre <- t(y0_pre) # transpose to get time coefficients/weights
l <- Variable(ncol(t_y0_pre) + 1) # one more column due to intercept
icept <- matrix(rep(1, nrow(t_y0_pre)), ncol = 1) # create intercept
t_y0_pre <- cbind(icept, t_y0_pre) # add intercept to matrix

objective <- Minimize(sum_squares(t_y0_pre %*% l - colMeans(y0_post)))
constraints <- list(sum(l[2:ncol(t_y0_pre)]) == 1, l[2:ncol(t_y0_pre)] >= 0)
problem <- Problem(objective, constraints)
l_star <- as.vector(solve(problem)$getValue(l))
Code
# Remove intercept
l_star <- l_star[2:ncol(t_y0_pre)]

# Table with period weights
pre_periods <- ride |> filter(post == 0, treated == 0) |> pull(period) |> unique()
time_w <- tibble(period = pre_periods, time_weight = l_star)

# Show weights
round(l_star, 2)
 [1] 0.00 0.13 0.26 0.14 0.00 0.00 0.03 0.00 0.00 0.11 0.04 0.28

Now, we also add the time weights to our data.frame. For post-treatment periods, we take the share of the post periods.

Having obtained both unit and time weights, we multiply the weights to get the final weight each observation gets in the weighted regression.

# Add weights to data
ride_wl <- ride_w |> 
  left_join(time_w, by = "period") |> 
  mutate(time_weight = replace_na(time_weight, mean(ride$post))) |> 
  mutate(weight = unit_weight * time_weight)

Running the regression yields a synthetic DiD estimate.

Task 4: Run the regression and return the estimated treatment effect.

# Final regression
mod_wl <- lm(revenue ~ treat_post + factor(city) + factor(period), 
   data = ride_wl |> filter(weight > 1e-10), 
   weights = weight)

mod_wl$coefficients[2]
treat_post 
      0.98 

In practice, the synthdid package does all the work. But please note, that this package is still at an early stage. It is a bit more advanced that what we just did, but the general idea is the same. It allows an intercept shift and adds an L2 penalty on the unit weights to make sure no control units gets a too large weight.

library(synthdid)

setup = panel.matrices(
  panel = as.data.frame(ride),
  unit = "city",
  time = "period",
  outcome = "revenue",
  treatment = "treat_post",
  treated.last = TRUE
)
tau.hat <- synthdid_estimate(setup$Y, setup$N0, setup$T0)
se <- sqrt(vcov(tau.hat, method='placebo'))
sprintf('point estimate: %1.2f', tau.hat)
[1] "point estimate: 0.95"
sprintf('95%% CI (%1.2f, %1.2f)', tau.hat - 1.96 * se, tau.hat + 1.96 * se)
[1] "95% CI (0.10, 1.81)"
plot(tau.hat)

Assignment

Accept the Week 12 - Assignment and follow the same steps as last week and as described in the organization chapter.

Please also remember to fill out the course evaluation if you have not done so already.

For the following tasks, you should use the ride_12_multi.rds data.

  1. The synthdid packages does not yet provide an implementation for staggered adoption. Build your own implementation that applies the estimator once for every adoption date and report the treatment effects.

  2. Find an appropriate measure that summarizes the treatment effects. Discuss in 1-2 sentences, what you think should be considered when constructing such a measure.

  3. Report the short-term (1 year) and mid-term (3 years) treatment effects and compare them in an appropriate plot using the ggplot2 package.