9 - Difference-in-Differences (1/2)

Slides & Recap

So far, we have primarily discussed cross-sectional data, where each unit is observed only once. However, sometimes you can observe units over multiple time periods, resulting in panel data. This allows us to see how a unit changes before and after a treatment. When randomization isn’t feasible, panel data is often the best alternative for identifying causal effects. The most well-known method for estimating this effect is called difference-in-differences (DiD).

Panel Data

In marketing, randomization is often impractical because you cannot always control who receives the treatment. For instance, let’s say you decide to place billboards in a city to advertise your app, aiming to measure the resulting number of downloads attributed to these billboards. You will advertise in some cities but not in others, creating a geo experiment. Additionally, you will track several time-invariant covariates:

  • \(age_i\): average age in city \(i\)
  • \(population\_size_i\): integer indicating the population size of city \(i\)
  • \(user\_share_i\): share of app users in city \(i\) prior to the observation period
  • \(competitor\_price_i\): average competitor app price in city \(i\)

Unlike your competitor, your app is free to download, and you plan to generate revenue within the app. Our observed units are cities \(i\) over multiple time periods \(t\). Let’s have a look at the data.

# Load tidyverse
library(tidyverse)

# Read data
df <- readRDS("mkt_panel.rds")

# Print first lines
print(df |> head(10))
   city month post downloads ever_treated age user_share population_size competitor_price
1     1     1    0       418            1  44       0.23              15              2.2
2     1     2    0       437            1  44       0.23              15              2.2
3     1     3    0       472            1  44       0.23              15              2.2
4     1     4    0       460            1  44       0.23              15              2.2
5     1     5    0       509            1  44       0.23              15              2.2
6     1     6    0       524            1  44       0.23              15              2.2
7     1     7    0       568            1  44       0.23              15              2.2
8     1     8    0       605            1  44       0.23              15              2.2
9     1     9    1       640            1  44       0.23              15              2.2
10    2     1    0       461            1  45       0.23               8              3.1

We see that for city \(i=1\), we have eight months with \(post = 0\) and one month with \(post = 1\). The outcome changes over time, but the covariates are time-invariant. Please also note that the indicator ever_treated records whether a unit is “ever” treated, regardless of the period.
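As a quick sanity check (this snippet is an addition, not part of the original output), we can confirm that each covariate takes exactly one value per city, i.e. is indeed time-invariant:

# Number of distinct values per covariate within each city;
# every entry should be 1 if the covariates never change over time
df |>
  group_by(city) |>
  summarise(across(c(age, user_share, population_size, competitor_price), n_distinct)) |>
  summarise(across(-city, max))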

When counting the number of control and treatment units per period, we’ll see that we have a complete panel without any missing data.

# Panel data overview
df_piv <- df |>
  count(month, ever_treated, post) |> 
  pivot_wider(
    names_from = "ever_treated",
    values_from = "n",
    names_prefix = "ever_treated_"
    )

# Print
print(df_piv)
# A tibble: 9 × 4
  month  post ever_treated_0 ever_treated_1
  <int> <dbl>          <int>          <int>
1     1     0             91            109
2     2     0             91            109
3     3     0             91            109
4     4     0             91            109
5     5     0             91            109
6     6     0             91            109
7     7     0             91            109
8     8     0             91            109
9     9     1             91            109
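Although not part of the original tutorial, a quick plot of the group means over time (using ggplot2, which is loaded with the tidyverse) helps to eyeball how both groups evolve before the treatment starts in month 9:

# Average downloads per month for ever-treated vs. never-treated cities
df |>
  group_by(month, ever_treated) |>
  summarise(downloads = mean(downloads), .groups = "drop") |>
  ggplot(aes(x = month, y = downloads, color = factor(ever_treated))) +
  geom_line() +
  geom_point() +
  geom_vline(xintercept = 8.5, linetype = "dashed") +  # treatment starts in month 9
  labs(color = "ever_treated")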

Basic difference-in-differences

You will not always have several pre-treatment periods, and DiD already works when you observe only one period before and one period after the treatment. So let’s imagine that scenario for now and re-introduce the other pre-treatment periods later.

# Last pre-treatment and post-treatment period
df_2p <- df |> filter(month %in% 8:9)

# Print
print(df_2p |> count(month, ever_treated, post))
  month ever_treated post   n
1     8            0    0  91
2     8            1    0 109
3     9            0    1  91
4     9            1    1 109

Because we observe both control and treated units before and after the treatment, we can compute the treatment effect \(\tau_{\text{DiD}}\) using the difference-in-differences method. This involves first calculating the difference in outcomes between the treated and control groups after the treatment, then subtracting the difference in outcomes between these groups before the treatment. This double differencing helps isolate the treatment effect.

\[ \begin{aligned} \tau_{\text{DiD}} ={}& \underbrace{\mathbb{E}[Y_{i,1} \mid D_i=1] - \mathbb{E}[Y_{i,1} \mid D_i=0]}_{\text{difference in post-treatment period}} \\ &- \underbrace{\left(\mathbb{E}[Y_{i,0} \mid D_i=1] - \mathbb{E}[Y_{i,0} \mid D_i=0]\right)}_{\text{difference in pre-treatment period}} \end{aligned} \]

In a slightly different and cleaned-up notation, the table below shows that the remaining difference can only be attributed to the treatment.

Group             Time   Outcome                          1st Difference   DiD
Treatment (D=1)   0      \(Y = Y_{T=0, D=1}\)
                  1      \(Y = Y_{T=0, D=1} + T + D\)     \(T + D\)        \(D\)
Control (D=0)     0      \(Y = Y_{T=0, D=0}\)
                  1      \(Y = Y_{T=0, D=0} + T\)         \(T\)
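Plugging the table entries into the double difference makes the cancellation explicit: the common time effect \(T\) drops out and only the treatment effect \(D\) remains.

\[ \tau_{\text{DiD}} = \underbrace{\big[(Y_{T=0, D=1} + T + D) - Y_{T=0, D=1}\big]}_{T + D} - \underbrace{\big[(Y_{T=0, D=0} + T) - Y_{T=0, D=0}\big]}_{T} = D \]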

Task 1: With the help of the formula/table, compute \(\tau_{\text{DiD}}\). What is the interpretation with regard to our example?

Code
# Step 1: group by treatment group and period
did_p2 <- df_2p |> 
  group_by(ever_treated, post) |> 
  summarise(
    y = mean(downloads)
  ) |> 
  ungroup()

print(did_p2)
# A tibble: 4 × 3
  ever_treated  post     y
         <dbl> <dbl> <dbl>
1            0     0  517.
2            0     1  546.
3            1     0  566.
4            1     1  609.
# Step 2: Extract values and doubly differencing
with(did_p2, {
  y_00 <- y[ever_treated == 0 & post == 0]
  y_01 <- y[ever_treated == 0 & post == 1]
  y_10 <- y[ever_treated == 1 & post == 0]
  y_11 <- y[ever_treated == 1 & post == 1]
  
  (y_11 - y_10) - (y_01 - y_00)
})
[1] 15

Alternatively, we can use regression-based approaches to estimate the treatment effect. Two different approaches yielding equivalent results are:

  1. Two-way fixed effects (TWFE).

\[ Y_{i,t} = \alpha_i + \gamma_t + \tau_{\text{DiD}} (D_i \times t) + \epsilon_{i,t} \]

  2. Regression with a treatment dummy, a time dummy, and the interaction of treatment and time.

\[ Y_{i,t} = \alpha + \beta D_i + \gamma t + \tau_{\text{DiD}} (D_i \times t) + \epsilon_{i,t} \]

Task 2: Perform both approaches and report the estimated coefficient.

Code
# (1)
mod_1 <- lm(downloads ~ ever_treated:post + as.factor(city) + as.factor(month), data = df_2p)

# Coefficient
coef_mod_1 <- mod_1$coefficients["ever_treated:post"]
sprintf("%.2f", coef_mod_1)
[1] "14.86"
Code
# (2)
mod_2 <- lm(downloads ~ ever_treated*post, data = df_2p)
summary(mod_2)

Call:
lm(formula = downloads ~ ever_treated * post, data = df_2p)

Residuals:
   Min     1Q Median     3Q    Max 
-489.1  -54.8   -3.8   55.5  564.1 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)          516.7       11.3   45.93   <2e-16 ***
ever_treated          48.8       15.2    3.20   0.0015 ** 
post                  29.0       15.9    1.82   0.0691 .  
ever_treated:post     14.9       21.6    0.69   0.4908    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 107 on 396 degrees of freedom
Multiple R-squared:  0.0911,    Adjusted R-squared:  0.0842 
F-statistic: 13.2 on 3 and 396 DF,  p-value: 3.01e-08

For two-way fixed effects estimation, we can generally also use the lm() command, which we have already used for many applications. However, it is recommended to use the fixest package due to its speed and its focus on fixed effects. As it is the workhorse of many other packages, let’s get it installed.
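A standard CRAN install is all that is needed; the requireNamespace() guard below is just a convenience (an addition to the tutorial) so the call only runs when the package is missing:

# Install fixest once if it is not available yet
if (!requireNamespace("fixest", quietly = TRUE)) install.packages("fixest")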

The formula notation is slightly different as we separate the fixed effects using the | operator. This approach allows the algorithm to converge faster, and the fixed effects will not appear in the regression summary. Since fixed effects are typically not of primary interest and we often have a large number of them, this simplification is quite convenient.

Code
# Faster way for (1)
library(fixest)
summary(feols(downloads ~ ever_treated:post | city + month, data = df_2p))
OLS estimation, Dep. Var.: downloads
Observations: 400
Fixed-effects: city: 200,  month: 2
Standard-errors: Clustered (city) 
                  Estimate Std. Error t value  Pr(>|t|)    
ever_treated:post       15          3       5 1.399e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 10.4     Adj. R2: 0.98258 
             Within R2: 0.112107
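As a small extension beyond the two-period example, the very same unconditional TWFE specification can be run on the full panel with all eight pre-treatment months; only the data argument changes (a sketch, output not shown here):

# Unconditional TWFE on the full panel (all nine months)
feols(downloads ~ ever_treated:post | city + month, data = df)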

ATT conditional on covariates

TWFE

Task 6: Use the TWFE approach to estimate the ATT under the assumption of parallel trends conditional on the covariates.

Code
feols(downloads ~ post:ever_treated +
                    age:month + user_share:month + population_size:month + competitor_price: month
                  | city + month, data = df)
OLS estimation, Dep. Var.: downloads
Observations: 1,800
Fixed-effects: city: 200,  month: 9
Standard-errors: Clustered (city) 
                       Estimate Std. Error t value   Pr(>|t|)    
post:ever_treated         19.80      2.525     7.8 2.6261e-13 ***
age:month                  3.46      0.296    11.7  < 2.2e-16 ***
month:user_share          49.49      5.244     9.4  < 2.2e-16 ***
month:population_size      0.32      0.048     6.6 4.5016e-10 ***
month:competitor_price     2.21      0.428     5.1 6.2757e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 14.2     Adj. R2: 0.983281
             Within R2: 0.49367 

Doubly robust estimator

Over the last weeks, we have already learned about the advantages of doubly robust estimation. In fact, there is a doubly robust DiD estimator, proposed by Sant’Anna and Zhao (2020) and implemented in the DRDID package.

library(DRDID)

drdid(
  yname = "downloads",
  tname = "post",
  idname = "city",
  dname = "ever_treated",
  xformla = ~age+user_share+population_size+competitor_price,
  data = df_2p
)
 Call:
drdid(yname = "downloads", tname = "post", idname = "city", dname = "ever_treated", 
    xformla = ~age + user_share + population_size + competitor_price, 
    data = df_2p)
------------------------------------------------------------------
 Further improved locally efficient DR DID estimator for the ATT:
 
   ATT     Std. Error  t value    Pr(>|t|)  [95% Conf. Interval] 
 18.6812     3.2087     5.822        0       12.3922    24.9703  
------------------------------------------------------------------
 Estimator based on panel data.
 Outcome regression est. method: weighted least squares.
 Propensity score est. method: inverse prob. tilting.
 Analytical standard error.
------------------------------------------------------------------
 See Sant'Anna and Zhao (2020) for details.

Task 7: Following the same syntax, use the outcome regression adjustment with ordid() and the inverse probability weighting approach ipwdid() to estimate the treatment effect.

Outcome regression

Code
ordid(
  yname = "downloads",
  tname = "post",
  idname = "city",
  dname = "ever_treated",
  xformla = ~age+user_share+population_size+competitor_price,
  data = df_2p
)
 Call:
ordid(yname = "downloads", tname = "post", idname = "city", dname = "ever_treated", 
    xformla = ~age + user_share + population_size + competitor_price, 
    data = df_2p)
------------------------------------------------------------------
 Outcome-Regression DID estimator for the ATT:
 
   ATT     Std. Error  t value    Pr(>|t|)  [95% Conf. Interval] 
 19.2478     3.2017     6.0117       0       12.9724    25.5232  
------------------------------------------------------------------
 Estimator based on panel data.
 Outcome regression est. method: OLS.
 Analytical standard error.
------------------------------------------------------------------
 See Sant'Anna and Zhao (2020) for details.

Inverse probability weighting regression

Code
ipwdid(
  yname = "downloads",
  tname = "post",
  idname = "city",
  dname = "ever_treated",
  xformla = ~age+user_share+population_size+competitor_price,
  data = df_2p
)
 Call:
ipwdid(yname = "downloads", tname = "post", idname = "city", 
    dname = "ever_treated", xformla = ~age + user_share + population_size + 
        competitor_price, data = df_2p)
------------------------------------------------------------------
 IPW DID estimator for the ATT:
 
   ATT     Std. Error  t value    Pr(>|t|)  [95% Conf. Interval] 
 18.7472     3.3073     5.6684       0       12.2648    25.2296  
------------------------------------------------------------------
 Estimator based on panel data.
 Hajek-type IPW estimator (weights sum up to 1).
 Propensity score est. method: maximum likelihood.
 Analytical standard error.
------------------------------------------------------------------
 See Sant'Anna and Zhao (2020) for details.

Assignment

Accept the Week 9 - Assignment and follow the same steps as last week and as described in the organization chapter.

Please also remember to fill out the course evaluation if you have not done so already.

  1. Explain under what circumstances a time-invariant covariate can be a confounder.

  2. Explain under what circumstances a time-varying covariate can be a confounder.

  3. Suppose the variable competitor_price were time-varying, so that for each period we observed the price set by the competitor in that same period. Do you see any risk in using this variable?

  4. Apply the doubly robust estimator with ML from the lecture to the data from the tutorial. Use the models "regr.xgboost" and "classif.xgboost". Additionally, implement cross-fitting (K=5). Report and compare the result to the other estimates.

  5. Check what happens when you simulate a misspecification of the outcome model by replacing delta_mu with sample(delta_mu, replace = FALSE) or a misspecification of the exposure model by replacing ehat with sample(ehat, replace = FALSE). Report and discuss your result.