So far, we have primarily discussed cross-sectional data, where each unit is observed only once. However, sometimes you can observe units over multiple time periods, resulting in panel data. This allows us to see how a unit changes before and after a treatment. When randomization isn’t feasible, panel data is often one of the most credible alternatives for identifying causal effects. The most well-known method for estimating this effect is called difference-in-differences (DiD).
Panel Data
In marketing, randomization is often impractical because you cannot always control who receives the treatment. For instance, let’s say you decide to place billboards in selected cities to advertise your app, aiming to measure the number of downloads attributed to these billboards. You will advertise in some cities but not in others, creating a geo experiment. Additionally, you will track several time-invariant covariates:
\(age_i\): average age in city \(i\)
\(population\_size_i\): integer indicating the population size of city \(i\)
\(user\_share_i\): share of app users in city \(i\) prior to the observation period
\(competitor\_price_i\): average competitor app price in city \(i\)
Unlike your competitor’s app, yours is free to download, and you plan to generate revenue within the app. Our observed units are cities \(i\) over multiple time periods \(t\). Let’s have a look at the data.
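For instance, we can inspect the rows for the first city (a minimal sketch; the panel is stored in the data frame df):
Code
df |> filter(city == 1)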
We see that for city \(i=1\), we have eight months with \(post = 0\) and one month with \(post = 1\). The outcome changes over time, but the covariates are time-invariant. Please also note that the indicator ever_treated records whether a unit was “ever” treated, so it is constant within a city.
When counting the number of control and treatment units per period, we’ll see that we have a complete panel without any missing data.
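A quick way to check this is to tabulate the groups per period:
Code
df |> count(month, ever_treated)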
You will not always have several pre-treatment periods, and DiD already works when you observe only one period before and one period after the treatment. So let’s imagine that scenario and later re-introduce the other pre-treatment periods.
# Last pre-treatment and post-treatment period
df_2p <- df |> filter(month %in% 8:9)

# Print
print(df_2p |> count(month, ever_treated, post))
Because we observe both control and treated units before and after the treatment, we can compute the treatment effect \(\tau_{\text{DiD}}\) using the difference-in-differences method. This involves first calculating the difference in outcomes between the treated and control groups after the treatment, then subtracting the difference in outcomes between these groups before the treatment. This double differencing helps isolate the treatment effect.
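Written out, with \(t = 0\) denoting the pre-treatment period, \(t = 1\) the post-treatment period, and \(D_i\) the treatment indicator:
\[
\tau_{\text{DiD}} = \left( E[Y_{i,1} \mid D_i = 1] - E[Y_{i,1} \mid D_i = 0] \right) - \left( E[Y_{i,0} \mid D_i = 1] - E[Y_{i,0} \mid D_i = 0] \right)
\]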
Instead of differencing by hand, we can estimate \(\tau_{\text{DiD}}\) with a regression: either (1) with unit and time fixed effects plus the interaction of treatment and time, or (2) with a treatment dummy, a time dummy, and their interaction:
\[
Y_{i,t} = \alpha + \beta D_i + \gamma \, \text{Post}_t + \tau_{\text{DiD}} (D_i \times \text{Post}_t) + \epsilon_{i,t}
\]
Task 2: Perform both approaches and report the estimated coefficient.
Code
# (1)
mod_1 <- lm(downloads ~ ever_treated:post + as.factor(city) + as.factor(month), data = df_2p)

# Coefficient
coef_mod_1 <- mod_1$coefficients["ever_treated:post"]
sprintf("%.2f", coef_mod_1)
[1] "14.86"
Code
# (2)
mod_2 <- lm(downloads ~ ever_treated * post, data = df_2p)
summary(mod_2)
Call:
lm(formula = downloads ~ ever_treated * post, data = df_2p)
Residuals:
Min 1Q Median 3Q Max
-489.1 -54.8 -3.8 55.5 564.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 516.7 11.3 45.93 <2e-16 ***
ever_treated 48.8 15.2 3.20 0.0015 **
post 29.0 15.9 1.82 0.0691 .
ever_treated:post 14.9 21.6 0.69 0.4908
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 107 on 396 degrees of freedom
Multiple R-squared: 0.0911, Adjusted R-squared: 0.0842
F-statistic: 13.2 on 3 and 396 DF, p-value: 3.01e-08
For two-way fixed effects (TWFE) estimation, we can generally also use the lm() command, which we have used for so many applications already. However, it is recommended to use the fixest package due to its speed and its focus on fixed effects. As it is the workhorse of many other packages, let’s get it installed.
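If you have not done so yet:
Code
install.packages("fixest")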
The formula notation is slightly different as we separate the fixed effects using the | operator. This approach allows the algorithm to converge faster, and the fixed effects will not appear in the regression summary. Since fixed effects are typically not of primary interest and we often have a large number of them, this simplification is quite convenient.
Code
# Faster way for (1)
library(fixest)
summary(feols(downloads ~ ever_treated:post | city + month, data = df_2p))
Here, where we only have two time periods, we need to rely on one assumption to hold: the parallel trends assumption. In methods where we just know one outcome - the “after” outcome, regardless of whether a unit received the treatment or not - we have to assume that the potential outcomes are independent of treatment, which means assuming that observation units are similar in all their characteristics. DiD does not require this. Instead, it hinges on a different assumption, the parallel trends assumption: \((Y_1(0)-Y_0(0)) \perp\!\!\!\perp T\). It says that, in the absence of treatment, both groups would be expected to evolve similarly over time. In other words, we do not require the potential outcomes to be similar, but only the change in outcomes from before to after. This implies that there is no factor that has an impact on just one of the groups.
While the parallel trends assumption is generally untestable, in the case with only two periods, we are particularly in the dark.
Graphically, what we assume is that the treated line would have evolved similarly to the control line - as depicted by the dotted line. What can we do to increase the plausibility of this assumption?
Code
# Group means: treated pre (y10), control post (y01), control pre (y00)
y10 <- did_p2[did_p2$ever_treated == 1 & did_p2$post == 0, ]$y
y01 <- did_p2[did_p2$ever_treated == 0 & did_p2$post == 1, ]$y
y00 <- did_p2[did_p2$ever_treated == 0 & did_p2$post == 0, ]$y

# Counterfactual: treated pre-treatment level plus the control group's change
data_cf <- tibble(
  x = c(0, 1),
  y = c(y10, y10 + (y01 - y00))
)

ggplot(did_p2, aes(x = post, y = y, group = ever_treated, color = as.factor(ever_treated))) +
  geom_point() +
  geom_line() +
  geom_line(data = data_cf, aes(x = x, y = y, group = NA), color = "#5AFFC5", linetype = "dotted") +
  geom_vline(xintercept = .5, linetype = "dashed") +
  scale_x_continuous(breaks = c(0, 1)) +
  theme(legend.position = "bottom")
Testing for parallel trends
There is some remedy when we are able to observe multiple periods prior to the treatment. By comparing trends before treatment between the treatment and control groups, researchers aim to demonstrate that there was no significant difference prior to the treatment. The logic is that if there was no difference before treatment, any difference observed after the treatment is likely due to the treatment itself.
However, even that cannot provide full certainty about the parallel trends assumption. There may still be unobserved factors that begin to affect only one of the groups after the treatment. Nevertheless, event studies are a useful tool for arguing that the treatment and control groups are comparable.
Let’s now switch back to the dataset with multiple pre-treatment periods and begin by plotting the observed outcomes over time. This will help us determine if there are converging or diverging trends between the treatment and control groups. By visualizing these trends, we can already assess whether the parallel trends assumption holds before the treatment.
# Outcomes across time by group
df |>
  ggplot(aes(
    x = month, y = downloads,
    fill = as.factor(ever_treated),
    color = as.factor(ever_treated)
  )) +
  stat_summary(fun.data = mean_se, geom = "point", alpha = .8) +
  stat_summary(fun.data = mean_se, geom = "line", alpha = .8) +
  stat_summary(fun.data = mean_se, geom = "ribbon", alpha = .2, color = NA) +
  geom_vline(xintercept = 8.5, linetype = "dashed") +
  scale_x_continuous(breaks = 1:9) +
  theme(legend.position = "bottom")
At first glance, prior to the treatment, the outcomes seem to evolve in parallel. However, a better way is to actually perform statistical tests and check the hypothesis that the evolution of both groups is parallel: for each difference between two pre-treatment periods, there should be no difference in the evolution of the outcomes across the treatment groups.
Task 3a: Subset the data so you can perform a test for pre-trends between the pre-treatment months 7 and 8 using the TWFE approach from above.
Code
df_placebo <- df |>
  filter(month %in% c(7, 8)) |>
  mutate(post = if_else(month == 8, 1, 0))

summary(feols(downloads ~ ever_treated:post | city + month, data = df_placebo))
With this approach, you would have to iterate through quite a lot of combinations. Another approach would be to add “fake” post columns to the data frame and interact them with the treatment indicator.
Task 4: Subset the data so you can perform tests for pre-trends between ALL pairs of pre-treatment periods using the TWFE approach from above. You’ll need to create new columns with e.g. post_t = if_else(month == t, 1, 0), where t is a placebo treatment period, and introduce new interactions into the formula. A possible sketch is shown below.
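A minimal sketch of this approach, assuming month 1 as the reference period (one possible layout, not the official solution):
Code
# Placebo "post" dummies for the pre-treatment months 2 to 8 (month 1 is the reference)
df_pre <- df |>
  filter(month <= 8) |>
  mutate(
    post_2 = if_else(month == 2, 1, 0),
    post_3 = if_else(month == 3, 1, 0),
    post_4 = if_else(month == 4, 1, 0),
    post_5 = if_else(month == 5, 1, 0),
    post_6 = if_else(month == 6, 1, 0),
    post_7 = if_else(month == 7, 1, 0),
    post_8 = if_else(month == 8, 1, 0)
  )

# Each interaction tests whether the treated group's change relative to month 1 differs
summary(feols(
  downloads ~ ever_treated:post_2 + ever_treated:post_3 + ever_treated:post_4 +
    ever_treated:post_5 + ever_treated:post_6 + ever_treated:post_7 +
    ever_treated:post_8 | city + month,
  data = df_pre
))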
An even more programmatic way to test for pre-trends is to run an event study to perform the tests all in one command. We only have to slightly change the input. Instead of the single treatment indicator ever_treated:post, which is only active for the treated group after treatment, we include several interactions which cover all periods for the treated group.
We can do so with the help of i() from the fixest package and a newly created variable that measures the time distance to the treatment period.
Code
# Create time distance column
treat_month <- 9
df <- df |>
  mutate(time_to_treat = if_else(ever_treated == 1, month - treat_month, 0))

evt_stdy <- feols(downloads ~ i(time_to_treat, ever_treated, ref = -1) | city + month, data = df)
summary(evt_stdy)
Using iplot() we can visualize the multiple treatment effects. Any significant effect prior to the treatment is a big cause for concern with regard to the parallel trends assumption.
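For example, passing the fitted model directly:
Code
iplot(evt_stdy)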
Now we have seen that we cannot assume parallel trends - there are significant differences in the evolution of outcomes. But what can we do to increase the plausibility of assuming parallel trends? One approach that is widely used in practice is the inclusion of covariates. But as you have learned in the lecture, you have to be extremely careful.
What you obtain then is the conditional parallel trends assumption, which is the assumption of parallel trends conditional on covariates.
# Time-invariant covariates, but time-varying effects
evt_stdy_cond_2 <- feols(
  downloads ~ i(time_to_treat, ever_treated, ref = -1) +
    age:month + user_share:month + population_size:month + competitor_price:month |
    city + month,
  data = df
)
evt_stdy_cond_2
Over the last weeks, we have already learned the advantages of doubly robust estimation, and in fact, there is a doubly robust DiD estimator, proposed by Sant’Anna and Zhao (2020) and implemented in the DRDID package.
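The corresponding call, using the two-period data df_2p (reconstructed from the printed Call below):
Code
library(DRDID)

drdid(
  yname = "downloads", tname = "post", idname = "city", dname = "ever_treated",
  xformla = ~ age + user_share + population_size + competitor_price,
  data = df_2p
)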
Call:
drdid(yname = "downloads", tname = "post", idname = "city", dname = "ever_treated",
xformla = ~age + user_share + population_size + competitor_price,
data = df_2p)
------------------------------------------------------------------
Further improved locally efficient DR DID estimator for the ATT:
ATT Std. Error t value Pr(>|t|) [95% Conf. Interval]
18.6812 3.2087 5.822 0 12.3922 24.9703
------------------------------------------------------------------
Estimator based on panel data.
Outcome regression est. method: weighted least squares.
Propensity score est. method: inverse prob. tilting.
Analytical standard error.
------------------------------------------------------------------
See Sant'Anna and Zhao (2020) for details.
Task 7: Following the same syntax, use the outcome regression adjustment with ordid() and the inverse probability weighting approach ipwdid() to estimate the treatment effect.
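The calls mirror drdid() (reconstructed from the printed Calls below):
Code
# Outcome regression DiD
ordid(
  yname = "downloads", tname = "post", idname = "city", dname = "ever_treated",
  xformla = ~ age + user_share + population_size + competitor_price,
  data = df_2p
)

# Inverse probability weighting DiD
ipwdid(
  yname = "downloads", tname = "post", idname = "city", dname = "ever_treated",
  xformla = ~ age + user_share + population_size + competitor_price,
  data = df_2p
)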
Call:
ordid(yname = "downloads", tname = "post", idname = "city", dname = "ever_treated",
xformla = ~age + user_share + population_size + competitor_price,
data = df_2p)
------------------------------------------------------------------
Outcome-Regression DID estimator for the ATT:
ATT Std. Error t value Pr(>|t|) [95% Conf. Interval]
19.2478 3.2017 6.0117 0 12.9724 25.5232
------------------------------------------------------------------
Estimator based on panel data.
Outcome regression est. method: OLS.
Analytical standard error.
------------------------------------------------------------------
See Sant'Anna and Zhao (2020) for details.
Call:
ipwdid(yname = "downloads", tname = "post", idname = "city",
dname = "ever_treated", xformla = ~age + user_share + population_size +
competitor_price, data = df_2p)
------------------------------------------------------------------
IPW DID estimator for the ATT:
ATT Std. Error t value Pr(>|t|) [95% Conf. Interval]
18.7472 3.3073 5.6684 0 12.2648 25.2296
------------------------------------------------------------------
Estimator based on panel data.
Hajek-type IPW estimator (weights sum up to 1).
Propensity score est. method: maximum likelihood.
Analytical standard error.
------------------------------------------------------------------
See Sant'Anna and Zhao (2020) for details.
Assignment
Accept the Week 9 - Assignment and follow the same steps as last week and as described in the organization chapter.
Please also remember to fill out the course evaluation if you have not done so already.
Explain under what circumstances a time-invariant covariate can be a confounder.
Explain under what circumstances a time-varying covariate can be a confounder.
Suppose the variable competitor_price were time-varying, and for each period we observed the price set by the competitor in that same period. Do you see any risk in using this variable?
Apply the doubly robust estimator with ML from the lecture to the data from the tutorial. Use the models "regr.xgboost" and "classif.xgboost". Additionally, implement cross-fitting (K=5). Report and compare the result to the other estimates.
Check what happens when you simulate a misspecification of the outcome model by replacing delta_mu with sample(delta_mu, replace = FALSE) or a misspecification of the exposure model by replacing ehat with sample(ehat, replace = FALSE). Report and discuss your result.