Last week, we discussed the significant advantage of employing a randomized treatment assignment: it allowed straightforward comparisons of group averages or the application of univariate linear regressions (i.e., with only one variable on the right-hand side), thereby easing statistical inference.
However, what if the treatment isn’t randomized? We’ve discovered that when developing our identification strategy, we must consider confounding variables and implement measures to control for them. How do we do this? In this chapter, you’ll delve into (1) a regression-based approach, (2) a matching-based approach, and (3) a weighting-based approach to address this challenge.
Tip
Many terms for accounting for a confounder are used interchangeably. Among others, depending on the perspective, you could “block a backdoor path”, “condition on a confounder”, “adjust for the confounder”, or “control for the confounder”.
Regression adjustment
In this scenario, similar to the example from the first week, you own an online marketplace and want to assist the businesses on your platform with strategic pricing guidance. Recall, however, that these businesses operate anonymously and set their prices independently, so the treatment is not randomized. Once more, we focus on a particular category of similar products: light jackets. While in the first week we defined the treatment as being on sale, this time we take a more precise approach and use price, a continuous variable, as the treatment.
Downloading, reading, and printing the data, we see that each of the 10,000 rows represents an observation of the daily sales of a particular business, accompanied by additional information such as the weekday and the product rating.
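The loading step itself is not shown here; a minimal sketch, assuming the file is named prices.rds, would be:

# Read data and show (the file name is an assumption)
prices <- readRDS("prices.rds")
print(prices)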
With the regression-based approach, we employ a multivariate linear regression that incorporates the confounding variables. In essence, we estimate the treatment effect conditional on the covariates.
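As a sketch (not the official solution), such a covariate-adjusted regression could be specified as follows; the covariate names weekday and rating are assumptions, as the exact column names are not shown in this excerpt.

# Covariate-adjusted regression; `weekday` and `rating` are assumed column names
mod_adj <- lm(sales ~ price + weekday + rating, data = prices)
summary(mod_adj)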
Following the Frisch–Waugh–Lovell theorem, you can decompose your regression into a three-stage process and obtain identical estimates. The idea behind this is to initially eliminate all variation in \(D\) and \(Y\) that can be explained by the covariates \(X\). Subsequently, you account for the remaining variation in \(Y\) using the remaining variation in \(D\).
Denoising: run a regression of the form \(Y_i = \beta_{Y0} + \mathbf{\beta_{Y \sim X}' X_i} + \epsilon_{Y_i \sim X_i}\) and extract the estimated residuals \(\hat\epsilon_{Y_i \sim X_i}\).
Debiasing: run a regression of the form \(D_i = \beta_{D0} + \mathbf{\beta_{D \sim X}' X_i} + \epsilon_{D_i \sim X_i}\) and extract the estimated residuals \(\hat\epsilon_{D_i \sim X_i}\).
Residual regression: run a residual-on-residual regression of the form \(\hat\epsilon_{Y_i \sim X_i} = \beta_D \hat\epsilon_{D_i \sim X_i} + \epsilon_i\) (no constant).
Q2. Use the three-step procedure described above to obtain the estimate. Check that it is identical to the estimate obtained with lm().
Tip
Hints:
When you run a model with lm(), residuals are automatically computed. You can access them via model_name$residuals. Residuals are the difference between the actual value and the value predicted by the model. By construction of the ordinary least squares regression, residuals are zero on average.
To run a model without a constant/intercept, use y ~ 0 + ....
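Putting the hints together, a minimal sketch of the three-step procedure could look as follows; the covariate names weekday and rating are assumptions and should be replaced by the actual columns in prices.

# (1) Denoising: regress the outcome on the covariates, keep the residuals
mod_Y <- lm(sales ~ weekday + rating, data = prices)
Y_hat <- mod_Y$residuals

# (2) Debiasing: regress the treatment on the covariates, keep the residuals
mod_D <- lm(price ~ weekday + rating, data = prices)
D_hat <- mod_D$residuals

# (3) Residual-on-residual regression without a constant
mod_fwl <- lm(Y_hat ~ 0 + D_hat)
summary(mod_fwl)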
Call:
lm(formula = Y_hat ~ 0 + D_hat)
Residuals:
Min 1Q Median 3Q Max
-53.40 -7.41 0.51 8.22 39.36
Coefficients:
Estimate Std. Error t value Pr(>|t|)
D_hat -2.690 0.121 -22.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12 on 9999 degrees of freedom
Multiple R-squared: 0.0468, Adjusted R-squared: 0.0467
F-statistic: 491 on 1 and 9999 DF, p-value: <2e-16
Let’s visualize what we did in the previous steps and what the FWL theorem makes so intuitive: with the regression-based approach, we account for the confounders by removing the variance in both \(D\) and \(Y\) that is due to \(X\), and then we perform a “clean” residual regression.
If we simply ran a regression without accounting for the confounders, we would obtain what is depicted here: there does not appear to be a relationship between price and sales.
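The code producing this first plot is not shown above; a minimal sketch using the raw price and sales columns could be:

# Scatter plot of raw sales against raw price, without any adjustment
library(ggplot2)
ggplot(prices, aes(x = price, y = sales)) +
  geom_point(alpha = .2) +
  geom_smooth(method = 'lm') +
  labs(x = "Price", y = "Sales")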
However, when we do account for the confounders and regress the residual \(\hat{Y}\) on the residual \(\hat{D}\), we obtain what is expected: a negative correlation. Higher prices lead to fewer sales.
Code
# Add residuals to data frame
prices <- prices |>
  mutate(sales_hat = Y_hat, price_hat = D_hat)

# Plot
ggplot(prices, aes(y = sales_hat, x = price_hat)) +
  geom_point(alpha = .2) +
  geom_smooth(method = 'lm') +
  labs(x = "Price residuals (X_hat)", y = "Sales residuals (Y_hat)")
Q3. Compute the estimate that describes what is shown in the first plot.
Code
# Naive estimate
mod_naive <- lm(sales ~ price, data = prices)
summary(mod_naive)
Call:
lm(formula = sales ~ price, data = prices)
Residuals:
Min 1Q Median 3Q Max
-57.81 -13.04 -1.43 12.61 56.21
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.275 4.627 -8.06 8.7e-16 ***
price 3.038 0.157 19.37 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18 on 9998 degrees of freedom
Multiple R-squared: 0.0362, Adjusted R-squared: 0.0361
F-statistic: 375 on 1 and 9998 DF, p-value: <2e-16
Matching
Imagine another situation. You operate an online shop and, a year ago, introduced a plus membership aimed at retaining customers and driving revenue growth. The plus membership comes at a small cost for the customers, which is why not all of them subscribed. Now you want to examine whether the membership program in fact increases sales among subscribed customers. However, it is essential to consider potential confounding variables such as age, gender, or previous average purchase amounts.
Each row in your dataset corresponds to a different customer, with accompanying demographic details, their average purchase sum before the introduction of the plus membership, their current average purchase, and an indication of their participation in the plus membership program.
# Read data and show
membership <- readRDS("membership.rds")
print(membership)
Now, we’ll delve into the matching-based approach, an alternative to regression for addressing backdoor bias. The concept is to find and match treated and non-treated units with similar or identical covariate values. This method aims to replicate the conditions of a randomized experiment, assuming we have observed all confounding variables. Matching encompasses various techniques aimed at equalizing or balancing the distribution of covariates between the treatment and control groups. Essentially, the objective is to compare “apples to apples” and ensure that treatment and control groups are as similar as possible, except for the treatment variable.
In R, there are several packages available to facilitate matching-based methods, one of which is the MatchIt package. You’ll need to install this package first and then load it.
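A minimal sketch of this setup step:

# Install once if necessary, then load the package
# install.packages("MatchIt")
library(MatchIt)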
By enforcing that the treatment and control groups show little variation in the matching variables, we close the backdoor paths. When a backdoor variable does not vary, or varies only very little, it cannot induce changes in treatment and outcome. So, once we suppress this variation in the backdoor variable, we can interpret the effect of the treatment on the outcome as causal.
One-vs-one matching
The procedure consists of two steps: matching and estimation. Let’s start with one-vs-one matching. Because we specify the \(ATT\), the average treatment effect on the treated, as our estimand, we find for each treated unit the untreated unit that resembles it most closely on the matching variables. To retrieve the \(ATC\), the average treatment effect on the control group (sometimes also called \(ATU\)), we would instead find a treated unit for each untreated unit. Using the summary() function, we can check how well the matching works in terms of covariate balance across the groups.
library(MatchIt)

# (1) Matching
# 1 vs. 1 matching
match_1v1 <- matchit(
  card ~ pre_avg_purch + age + sex,
  data = membership,
  estimand = "ATT",
  method = "nearest",
  distance = "mahalanobis",
  ratio = 1,
  replace = TRUE
)
summary(match_1v1)
Call:
matchit(formula = card ~ pre_avg_purch + age + sex, data = membership,
method = "nearest", distance = "mahalanobis", estimand = "ATT",
replace = TRUE, ratio = 1)
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
pre_avg_purch 39.06 34.0 0.431 1.1 0.121 0.194
age 43.39 39.0 0.307 1.1 0.079 0.137
sex 0.52 0.5 0.035 . 0.017 0.017
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
pre_avg_purch 39.06 38.80 0.023 1.1 0.007 0.026 0.091
age 43.39 43.09 0.021 1.1 0.008 0.029 0.098
sex 0.52 0.52 0.000 . 0.000 0.000 0.000
Sample Sizes:
Control Treated
All 547 453
Matched (ESS) 210 453
Matched 288 453
Unmatched 259 0
Discarded 0 0
The next step, the estimation, is performed with the familiar lm() command. But we need the matched data, which includes the matched rows and the corresponding weights.
# Use matched data
df_1v1 <- match.data(match_1v1)
print(df_1v1)
Now, we can simply run a regression. Note that we include only the treatment variable as a predictor, because we already closed the backdoor paths in the matching step.
Q4. Run the estimation step of one-vs-one matching.
Code
# (2) Estimation
mod_1v1 <- lm(avg_purch ~ card, data = df_1v1, weights = weights)
summary(mod_1v1)
Call:
lm(formula = avg_purch ~ card, data = df_1v1, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-62.35 -13.26 -0.87 11.37 75.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.9 1.1 36.4 <2e-16 ***
card 15.2 1.4 10.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19 on 739 degrees of freedom
Multiple R-squared: 0.138, Adjusted R-squared: 0.137
F-statistic: 118 on 1 and 739 DF, p-value: <2e-16
Q5. Compute the \(ATC\) (or: \(ATU\)) and the \(ATE\).
Code
# ATU:
# (1) Matching
# 1 vs. 1 matching
match_1v1_atu <- matchit(
  card ~ pre_avg_purch + age + sex,
  data = membership,
  estimand = "ATC",
  method = "nearest",
  distance = "mahalanobis",
  ratio = 1,
  replace = TRUE
)
summary(match_1v1_atu)
Call:
matchit(formula = card ~ pre_avg_purch + age + sex, data = membership,
method = "nearest", distance = "mahalanobis", estimand = "ATC",
replace = TRUE, ratio = 1)
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
pre_avg_purch 39.06 34.0 0.450 1.1 0.121 0.194
age 43.39 39.0 0.328 1.1 0.079 0.137
sex 0.52 0.5 0.035 . 0.017 0.017
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
pre_avg_purch 34.2 34.0 0.019 0.95 0.008 0.031 0.097
age 39.1 39.0 0.001 0.94 0.008 0.033 0.096
sex 0.5 0.5 0.000 . 0.000 0.000 0.000
Sample Sizes:
Control Treated
All 547 453
Matched (ESS) 547 197
Matched 547 288
Unmatched 0 165
Discarded 0 0
Code
# Use matched data
df_1v1_atu <- match.data(match_1v1_atu)
print(df_1v1_atu)

# (2) Estimation
mod_1v1_atu <- lm(avg_purch ~ card, data = df_1v1_atu, weights = weights)
summary(mod_1v1_atu)
Call:
lm(formula = avg_purch ~ card, data = df_1v1_atu, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-74.57 -11.76 0.87 12.16 66.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.844 0.791 44.0 <2e-16 ***
card 16.947 1.347 12.6 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18 on 833 degrees of freedom
Multiple R-squared: 0.16, Adjusted R-squared: 0.159
F-statistic: 158 on 1 and 833 DF, p-value: <2e-16
Code
# ATE = p_T * ATT + p_C * ATC
ATE <- weighted.mean(
  c(mod_1v1$coefficients[2], mod_1v1_atu$coefficients[2]),
  c(453, 547)
)
print(ATE)
[1] 16
One-vs-many matching
Sometimes, you want to match more than just one unit. For example, imagine that you have relatively few treated units but a large number of untreated units. To leverage all the information contained in the untreated units, matching several of them to each treated unit can increase the validity and precision of your estimate.
Note
The R package MatchIt offers a wide range of matching methods and we only scratch the surface of what is possible. Feel free to have a look at the documentation for further information.
Q6. Run one-vs-many matching: find 3 matches for each of the treated units.
Code
M <- 3

# (1) Matching
# One-vs-many matching
match_1vM <- matchit(
  card ~ pre_avg_purch + age + sex,
  data = membership,
  estimand = "ATT",
  method = "nearest",
  distance = "mahalanobis",
  ratio = M,
  replace = TRUE
)
summary(match_1vM)
Call:
matchit(formula = card ~ pre_avg_purch + age + sex, data = membership,
method = "nearest", distance = "mahalanobis", estimand = "ATT",
replace = TRUE, ratio = M)
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
pre_avg_purch 39.06 34.0 0.431 1.1 0.121 0.194
age 43.39 39.0 0.307 1.1 0.079 0.137
sex 0.52 0.5 0.035 . 0.017 0.017
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
pre_avg_purch 39.06 38.71 0.030 1.1 0.008 0.037 0.13
age 43.39 43.01 0.027 1.1 0.008 0.028 0.13
sex 0.52 0.52 0.000 . 0.000 0.000 0.00
Sample Sizes:
Control Treated
All 547 453
Matched (ESS) 326 453
Matched 462 453
Unmatched 85 0
Discarded 0 0
Code
# Use matched data
df_1vM <- match.data(match_1vM)
print(df_1vM)

# (2) Estimation
matchit_mod_1vM <- lm(avg_purch ~ card, data = df_1vM, weights = weights)
summary(matchit_mod_1vM)
Call:
lm(formula = avg_purch ~ card, data = df_1vM, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-59.56 -13.31 -1.38 10.49 74.44
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.870 0.859 46.4 <2e-16 ***
card 15.300 1.222 12.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18 on 913 degrees of freedom
Multiple R-squared: 0.147, Adjusted R-squared: 0.146
F-statistic: 157 on 1 and 913 DF, p-value: <2e-16
Propensity score matching
When employing approaches based on covariate matching, you quickly run into the curse of dimensionality: as the number of covariates grows, it becomes increasingly difficult, and eventually nearly impossible, to find suitable matches.
Consider a scenario with two covariates, each with five distinct values. In this case, observations fall into one of 25 cells defined by the covariate value grid. Now envision ten covariates with three distinct values each: this already creates \(3^{10} = 59{,}049\) cells. It then becomes very likely that many cells contain only one or zero observations, making it impossible to find matches for numerous observations.
One approach to tackle this issue is the use of propensity scores. The propensity score is the predicted probability of treatment assignment given the matching variables. In our case, we use age, gender, and previous average purchases to predict the likelihood of a customer participating in the membership program; specifically, customers who spend more on average are expected to have a higher likelihood of participating. To model this relationship, we employ logistic regression, a technique that produces predictions between zero and one, which serve as the propensity scores.
Q7. Run the matching using the propensity score as the matching variable. First, for each unit, compute the propensity of being treated.
Code
# Estimate propensity to be treated
mod_ps <- glm(
  card ~ pre_avg_purch + age + sex,
  family = binomial(link = 'logit'),
  data = membership
)

# Extract propensity score
membership <- membership |>
  mutate(propensity = mod_ps$fitted)  # or: mod_ps$fitted.values
summary(membership$propensity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.17 0.37 0.45 0.45 0.53 0.78
Q8. Having obtained the propensity score, use it as the matching variable.
Code
# (1) Matching
match_ps <- matchit(
  card ~ propensity,
  data = membership,
  estimand = "ATT",
  method = "nearest",
  distance = "mahalanobis",
  ratio = 1,
  replace = TRUE
)
summary(match_ps)
Call:
matchit(formula = card ~ propensity, data = membership, method = "nearest",
distance = "mahalanobis", estimand = "ATT", replace = TRUE,
ratio = 1)
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
propensity 0.48 0.43 0.44 1.2 0.12 0.2
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
propensity 0.48 0.48 0 1 0.001 0.015 0.008
Sample Sizes:
Control Treated
All 547 453
Matched (ESS) 186 453
Matched 262 453
Unmatched 285 0
Discarded 0 0
Code
# Equivalent:
# match_ps2 <- matchit(
#   card ~ propensity,
#   data = membership,
#   estimand = "ATT",
#   ratio = 1,
#   replace = TRUE
# )

# Use matched data
df_ps <- match.data(match_ps)
print(df_ps)

# (2) Estimation
mod_ps <- lm(avg_purch ~ card, data = df_ps, weights = weights)
summary(mod_ps)
Call:
lm(formula = avg_purch ~ card, data = df_ps, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-59.56 -13.29 -0.06 11.43 123.33
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.63 1.18 33.6 <2e-16 ***
card 15.54 1.48 10.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19 on 713 degrees of freedom
Multiple R-squared: 0.134, Adjusted R-squared: 0.132
F-statistic: 110 on 1 and 713 DF, p-value: <2e-16
Note
Instead of running a separate logistic regression to compute the propensity scores, you could also provide the arguments distance = "glm" and link = "logit" to the matchit() function, which then estimates the propensity scores internally.
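A sketch of such an integrated call, mirroring the specification used above (match_ps_glm is just an illustrative name):

# Let matchit() estimate the propensity score internally via logistic regression
match_ps_glm <- matchit(
  card ~ pre_avg_purch + age + sex,
  data = membership,
  estimand = "ATT",
  method = "nearest",
  distance = "glm",
  link = "logit",
  ratio = 1,
  replace = TRUE
)
summary(match_ps_glm)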
It’s important to note that while propensity score matching effectively mitigates the curse of dimensionality, a fundamental issue arises: having the same propensity score does not guarantee that observations share identical covariate values. Conversely, identical covariate values do imply the same propensity score. This inherent asymmetry raises concerns among prominent statisticians, leading to criticism of propensity score matching as a robust identification strategy.1
Inverse probability weighting
Step-by-step approach
Inverse probability weighting (IPW), in contrast, has proven to be a more precise method than matching approaches, particularly when the sample is large enough. So what do we do with the propensity scores in IPW? We use the propensity score of an observation to increase or decrease its weight and thereby make some observations more important than others. The weight is obtained as

\[
w_i = \frac{D_i}{\pi_i} + \frac{1 - D_i}{1 - \pi_i},
\]

where \(\pi_i\) denotes the propensity score and only one of the two terms is active for each unit, as \(D_i\) is either one or zero. Now the name “inverse probability weighting” becomes clear: each observation is weighted by the inverse of the probability of the treatment status it actually received. Let’s proceed to calculate this for our dataset.
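As a quick illustration with made-up numbers: a treated unit with a propensity of 0.25 receives a weight of 4, while an untreated unit with the same propensity receives a weight of about 1.33.

# Illustration of the IPW formula with hypothetical values
D    <- c(1, 0)         # treatment status
prop <- c(0.25, 0.25)   # propensity scores
D / prop + (1 - D) / (1 - prop)
#> [1] 4.000000 1.333333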
Q9. Given the formula, calculate the weights for each observation.
Code
# Calculate inverse probability weights
membership <- membership |>
  mutate(ipw = (card / propensity) + (1 - card) / (1 - propensity))
summary(membership$ipw)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.2 1.6 1.9 2.0 2.3 4.4
We need to provide these calculated weights as an additional argument to lm() using the weights argument.
Code
# Regression with inverse probability weighting
model_ipw <- lm(avg_purch ~ card, data = membership, weights = ipw)
summary(model_ipw)
Call:
lm(formula = avg_purch ~ card, data = membership, weights = ipw)
Weighted Residuals:
Min 1Q Median 3Q Max
-107.34 -18.56 -0.18 17.59 128.59
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.245 0.847 44.0 <2e-16 ***
card 15.413 1.197 12.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27 on 998 degrees of freedom
Multiple R-squared: 0.142, Adjusted R-squared: 0.142
F-statistic: 166 on 1 and 998 DF, p-value: <2e-16
Integrated approach
For demonstration and learning purposes, we split the procedure into two steps, but there are external packages that combine both steps, e.g. the causalweight package with its function treatweight().
library(causalweight)

# IPW estimation
model_ipw_int <- treatweight(
  y = membership$avg_purch,
  d = membership$card,
  x = membership[, c("pre_avg_purch", "age", "sex")]
)
model_ipw_int
Accept the Week 4 - Assignment and follow the same steps as last week and as described in the organization chapter.
Regression adjustment
When discussing your results from the regression adjustment for the sales of light jackets with your team, one of your analysts suggests including the temperature at the customer’s location in your analysis. Because you know the customers’ locations, you can look up the temperature and add it to the data. Load the data prices_new.rds.
Check whether adjusting for temperature improves your analysis using e.g. \(R^2\) or hypothesis tests.
Do you think the relationship between sales and temperature is linear, or do you suspect a non-linear relationship? Argue whether including a quadratic term in the functional form via y ~ x + I(x^2) + ... makes sense.
Matching
For the next tasks, you are going to use the data set health_program.rds, which you should already be familiar with from the assignment two weeks ago. Identify the matching variables and perform
one-vs-one nearest-neighbor matching for \(ATT\) and \(ATU\),
and propensity score matching for \(ATE\).
Look at the matched data frames and explain the different number of rows.
Inverse probability weighting
Run a logistic regression to estimate the treatment propensity.
Add the propensity scores to your data frame, compute the inverse probability weights, and sort your data frame by the weights using df |> arrange(var) or df |> arrange(-var). Take a look at the units with the highest or lowest weights. What do you notice? Use the logistic regression summary to argue why these observations have such high/low weights.
Run the regression based on the calculated weights and compare it to the estimates from the matching estimators.