5 - Double Machine Learning

Slides & Recap

Last week, we delved into strategies for managing observed confounders, exploring techniques such as estimating conditional outcomes and re-weighting data based on propensity scores. Building on that foundation, we’ll now explore doubly robust methods. These approaches combine modeling of both the conditional outcome and the likelihood of treatment, offering the advantage of consistency even if only one model is accurately specified. Additionally, we’ll introduce machine learning (ML) models into our toolkit. ML enables us to perform data-driven model selection and tackle complex non-linear relationships while still allowing for statistical inference. Primarily, ML methods help us model nuisance parameters — factors that aren’t our main focus but assist in estimating our target parameter: the treatment’s effect on an outcome.

Practical example

As a member of the marketing team at an online retailer, your objective is to identify customers who are receptive to marketing emails. While recognizing the potential for increased spending associated with these emails, you’re also aware that some customers prefer not to receive them. To address this challenge, your aim is to estimate the average treatment effect (ATE) of sending such emails on customers’ future purchase volume. This estimation will enable your team to make informed decisions about email recipients.

Tip

Today, you will download an R script and the data to answer the questions throughout the tutorial.

# Load tidyverse package
library(tidyverse)
# Read data
email <- read_csv("email_obs.csv")

# Data overview
glimpse(email)
Rows: 10,000
Columns: 27
$ mkt_email          <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0…
$ next_mnth_pv       <dbl> 1764.62, 162.21, 42.33, 12.41, 1398.49, 1088.87, 1485.70, 812.83, 1091.01, 2081.43, 984.25, 1755.68, 1.75, 1525.07, 739.7…
$ age                <dbl> 34, 65, 26, 31, 36, 68, 27, 23, 46, 27, 35, 37, 30, 31, 24, 29, 23, 49, 46, 28, 49, 47, 34, 41, 42, 24, 28, 34, 55, 27, 3…
$ tenure             <dbl> 2, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 2, 0, 0…
$ ammount_spent      <dbl> 230.31, 142.41, 103.19, 10.25, 30.34, 78.16, 178.64, 37.14, 182.87, 136.78, 4.75, 28.81, 0.68, 92.45, 40.65, 192.17, 186.…
$ vehicle            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ food               <dbl> 2, 2, 0, 0, 1, 2, 0, 1, 0, 1, 1, 1, 0, 1, 2, 0, 0, 2, 0, 2, 1, 1, 3, 1, 4, 1, 0, 1, 2, 0, 0, 1, 0, 2, 1, 1, 1, 0, 2, 1, 1…
$ beverage           <dbl> 2, 3, 0, 1, 0, 0, 1, 0, 0, 2, 1, 1, 0, 0, 1, 3, 0, 0, 1, 0, 1, 1, 1, 0, 1, 2, 1, 0, 2, 2, 2, 1, 0, 2, 1, 0, 3, 1, 0, 0, 0…
$ art                <dbl> 1, 0, 1, 0, 0, 0, 2, 0, 2, 3, 0, 0, 3, 0, 1, 3, 0, 0, 4, 2, 2, 1, 1, 1, 1, 1, 2, 1, 0, 0, 0, 0, 2, 0, 2, 0, 1, 1, 1, 2, 0…
$ baby               <dbl> 0, 3, 2, 1, 1, 1, 1, 2, 1, 0, 1, 0, 3, 0, 0, 1, 1, 2, 1, 1, 1, 0, 2, 1, 1, 4, 1, 3, 1, 3, 2, 1, 3, 1, 0, 0, 2, 2, 3, 1, 2…
$ personal_care      <dbl> 2, 0, 0, 1, 0, 0, 2, 0, 2, 1, 3, 1, 2, 2, 1, 2, 3, 2, 1, 0, 0, 1, 1, 2, 0, 0, 1, 2, 0, 0, 0, 4, 1, 0, 1, 2, 2, 1, 0, 0, 2…
$ toys               <dbl> 0, 1, 2, 0, 0, 0, 1, 1, 1, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 2, 3, 0, 1, 2, 0, 3, 1, 1, 0, 1, 0, 0, 0, 1, 2, 1, 0, 0, 1, 1…
$ clothing           <dbl> 1, 3, 4, 1, 1, 4, 2, 1, 1, 4, 2, 2, 2, 3, 1, 0, 1, 0, 2, 2, 1, 0, 0, 1, 1, 3, 0, 3, 3, 1, 0, 6, 3, 0, 3, 8, 7, 3, 2, 1, 1…
$ decor              <dbl> 0, 0, 1, 2, 1, 0, 1, 1, 0, 1, 2, 5, 1, 2, 0, 1, 0, 2, 0, 1, 3, 0, 2, 0, 0, 0, 0, 1, 0, 3, 0, 1, 1, 0, 3, 0, 0, 0, 0, 0, 2…
$ cell_phones        <dbl> 4, 1, 2, 3, 3, 1, 5, 1, 3, 4, 0, 3, 1, 3, 2, 0, 1, 1, 4, 3, 1, 2, 1, 4, 2, 6, 3, 5, 3, 1, 2, 1, 5, 4, 6, 3, 5, 2, 1, 3, 4…
$ construction       <dbl> 3, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 2, 1, 1, 2, 2, 2, 0, 2, 1, 2, 0, 1, 0, 2, 2, 0, 2, 1, 0, 1, 1, 1, 2, 1, 0, 3, 0, 2, 0, 0…
$ home_appliances    <dbl> 0, 1, 1, 2, 1, 1, 1, 0, 1, 1, 2, 2, 2, 3, 1, 0, 1, 0, 0, 1, 1, 2, 0, 2, 0, 0, 1, 0, 2, 1, 1, 2, 2, 3, 2, 0, 1, 1, 2, 0, 1…
$ electronics        <dbl> 2, 4, 2, 1, 4, 0, 0, 3, 1, 5, 0, 5, 0, 4, 2, 2, 2, 3, 1, 3, 1, 1, 3, 1, 1, 2, 4, 2, 3, 3, 2, 2, 1, 1, 0, 1, 0, 1, 1, 3, 3…
$ sports             <dbl> 0, 3, 1, 0, 0, 1, 0, 2, 2, 1, 1, 0, 0, 1, 1, 0, 0, 1, 2, 1, 0, 0, 1, 0, 1, 0, 0, 1, 2, 1, 0, 0, 3, 0, 2, 0, 2, 0, 1, 1, 2…
$ tools              <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 2, 0, 0, 3, 0, 1, 0, 3, 0, 4, 1, 0, 1, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0…
$ games              <dbl> 2, 3, 2, 0, 1, 2, 2, 0, 2, 1, 2, 1, 2, 3, 2, 1, 2, 1, 5, 2, 1, 2, 1, 0, 3, 4, 3, 0, 2, 1, 1, 0, 1, 3, 1, 0, 1, 2, 1, 0, 4…
$ industry           <dbl> 0, 1, 2, 0, 3, 1, 4, 1, 1, 2, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 1, 1, 3, 2, 1, 2, 2…
$ pc                 <dbl> 2, 3, 5, 2, 0, 3, 3, 2, 2, 3, 3, 4, 3, 2, 0, 1, 0, 2, 0, 4, 3, 2, 0, 3, 2, 2, 3, 3, 1, 0, 1, 1, 2, 0, 2, 2, 3, 3, 1, 0, 3…
$ jewel              <dbl> 0, 0, 1, 3, 2, 1, 0, 0, 0, 2, 0, 3, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 0, 2, 1, 3, 1, 3, 3, 1, 1, 1, 0, 0, 1, 0, 2, 1…
$ books              <dbl> 0, 0, 2, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 0, 1, 1, 0, 0, 0, 3, 1, 2, 0, 0, 1, 0, 1, 2, 0, 0, 1, 2, 1, 1, 1, 1, 3, 0, 0, 2, 2…
$ music_books_movies <dbl> 3, 2, 1, 2, 0, 3, 1, 0, 1, 0, 1, 0, 3, 2, 1, 1, 0, 1, 3, 0, 0, 1, 0, 1, 1, 1, 1, 2, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 2…
$ health             <dbl> 2, 2, 1, 2, 2, 1, 1, 0, 2, 4, 0, 2, 5, 2, 0, 2, 1, 2, 2, 4, 3, 3, 1, 2, 2, 2, 2, 3, 1, 1, 1, 1, 2, 2, 1, 3, 3, 1, 0, 0, 3…

The treatment variable of interest is denoted as mkt_email, indicating whether a customer received a marketing email. The outcome of importance is the purchase volume one month after receiving the email, represented as next_mnth_pv. Alongside these key variables, the dataset encompasses various covariates such as customer age, tenure (time elapsed since the first purchase on the website), and purchase history across different product categories.

Frisch–Waugh–Lovell Theorem

Please recall from last week that we can split the estimation into three steps: (1) a model for the exposure, (2) a model for the outcome, and (3) a residual-on-residual regression. The resulting estimate is numerically equivalent to the coefficient we would have obtained from a single linear regression modeling the conditional outcome.

Question: What is a residual?

Answer

The difference between an observed value and the value predicted by the fitted model.
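A quick base-R illustration with made-up data (not the tutorial data set):

```r
# Toy example: residuals are observed minus fitted values
set.seed(42)
x <- 1:10
y <- 2 * x + rnorm(10)

fit <- lm(y ~ x)

# resid() agrees with the manual computation
all.equal(unname(resid(fit)), unname(y - fitted(fit)))  # TRUE
```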

Q1. Estimate the treatment effect using the FWL theorem.

Code
# Frisch–Waugh–Lovell Theorem: 3-step procedure

# (1) Debiasing:
mod_D <- lm(mkt_email ~ ., data = select(email, -next_mnth_pv))
D_hat <- mod_D$residuals

# (2) Denoising:
mod_Y <- lm(next_mnth_pv ~ ., data = select(email, -mkt_email))
Y_hat <- mod_Y$residuals

# (3) Residual regression
mod_fwl <- lm(Y_hat ~ 0 + D_hat)
summary(mod_fwl)

Call:
lm(formula = Y_hat ~ 0 + D_hat)

Residuals:
   Min     1Q Median     3Q    Max 
-21000   -128     -4    121 115926 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
D_hat   1313.6       37.8    34.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1880 on 9999 degrees of freedom
Multiple R-squared:  0.108, Adjusted R-squared:  0.108 
F-statistic: 1.21e+03 on 1 and 9999 DF,  p-value: <2e-16

Double machine learning with partially linear regression

To better capture interactions and non-linearities, for example, flexible ML estimators can be a beneficial alternative to linear regression. In R, the package mlr3, along with the extension package mlr3learners, provides a wide range of common ML estimators and a flexible framework.

install.packages("mlr3")
install.packages("mlr3learners")

Just like the FWL theorem, Chernozhukov et al. (2018) propose a three-step procedure to estimate the \(ATE\):

  1. Form prediction model for the treatment: \(\hat{e}(\mathbf{X_i})\)

  2. Form prediction model for the outcome: \(\hat{\mu}(\mathbf{X_i})\)

  3. Run feasible residual-on-residual regression: \(\hat{\tau}_{\text{ATE}} = \arg \min_{\tilde{\tau}} \frac{1}{n}\sum_{i=1}^n ( Y_i - \hat{\mu}(\mathbf{X_i}) - \tilde{\tau} ( T_i - \hat{e}(\mathbf{X_i})))^2 = \frac{\sum_{i=1}^n (Y_i - \hat{\mu}(\mathbf{X_i})) (T_i - \hat{e}(\mathbf{X_i}))}{\sum_{i=1}^n (T_i - \hat{e}(\mathbf{X_i}))^2}\)
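As a quick numerical check with simulated residuals (toy data, not the tutorial data set), the minimizer in step (3) is just an OLS regression through the origin, which equals the closed-form ratio:

```r
# Toy check: step (3) via lm() vs. the closed-form ratio
set.seed(1)
D_res <- rnorm(100)                # stands in for T_i - e_hat(X_i)
Y_res <- 1.5 * D_res + rnorm(100)  # stands in for Y_i - mu_hat(X_i)

tau_ols   <- unname(coef(lm(Y_res ~ 0 + D_res)))  # OLS through the origin
tau_ratio <- sum(Y_res * D_res) / sum(D_res^2)    # closed-form ratio

all.equal(tau_ols, tau_ratio)  # TRUE
```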

Please note that no particular estimation method is specified for steps (1) and (2); only in step (3) is an OLS regression run on the residuals. Therefore, we load the packages and define which estimators we want to use.

Note

With as_task_classif() or as_task_regr(), we define what we want to estimate, which includes the data object and the estimation target, also known as the response or outcome. When the outcome is binary, we use a classification task; when it is continuous, we use a regression task. With lrn(), we specify how we want to estimate it.

# Load packages
library(mlr3)
library(mlr3learners)

# Prediction model for the treatment/exposure
task_e <- as_task_classif(email |> select(-next_mnth_pv), target = "mkt_email")
lrnr_e <- lrn("classif.log_reg", predict_type = "prob")

# Prediction model for the outcome
task_m <- as_task_regr(email |> select(-mkt_email), target = "next_mnth_pv")
lrnr_m <- lrn("regr.lm")

Using these estimators, we will populate the vectors e_hat and m_hat, which are estimates of the treatment propensity and the conditional outcome for each unit, respectively (step 1 and step 2).

# Initialize nuisance vectors
n <- nrow(email)
m_hat <- rep(NA, n)
e_hat <- rep(NA, n)

head(m_hat)
[1] NA NA NA NA NA NA

However, because we generally do not rely on structural assumptions for the estimation of nuisance parameters and ML methods are prone to overfitting, we need to perform cross-fitting: we split the sample into \(K\) folds, and for each fold \(k\), we train (or “fit”) a model on the remaining \(K-1\) folds. We then predict the nuisance parameters for units in fold \(k\) only with that model.

Question: What is overfitting?

Answer

A model that fits the training data too closely, capturing noise rather than signal, and therefore performs poorly on unseen test data.

Tip

Literature and methods from causal/statistical inference and machine learning often describe the same process with different terms. For example, causal/statistical inference mostly uses the term fitting, while in machine learning training is more prevalent. Other examples are covariates vs. features and outcome vs. target.

K-fold cross fitting with K = 5


For the sake of simplicity and demonstration, we select \(K=2\). Common values for smaller data sets are \(K=5\) or \(K=10\), while for larger data sets you will also often see \(K=20\). Generally, the choice of \(K\) involves a bias-variance trade-off, and these values have empirically been shown to perform well.
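For a general \(K\), the fold assignment can be sketched as follows (illustrative values only; below we proceed with \(K=2\)):

```r
# Sketch: assigning n units to K folds for cross-fitting
set.seed(1)
n_units <- 10
K <- 5
folds <- sample(rep(1:K, length.out = n_units))  # one random fold label per unit

# For fold k: train on all other folds, predict only units in fold k
k <- 1
train_ids   <- which(folds != k)
predict_ids <- which(folds == k)
```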

# Split sample
ids_s1 <- sample(1:n, n/2) # indices of sample 1
ids_s2 <- which(!1:n %in% ids_s1) # indices of sample 2

head(ids_s1)
[1] 1786   34  696  921 4240 5669

Now, we just have to use the right samples for training and prediction, as described above. Remember: a unit must never be predicted with a model that was trained on that same observation.

Note

In the mlr3 framework, we fit/train using the dollar operator and the method train(), which takes the task and the row indexes as arguments. To predict, we use the method predict(), which takes the same arguments. You might feel that the usage of the dollar operator and methods is a bit different from what we have seen before - and you’re right: it comes from the R6 system, a system for object-oriented programming in R.

# Iteration 1:
# Train: S1 - Predict: S2
# Y ~ X
lrnr_m$train(task_m, row_ids = ids_s1)
m_hat[ids_s2] <- lrnr_m$predict(task_m, row_ids = ids_s2)$response
# D ~ X
lrnr_e$train(task_e, row_ids = ids_s1)
e_hat[ids_s2] <- lrnr_e$predict(task_e, row_ids = ids_s2)$prob[, 2] # col 2 for value D = 1

Q2. Run the second iteration.

Code
# Iteration 2:
# Train: S2 - Predict: S1
# Y ~ X
lrnr_m$train(task_m, row_ids = ids_s2)
m_hat[ids_s1] <- lrnr_m$predict(task_m, row_ids = ids_s1)$response
# D ~ X
lrnr_e$train(task_e, row_ids = ids_s2)
e_hat[ids_s1] <- lrnr_e$predict(task_e, row_ids = ids_s1)$prob[, 2]

Having obtained our nuisance parameters, we can move to step 3 and estimate the residual-on-residual regression.

Q3a. Obtain the residuals and store them as Y_hat and D_hat.

Code
# Obtain outcome and treatment residuals
Y <- email$next_mnth_pv
D <- email$mkt_email
Y_hat <- Y - m_hat
D_hat <- D - e_hat

Q3b. Run the residual-on-residual regression.

Code
# Residual-on-residual regression
mod_plr <- lm(Y_hat ~ 0 + D_hat)
summary(mod_plr)

Call:
lm(formula = Y_hat ~ 0 + D_hat)

Residuals:
   Min     1Q Median     3Q    Max 
-22982   -148     -9    141 120927 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
D_hat   1317.6       39.2    33.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1960 on 9999 degrees of freedom
Multiple R-squared:  0.101, Adjusted R-squared:  0.101 
F-statistic: 1.13e+03 on 1 and 9999 DF,  p-value: <2e-16

We obtain our target parameter - the estimated effect of receiving an email on the purchase volume.

Double machine learning with augmented inverse probability weighting

Now we want to make use of augmented inverse probability weighting, i.e. combine outcome modeling with weighting by treatment probabilities. Again, Chernozhukov et al. (2018) propose a three-step procedure.

  1. Form a prediction model for the treatment and obtain the fitted values of the propensity scores:
    • \(\hat{e}_t(\mathbf{X_i})\)
  2. Form prediction models for the outcome and obtain the fitted values of the outcome regressions:
    • \(\hat{\mu}(t, \mathbf{X_i})\)
  3. Construct the doubly robust estimator:
    • \(\hat{\tau}_{\text{ATE}} = \hat{\mu}_{1} - \hat{\mu}_{0}\) with

    • \(\hat{\mu}_{1} = \frac{1}{n} \sum_{i=1}^n \left[\hat{\mu}(t = 1, \mathbf{X_i}) + \frac{\mathbb{1}(T_i = 1)(Y_i - \hat{\mu}(t = 1, \mathbf{X_i}))}{\hat{e}_1(\mathbf{X_i})}\right]\)

    • \(\hat{\mu}_{0} = \frac{1}{n} \sum_{i=1}^n \left[\hat{\mu}(t = 0, \mathbf{X_i}) + \frac{\mathbb{1}(T_i = 0)(Y_i - \hat{\mu}(t = 0, \mathbf{X_i}))}{\hat{e}_0(\mathbf{X_i})}\right]\) and

    • \(\hat{e}_0(\mathbf{X_i}) = 1 - \hat{e}_1(\mathbf{X_i})\)
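A small simulation (toy data, not the tutorial data set) illustrates why this estimator is called doubly robust: even with a deliberately misspecified outcome model (a constant zero), the estimate stays close to the true ATE as long as the propensity scores are correct.

```r
# Toy simulation of the double robustness property
set.seed(1)
n_sim <- 10000
X <- runif(n_sim)
e1 <- plogis(X)                   # true propensity scores
D <- rbinom(n_sim, 1, e1)
Y <- 2 * D + X + rnorm(n_sim)     # true ATE = 2

m1 <- m0 <- rep(0, n_sim)         # outcome models wrong on purpose
mu1_hat <- mean(m1 + D * (Y - m1) / e1)
mu0_hat <- mean(m0 + (1 - D) * (Y - m0) / (1 - e1))
mu1_hat - mu0_hat                 # close to 2
```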

Again, we initialize our nuisance vectors, but this time we need to differentiate between a model for the untreated and one for the treated outcomes, because we estimate the average potential outcome (\(APO\)) under each treatment status separately for all units.

# Initialize nuisance vectors
e_hat <- rep(NA,n)
m0_hat <- rep(NA,n) # untreated
m1_hat <- rep(NA,n) # treated

Therefore, we also need indices for whether a unit received the treatment or not.

# Treatment indices
ids_d0 <- which(email$mkt_email==0)
ids_d1 <- which(email$mkt_email==1)

Then, again using cross-fitting, because we do not rely on structural assumptions, we train and predict.

# Iteration 1:
# Train in S1, predict in S2
# D ~ X
lrnr_e$train(task_e, row_ids = ids_s1)
e_hat[ids_s2] <- lrnr_e$predict(task_e, row_ids = ids_s2)$prob[, 2]
# Y0 ~ X
lrnr_m$train(task_m, row_ids = intersect(ids_s1, ids_d0))
m0_hat[ids_s2] <- lrnr_m$predict(task_m, row_ids = ids_s2)$response
# Y1 ~ X
lrnr_m$train(task_m, row_ids = intersect(ids_s1, ids_d1))
m1_hat[ids_s2] <- lrnr_m$predict(task_m, row_ids = ids_s2)$response

Q4. Run the second iteration.

Code
# Iteration 2:
# Train in S2, predict in S1
# D ~ X
lrnr_e$train(task_e, row_ids = ids_s2)
e_hat[ids_s1] <- lrnr_e$predict(task_e, row_ids = ids_s1)$prob[, 2]
# Y0 ~ X
lrnr_m$train(task_m, row_ids = intersect(ids_s2, ids_d0))
m0_hat[ids_s1] <- lrnr_m$predict(task_m, row_ids = ids_s1)$response
# Y1 ~ X
lrnr_m$train(task_m, row_ids = intersect(ids_s2, ids_d1))
m1_hat[ids_s1] <- lrnr_m$predict(task_m, row_ids = ids_s1)$response

And finally, using the formulas from above, we obtain the potential outcomes for each unit. The ATE is obtained as the difference between the two.

Q5. Calculate the potential outcomes and the ATE.

Code
# Potential outcomes with AIPW
Y_0_PO <- m0_hat + (1-D) * (Y-m0_hat) / (1-e_hat)
Y_1_PO <- m1_hat + D * (Y-m1_hat) / e_hat

# ATE
Y_ate <- Y_1_PO - Y_0_PO

A trick to obtain the same statistical information as usual is to simply regress the variable Y_ate on the constant 1.

# Obtain statistical inference (same as t-test)
mod_aipw <- lm(Y_ate ~ 1)
summary(mod_aipw)

Call:
lm(formula = Y_ate ~ 1)

Residuals:
   Min     1Q Median     3Q    Max 
-19694   -309    -32    264 117421 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1336.1       19.2    69.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1920 on 9999 degrees of freedom
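To see why this trick works, here is a small sketch with simulated data (not the tutorial data set): the intercept-only regression reproduces the sample mean and the one-sample t-test.

```r
# Toy illustration: lm(y ~ 1) is equivalent to a one-sample t-test
set.seed(1)
y_sim <- rnorm(50, mean = 3)

mod <- summary(lm(y_sim ~ 1))
tt  <- t.test(y_sim)

all.equal(coef(mod)[1, "Estimate"], mean(y_sim))         # TRUE: intercept = sample mean
all.equal(coef(mod)[1, "t value"], unname(tt$statistic)) # TRUE: identical t statistic
```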

DoubleML

In the previous examples, we manually coded the estimation procedures, which involved splitting the sample and executing all steps ourselves. However, employing a function to automate these tasks offers numerous advantages. For instance, it enables easy adjustment of the number of folds \(K\), streamlines the process of changing the models for estimation, and significantly reduces the risk of errors by preventing confusion regarding which sample to use for training and prediction.

The R package DoubleML provides an implementation of the double / debiased machine learning framework and is built on top of the mlr3 framework we have already used. It also has a Python twin.

install.packages("DoubleML")

Slightly differently from before, we need to specify the data object first. The object contains the data frame, the outcome column y_col, and the treatment column d_cols.

# Load package
library(DoubleML)

# Specify data object
email_dml_data <- DoubleMLData$new(
  data = as.data.frame(email),
  y_col = "next_mnth_pv", 
  d_cols = "mkt_email"
  )

Partially linear regression

We choose the class DoubleMLPLR, which is short for “partially linear regression”, provide the data object, and specify which estimators to use. We can simply reuse the estimators from before. The methods fit() and summary() are self-explanatory.

# Specify task
dml_plr_obj <- DoubleMLPLR$new(
  data =  email_dml_data, # data object
  ml_l = lrnr_m, # outcome model
  ml_m = lrnr_e, # exposure model
  apply_cross_fitting = TRUE,
  n_folds = 10
  )

# Fit and return summary
dml_plr_obj$fit()
dml_plr_obj$summary()
Estimates and significance testing of the effect of target variables
          Estimate. Std. Error t value Pr(>|t|)    
mkt_email   1315.72       3.86     341   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Augmented inverse probability weighting

The steps are the same for our AIPW estimator, except that we use the class DoubleMLIRM and additionally need to specify the score argument, which we set to "ATE".

# Specify task
dml_aipw_obj = DoubleMLIRM$new(
  data = email_dml_data,
  ml_g = lrnr_m,
  ml_m = lrnr_e,
  score = "ATE",
  trimming_threshold = 0.01, # to prevent too extreme weights
  apply_cross_fitting = TRUE,
  n_folds = 10)

# Fit and return summary
dml_aipw_obj$fit()
dml_aipw_obj$summary()
Estimates and significance testing of the effect of target variables
          Estimate. Std. Error t value Pr(>|t|)    
mkt_email    1334.7       19.4    68.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Table of all estimates and standard errors
tbl_coef <- tibble(
  estimator = c(
    "PLR (by hand)",
    "AIPW (by hand)",
    "PLR (DoubleML)",
    "AIPW (DoubleML)"
  ),
  coef = c(
    broom::tidy(mod_plr)$estimate,
    broom::tidy(mod_aipw)$estimate,
    dml_plr_obj$coef,
    dml_aipw_obj$coef
  ),
  se   = c(
    broom::tidy(mod_plr)$std.error,
    broom::tidy(mod_aipw)$std.error,
    dml_plr_obj$se,
    dml_aipw_obj$se
  )
)

# Plot comparison
ggplot(tbl_coef, aes(x = estimator, y = coef, ymin = coef - 1.96*se, ymax = coef + 1.96*se)) +
  geom_errorbar(width = .5) +
  ylim(c(1100, 1400))

Assignment

Accept the Week 5 - Assignment and follow the same steps as last week and as described in the organization chapter.

For the purpose of introduction, we have just used logistic and linear regression. But as mentioned, using more flexible ML methods can have benefits.

  1. Use the commands from the DoubleML package but swap out the estimators with ML methods for

    • partially linear regression and

    • augmented inverse probability weighting.

You can find an overview of more ML-like estimators here. Please note that for some of the learners, you have to install the package mlr3extralearners, which is only available from GitHub via …

# You can ignore this for now.
install.packages("remotes")
remotes::install_github("mlr-org/mlr3extralearners")

load it via …

library(mlr3extralearners)

and, in some cases, install learners via …

  2. A few weeks later you are able to run an A/B test with perfectly randomized groups that either receive a marketing email or do not. Load email_rnd.csv, recall how to retrieve the \(ATE\) in randomized experiments, and report it.

  3. Compare the treatment effects from tasks 1 and 2 visually using ggplot and briefly discuss which estimate you trust the most.

  4. Explain why cross-fitting plays such a crucial role in double machine learning. What would happen if you left out the cross-fitting step?