Causal Inference One-to-one Matching on Confounders Using R
Using R Matching package for causal inference with Mahalanobis Distance Matching (MDM) in Google Colab notebook
Causal inference is the process of determining the effect of a treatment. The causal impact can be evaluated by randomized experiments or observational studies.
- A randomized experiment randomly separates the samples into the treatment group and the control group. The causal impact can be calculated by getting the difference between the treatment group and the control group.
- When only the observational data is available, we can use causal inference algorithms to calculate the causal impact. Such algorithms include but are not limited to difference-in-difference, matching, Inverse Probability Treatment Weighting (IPTW), and counterfactual modeling.
Causal inference has well-established packages in R, but not in Python. This tutorial provides an example of using R packages for causal analysis in a Python notebook.
- If you are a Python user, please follow the steps in the tutorial to set up the environment to run the R package MatchIt for propensity score matching. To learn more about how to run R in a Python notebook, please check out my previous tutorial How to Use R with Google Colab Notebook.
- If you are an R user, please ignore the python code and use the R code directly. R code are the ones starting with
%R
or%%R
.
In this tutorial, we will talk about how to use Mahalanobis Distance Matching (MDM) for causal inference using the R package Matching
. You will learn:
- How to use R packages in a notebook for Python users?
- How do the matching algorithms work?
- How to do Mahalanobis Distance Matching (MDM) using R?
- How to do causal inference using matched data?
- How much improvement in accuracy is made by causal inference?
Resources for this post:
- Video tutorial for this post on YouTube
- Python code is at the end of the post. Click here for the Colab notebook.
- More video tutorials on Causal Inference
- More blog posts on Causal Inference
Let’s get started!
Step 0: Matching Algorithms
Matching takes a sample in the treatment group and finds a similar sample in the non-treatment group. The goal of matching is to create the treatment group and the control group that are comparable.
The matching process involves the following steps:
- Compute a distance between the subject in the treatment group and the samples in the control group. The distance can be calculated using the confounders or using the propensity scores. In this tutorial, we will do the matching on confounders.
- Match subjects in the treatment and control group using the shortest distance. There are different ways of calculating distances, we will use the Mahalanobis Distance Matching (MDM) in this tutorial. It is similar to the Euclidean distance. The difference is that Mahalanobis Distance Matching (MDM) uses the standardized data while Euclidean distance uses the original data.
- Define a caliper as the maximum distance threshold we are willing to accept.
- A small caliper means a small distance threshold, a better balance between treatment and control group, a smaller number of matched pairs, and we get results with less bias and more variance.
- A large caliper means a large distance threshold, a worse balance between treatment and control group, a larger number of matched pairs and we get results with more bias and less variance.
Step 1: Install and Import Libraries
In step 1, we will install and import libraries.
Firstly, let’s install dowhy for dataset creation.
# Install dowhy
!pip install dowhy
Then, we activate the R magic command in the Google Colab notebook using rpy2 and install the R packages tableone and Matching. This tutorial uses both Python and R in the same Colab notebook. To get details on using R with Colab, please check out my tutorial How to Use R with Google Colab Notebook.
# activate R magic
%load_ext rpy2.ipython
# Install R package tableone
%R install.packages('tableone')
# Install R package Matching
%R install.packages('Matching')
After the installation is completed, we can import the libraries.
- The
datasets
is imported fromdowhy
for dataset creation. pandas
andnumpy
are imported for data processing.- Packages from
rpy2
are imported for Python to R conversion. - Then we imported the two R libraries,
tableone
andMatching
.
# Package to create synthetic data for causal inference
from dowhy import datasets
# Data processing
import pandas as pd
import numpy as np
# Python to R conversion
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
# Import R libraries
tableone = importr('tableone')
Matching = importr('Matching')
# Read R libraries
%R library(tableone)
%R library(Matching)
Step 2:Create Dataset
In step 2, we will create a synthetic dataset for the causal inference.
- Firstly, we set a random seed using
np.random.seed
to make the dataset reproducible. - Then a dataset with the true causal impact of 10, four confounders, 10,000 samples, a binary treatment variable, and a continuous outcome variable is created.
- After that, we created a dataframe for the data. In the dataframe, the columns W0, W1, W2, and W3 are the four confounders, v0 is the treatment indicator, and y is the outcome.
# Set random seed
np.random.seed(42)
# Create a synthetic dataset
data = datasets.linear_dataset(
beta=10,
num_common_causes=4,
num_samples=10000,
treatment_is_binary=True,
outcome_is_binary=False)
# Create Dataframe
df = data['df']
# Take a look at the data
df.head()
Next, let’s rename v0
to treatment
, rename y
to outcome
, and convert the boolean values to 0s and 1s.
# Rename columns
df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1)
# Create the treatment variable, and change boolean values to 1 and 0
df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0)
# Take a look at the data
df.head()
Finally, the Python dataframe is converted to R dataframe in order to use R packages on the dataset.
# Convert Python dataframe to R dataframe
with localconverter(ro.default_converter + pandas2ri.converter):
dfr = ro.conversion.py2rpy(df)
# Create a variable name in R's global environment
from rpy2.robjects import globalenv
globalenv['dfr'] = dfr
Step 3: Unadjusted Differences
In step 3, we will get the direct difference between the treatment group and the non-treatment group.
We can see that for the group that received the treatment, the average outcome is about 14, while for the group that did not receive the treatment, the average outcome is about -2.
# Get the average outcome for treatment group and non-treatment group
df.groupby('treatment')['outcome'].mean()
Output
treatment
0 -2.191159
1 13.940382
Name: outcome, dtype: float64
If we do not use causal inference and take the direct difference between the two groups, we will draw the conclusion that the treatment increased the outcome by 16, which is much higher than the true causal impact of 10.
# Print the direct difference
print('The direct differenc is', df.groupby('treatment')['outcome'].mean()[1] -df.groupby('treatment')['outcome'].mean()[0])
The direct differenc is 16.131541137888554
Step 4: Standardized Mean Difference of Confounders
In step 4, we will check the standardized mean difference (SMD) between the treatment group and the non-treatment group.
var=confounders
means that we will summarize the data for the four confounders.strata='treatment'
means that the variables will be grouped by the treatment type.test=TRUE
means the p-value for the significance testing will be reported.smd=TRUE
means that the standardized mean difference will be reported.
%%R
# Confounders
confounders <- c('W0', 'W1', 'W2', 'W3')
# Create table one, test=TRUE means with significance testing
table1 <- CreateTableOne(var=confounders, strata='treatment', data=dfr, test=TRUE)
# Print the results
print(table1, smd=TRUE)
Standardized Mean Differences(SMD) greater than 0.1 is usually used to check the imbalance between the treatment and the control group. We can see that most of the confounders have SMD greater than 0.1 with a significant p-value, meaning that the confounders are different between the treatment group and the non-treatment group.
Stratified by treatment
0 1 p test SMD
n 2269 7731
W0 (mean (SD)) -0.78 (0.94) -0.09 (0.97) <0.001 0.724
W1 (mean (SD)) -0.18 (0.68) 1.21 (0.83) <0.001 1.825
W2 (mean (SD)) 0.12 (1.00) 0.55 (0.99) <0.001 0.435
W3 (mean (SD)) 0.13 (0.97) 0.22 (1.01) <0.001 0.095
Step 5: Greedy Matching on Confounders Using Mahalanobis Distance
In step 5, we will perform Mahalanobis Distance Matching (MDM) on the dataset using Match
in the R Matching
package.
Tr
takes in the treatment indicator column.M=1
means one record in the treatment group is matched with one record in the control group.X
takes the list of confounders to match on.Match
returns objectsindex.treated
andindex.control
to reconstruct the matched dataset.
%%R
# Greedy match on Mahalanobis distance
greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders])
matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),]
# Matched data
matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE)
print(matchedtab1_mdm, smd=TRUE)
Output
Stratified by treatment
0 1 p test SMD
n 7731 7731
W0 (mean (SD)) -0.33 (0.89) -0.09 (0.97) <0.001 0.260
W1 (mean (SD)) 0.70 (0.57) 1.21 (0.83) <0.001 0.709
W2 (mean (SD)) 0.40 (0.93) 0.55 (0.99) <0.001 0.159
W3 (mean (SD)) 0.20 (0.95) 0.22 (1.01) 0.152 0.023
TableOne results show that after matching, the test and control are still significantly different for some confounders, so we need to add a caliper.
A caliper is the distance threshold for any match. The matched pairs with a distance greater than the caliper distance will be dropped.
caliper=.2
means that matches greater than 20% of the standard deviations of each confounder will be dropped.
%%R
# Greedy match
greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders], caliper=0.2)
matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),]
# Matched data
matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE)
print(matchedtab1_mdm, smd=TRUE)
We can see that after adding the caliper, the Standardized Mean Distance (SMD) decreased and the p-values are not significant, meaning that the matched treatment and control group are not significantly different from each other for all the confounders.
Stratified by treatment
0 1 p test SMD
n 659 659
W0 (mean (SD)) -0.43 (0.72) -0.41 (0.72) 0.685 0.022
W1 (mean (SD)) 0.33 (0.45) 0.37 (0.45) 0.171 0.075
W2 (mean (SD)) 0.26 (0.73) 0.27 (0.73) 0.783 0.015
W3 (mean (SD)) 0.07 (0.71) 0.08 (0.72) 0.849 0.010
Step 6: Causal Inference Using Mahalanobis Distance Matching
In step 6, we will calculate the causal impact on the matched dataset. Firstly, we get the outcomes for the matched treatment samples and the matched control samples. Then a paired t-test is performed to check if the mean difference is equal to 0.
%%R
# Get the matched treatement outcomes
y_treatment_mdm <- matched_mdm$outcome[matched_mdm$treatment==1]
# Get the matched control outcomes
y_control_mdm <- matched_mdm$outcome[matched_mdm$treatment==0]
# t-test
t.test(y_treatment_mdm,
y_control_mdm,
paired=TRUE,
conf.level=0.95)
From the output of the paired t-test, we can see that:
- The p-value is close to 0, meaning that the outcome mean of the treatment group is different from the outcome mean of the control group.
- The 95% confidence interval is between 10.12 and 10.21.
- The outcome mean difference is 10.1649, which is very close to the true causal impact of 10. Compared with the direct difference without matching of 16.13, the Mahalanobis Distance Matching (MDM) increased the accuracy by about 60%.
Paired t-test
data: y_treatment_mdm and y_control_mdm
t = 445.89, df = 658, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
10.12014 10.20966
sample estimates:
mean difference
10.1649
Summary
In this tutorial, we talked about how to use Mahalanobis Distance Matching (MDM) for causal inference using the R package Matching
. You learned:
- How do the matching algorithms work?
- How to do Mahahanois Distance Matching (MDM) using R?
- How to do causal inference using matched data?
- How much improvement in accuracy is made by causal inference?
More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.