avatarAmy @GrabNGoInfo

Summary

This context provides a tutorial on using the R Matching package for causal inference with Mahalanobis Distance Matching (MDM) in a Google Colab notebook.

Abstract

The tutorial begins by explaining the concept of causal inference and the need for causal inference algorithms when only observational data is available. It then introduces the R Matching package and its use for Mahalanobis Distance Matching (MDM) in a Google Colab notebook. The tutorial covers the installation and import of required libraries, the creation of a synthetic dataset for causal inference, and the use of the Matching package for MDM. The tutorial also provides an example of how to use the package to perform MDM on a dataset and interpret the results. The tutorial concludes by recommending additional resources for further learning on the topic.

Bullet points

  • Causal inference is the process of determining the effect of a treatment.
  • Causal inference algorithms are used when only observational data is available.
  • The R Matching package is introduced for Mahalanobis Distance Matching (MDM) in a Google Colab notebook.
  • The tutorial covers the installation and import of required libraries.
  • A synthetic dataset is created for causal inference.
  • The Matching package is used for MDM on the dataset.
  • The tutorial provides an example of how to interpret the results of MDM.
  • Additional resources are recommended for further learning on the topic.

Causal Inference One-to-one Matching on Confounders Using R

Using R Matching package for causal inference with Mahalanobis Distance Matching (MDM) in Google Colab notebook

Photo by elen aivali on Unsplash

Causal inference is the process of determining the effect of a treatment. The causal impact can be evaluated by randomized experiments or observational studies.

  • A randomized experiment randomly separates the samples into the treatment group and the control group. The causal impact can be calculated by getting the difference between the treatment group and the control group.
  • When only the observational data is available, we can use causal inference algorithms to calculate the causal impact. Such algorithms include but are not limited to difference-in-difference, matching, Inverse Probability Treatment Weighting (IPTW), and counterfactual modeling.

Causal inference has well-established packages in R, but not in Python. This tutorial provides an example of using R packages for causal analysis in a Python notebook.

  • If you are a Python user, please follow the steps in the tutorial to set up the environment to run the R package MatchIt for propensity score matching. To learn more about how to run R in a Python notebook, please check out my previous tutorial How to Use R with Google Colab Notebook.
  • If you are an R user, please ignore the python code and use the R code directly. R code are the ones starting with %R or %%R.

In this tutorial, we will talk about how to use Mahalanobis Distance Matching (MDM) for causal inference using the R package Matching. You will learn:

  • How to use R packages in a notebook for Python users?
  • How do the matching algorithms work?
  • How to do Mahalanobis Distance Matching (MDM) using R?
  • How to do causal inference using matched data?
  • How much improvement in accuracy is made by causal inference?

Resources for this post:

Let’s get started!

Step 0: Matching Algorithms

Matching takes a sample in the treatment group and finds a similar sample in the non-treatment group. The goal of matching is to create the treatment group and the control group that are comparable.

The matching process involves the following steps:

  1. Compute a distance between the subject in the treatment group and the samples in the control group. The distance can be calculated using the confounders or using the propensity scores. In this tutorial, we will do the matching on confounders.
  2. Match subjects in the treatment and control group using the shortest distance. There are different ways of calculating distances, we will use the Mahalanobis Distance Matching (MDM) in this tutorial. It is similar to the Euclidean distance. The difference is that Mahalanobis Distance Matching (MDM) uses the standardized data while Euclidean distance uses the original data.
  3. Define a caliper as the maximum distance threshold we are willing to accept.
  • A small caliper means a small distance threshold, a better balance between treatment and control group, a smaller number of matched pairs, and we get results with less bias and more variance.
  • A large caliper means a large distance threshold, a worse balance between treatment and control group, a larger number of matched pairs and we get results with more bias and less variance.

Step 1: Install and Import Libraries

In step 1, we will install and import libraries.

Firstly, let’s install dowhy for dataset creation.

# Install dowhy
!pip install dowhy

Then, we activate the R magic command in the Google Colab notebook using rpy2 and install the R packages tableone and Matching. This tutorial uses both Python and R in the same Colab notebook. To get details on using R with Colab, please check out my tutorial How to Use R with Google Colab Notebook.

# activate R magic
%load_ext rpy2.ipython
# Install R package tableone
%R install.packages('tableone')
# Install R package Matching
%R install.packages('Matching')

After the installation is completed, we can import the libraries.

  • The datasets is imported from dowhy for dataset creation.
  • pandas and numpy are imported for data processing.
  • Packages from rpy2 are imported for Python to R conversion.
  • Then we imported the two R libraries, tableone and Matching.
# Package to create synthetic data for causal inference
from dowhy import datasets
# Data processing
import pandas as pd
import numpy as np
# Python to R conversion
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
# Import R libraries
tableone = importr('tableone')
Matching = importr('Matching')
# Read R libraries
%R library(tableone)
%R library(Matching)

Step 2:Create Dataset

In step 2, we will create a synthetic dataset for the causal inference.

  • Firstly, we set a random seed using np.random.seed to make the dataset reproducible.
  • Then a dataset with the true causal impact of 10, four confounders, 10,000 samples, a binary treatment variable, and a continuous outcome variable is created.
  • After that, we created a dataframe for the data. In the dataframe, the columns W0, W1, W2, and W3 are the four confounders, v0 is the treatment indicator, and y is the outcome.
# Set random seed
np.random.seed(42)
# Create a synthetic dataset
data = datasets.linear_dataset(
    beta=10,
    num_common_causes=4,
    num_samples=10000,
    treatment_is_binary=True,
    outcome_is_binary=False)
# Create Dataframe
df = data['df']
# Take a look at the data
df.head()
Causal Inference Dataset — GrabNGoInfo.com

Next, let’s rename v0 to treatment, rename y to outcome, and convert the boolean values to 0s and 1s.

# Rename columns
df = df.rename({'v0': 'treatment', 'y': 'outcome'}, axis=1)
# Create the treatment variable, and change boolean values to 1 and 0
df['treatment'] = df['treatment'].apply(lambda x: 1 if x == True else 0)
# Take a look at the data
df.head()

Finally, the Python dataframe is converted to R dataframe in order to use R packages on the dataset.

# Convert Python dataframe to R dataframe
with localconverter(ro.default_converter + pandas2ri.converter):
  dfr = ro.conversion.py2rpy(df)
# Create a variable name in R's global environment
from rpy2.robjects import globalenv
globalenv['dfr'] = dfr

Step 3: Unadjusted Differences

In step 3, we will get the direct difference between the treatment group and the non-treatment group.

We can see that for the group that received the treatment, the average outcome is about 14, while for the group that did not receive the treatment, the average outcome is about -2.

# Get the average outcome for treatment group and non-treatment group
df.groupby('treatment')['outcome'].mean()

Output

treatment
0    -2.191159
1    13.940382
Name: outcome, dtype: float64

If we do not use causal inference and take the direct difference between the two groups, we will draw the conclusion that the treatment increased the outcome by 16, which is much higher than the true causal impact of 10.

# Print the direct difference
print('The direct differenc is', df.groupby('treatment')['outcome'].mean()[1] -df.groupby('treatment')['outcome'].mean()[0])
The direct differenc is 16.131541137888554

Step 4: Standardized Mean Difference of Confounders

In step 4, we will check the standardized mean difference (SMD) between the treatment group and the non-treatment group.

  • var=confounders means that we will summarize the data for the four confounders.
  • strata='treatment' means that the variables will be grouped by the treatment type.
  • test=TRUE means the p-value for the significance testing will be reported.
  • smd=TRUE means that the standardized mean difference will be reported.
%%R 
# Confounders
confounders <- c('W0', 'W1', 'W2', 'W3')
# Create table one, test=TRUE means with significance testing
table1 <- CreateTableOne(var=confounders, strata='treatment', data=dfr, test=TRUE)
# Print the results
print(table1, smd=TRUE)

Standardized Mean Differences(SMD) greater than 0.1 is usually used to check the imbalance between the treatment and the control group. We can see that most of the confounders have SMD greater than 0.1 with a significant p-value, meaning that the confounders are different between the treatment group and the non-treatment group.

Stratified by treatment
                 0            1            p      test SMD   
  n               2269         7731                          
  W0 (mean (SD)) -0.78 (0.94) -0.09 (0.97) <0.001       0.724
  W1 (mean (SD)) -0.18 (0.68)  1.21 (0.83) <0.001       1.825
  W2 (mean (SD))  0.12 (1.00)  0.55 (0.99) <0.001       0.435
  W3 (mean (SD))  0.13 (0.97)  0.22 (1.01) <0.001       0.095

Step 5: Greedy Matching on Confounders Using Mahalanobis Distance

In step 5, we will perform Mahalanobis Distance Matching (MDM) on the dataset using Match in the R Matching package.

  • Tr takes in the treatment indicator column.
  • M=1 means one record in the treatment group is matched with one record in the control group.
  • X takes the list of confounders to match on.
  • Match returns objects index.treated and index.control to reconstruct the matched dataset.
%%R 
# Greedy match on Mahalanobis distance
greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders])
matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),]
# Matched data
matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE)
print(matchedtab1_mdm, smd=TRUE)

Output

Stratified by treatment
                 0            1            p      test SMD   
  n               7731         7731                          
  W0 (mean (SD)) -0.33 (0.89) -0.09 (0.97) <0.001       0.260
  W1 (mean (SD))  0.70 (0.57)  1.21 (0.83) <0.001       0.709
  W2 (mean (SD))  0.40 (0.93)  0.55 (0.99) <0.001       0.159
  W3 (mean (SD))  0.20 (0.95)  0.22 (1.01)  0.152       0.023

TableOne results show that after matching, the test and control are still significantly different for some confounders, so we need to add a caliper.

A caliper is the distance threshold for any match. The matched pairs with a distance greater than the caliper distance will be dropped.

caliper=.2 means that matches greater than 20% of the standard deviations of each confounder will be dropped.

%%R 
# Greedy match
greedymatch_mdm <- Match(Tr=dfr[['treatment']], M=1, X=dfr[confounders], caliper=0.2)
matched_mdm <- dfr[unlist(greedymatch_mdm[c('index.treated', 'index.control')]),]
# Matched data
matchedtab1_mdm <- CreateTableOne(vars=confounders, strata='treatment', data=matched_mdm, test = TRUE)
print(matchedtab1_mdm, smd=TRUE)

We can see that after adding the caliper, the Standardized Mean Distance (SMD) decreased and the p-values are not significant, meaning that the matched treatment and control group are not significantly different from each other for all the confounders.

Stratified by treatment
                 0            1            p      test SMD   
  n                659          659                          
  W0 (mean (SD)) -0.43 (0.72) -0.41 (0.72)  0.685       0.022
  W1 (mean (SD))  0.33 (0.45)  0.37 (0.45)  0.171       0.075
  W2 (mean (SD))  0.26 (0.73)  0.27 (0.73)  0.783       0.015
  W3 (mean (SD))  0.07 (0.71)  0.08 (0.72)  0.849       0.010

Step 6: Causal Inference Using Mahalanobis Distance Matching

In step 6, we will calculate the causal impact on the matched dataset. Firstly, we get the outcomes for the matched treatment samples and the matched control samples. Then a paired t-test is performed to check if the mean difference is equal to 0.

%%R
# Get the matched treatement outcomes
y_treatment_mdm <- matched_mdm$outcome[matched_mdm$treatment==1]
# Get the matched control outcomes
y_control_mdm <- matched_mdm$outcome[matched_mdm$treatment==0]
# t-test
t.test(y_treatment_mdm,
       y_control_mdm,
       paired=TRUE,
       conf.level=0.95)

From the output of the paired t-test, we can see that:

  • The p-value is close to 0, meaning that the outcome mean of the treatment group is different from the outcome mean of the control group.
  • The 95% confidence interval is between 10.12 and 10.21.
  • The outcome mean difference is 10.1649, which is very close to the true causal impact of 10. Compared with the direct difference without matching of 16.13, the Mahalanobis Distance Matching (MDM) increased the accuracy by about 60%.
Paired t-test
data:  y_treatment_mdm and y_control_mdm
t = 445.89, df = 658, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 10.12014 10.20966
sample estimates:
mean difference 
        10.1649

Summary

In this tutorial, we talked about how to use Mahalanobis Distance Matching (MDM) for causal inference using the R package Matching. You learned:

  • How do the matching algorithms work?
  • How to do Mahahanois Distance Matching (MDM) using R?
  • How to do causal inference using matched data?
  • How much improvement in accuracy is made by causal inference?

More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.

Recommended Tutorials

References

[1] DoWhy Documentation

[2] rpy2 R and pandas Dataframe Documentation

[3] Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010 Feb 1;25(1):1–21. doi: 10.1214/09-STS313. PMID: 20871802; PMCID: PMC2943670.

[4] R MatchIt Documentation

[5] Coursera course: A Crash Course in Causality

[6] R Matching Documentation on Match

[7] R CreateTableOne Documentation

Causal Inference
Causality
Mahalanobis
Matching
R
Recommended from ReadMedium