CAUSAL DATA SCIENCE
DAGs and Control Variables
How to select control variables for causal inference using Directed Acyclic Graphs

When analyzing causal relationships, it is very hard to understand which variables to condition the analysis on, i.e. how to “split” the data so that we are comparing apples to apples. For example, if you want to understand the effect of having a tablet on a student’s performance, it makes sense to compare schools where students have similar socio-economic backgrounds. Otherwise, the risk is that only wealthier students can afford a tablet and, without controlling for it, we might attribute the effect to tablets instead of to the socio-economic background.
When the treatment of interest comes from a proper randomized experiment, we do not need to worry about conditioning on other variables. If tablets are distributed randomly across schools, and we have enough schools in the experiment, we do not have to worry about the socio-economic background of students. The only advantage of conditioning the analysis on some so-called “control variable” could be an increase in power. However, this is a different story.
In this post, we are going to have a brief introduction to Directed Acyclic Graphs and how they can be useful to select variables to condition a causal analysis on. Not only do DAGs provide visual intuition on which variables we need to include in the analysis, but also which variables we should not include, and why.
Directed Acyclic Graphs
Definitions
Directed acyclic graphs (DAGs) provide a visual representation of the data generating process. Random variables are represented with letters (e.g. X) and causal relationships are represented with arrows (e.g. →). For example, we interpret

as X (possibly) causes Y. We call a path any connection between two variables X and Y, independently of the direction of the arrows. If all arrows point forward, we call it a causal path, otherwise we call it a spurious path.

In the example above, we have a path between X and Y passing through the variables Z₁, Z₂, and Z₃. Since not all arrows point forward, the path is spurious and there is no causal relationship of X on Y. In fact, variable Z₂ is caused by both Z₁ and Z₃ and therefore blocks the path.
Z₂ is called a collider.
The purpose of our analysis is to assess the causal relationship between two variables X and Y. Directed acyclic graphs are useful because they provide us instructions on which other variables Z we need to condition our analysis on. Conditioning the analysis on a variable means that we keep it fixed and we draw our conclusions ceteris paribus. For example, in a linear regression framework, inserting another regressor Z means that we are computing the best linear approximation of the conditional expectation function of Y given X, conditional on the observed values of Z.
Causality
In order to assess causality, we want to close all spurious paths between X and Y. The questions now are:
- When is a path open? If it does not contain colliders. Otherwise, it is closed.
- How do you close an open path? You condition on at least one intermediate variable.
- How do you open a closed path? You condition on all colliders along the path.
Suppose we are again interested in the causal relationship of X on Y. Let’s consider the following graph:

In this case, apart from the direct path, there are three non-direct paths between X and Y through the variables Z₁, Z₂ and Z₃.
Let’s consider the case in which we analyze the relationship between X and Y, ignoring all other variables.
- The path through Z₁ is open but it is spurious
- The path through Z₂ is open and causal
- The path through Z₃ is closed since Z₃ is a collider and it is spurious
Let’s draw the same graph indicating in gray variables that we are conditioning on, with dotted lines closed paths, with red lines spurious open paths, and with green lines causal open paths.

In this case, to assess the causal relationship between X and Y we need to close the path that passes through Z₁. We can do that by conditioning the analysis on Z₁.

Now we are able to recover the causal relationship between X and Y by conditioning on Z₁.
What would happen if we were also conditioning on Z₂? In this case, we would close the path passing through Z₂ leaving only the direct path between X and Y open. We would then recover only the direct effect of X on Y and not the indirect one.

What would happen if we were also conditioning on Z₃? In this case, we would open the path passing through Z₃ which is a spurious path. We would then not be able to recover the causal effect of X on Y.

Example: Class Size and Math Scores
Suppose you are interested in the effect of class size on math scores. Are bigger classes better or worse for students’ performance?
Assume that the data generating process can be represented with the following DAG.

The variables of interest are highlighted. Moreover, the dotted line around ability indicates that this is a variable that we do not observe in the data.
We can now load the data and check what it looks like. The data is simulated and you can find the original data generating process here.
from src.dgp import dgp_schooldf = dgp_school().generate_data()
df.head()
What variables should we condition our regression on, in order to estimate the causal effect of class size on math scores?
First of all, let's look at what happens if we do not condition our analysis on any variable and we just regress math score on class size. We use the statsmodels package for the statistical analysis.
import statsmodels.formula.api as smfsmf.ols('math_score ~ class_size', df).fit().summary().tables[1]
In the second row of the table, we can see the estimated coefficient of interest (coef of class size) and its estimated standard error (std err). The effect of class_size is negative and statistically different from zero (has a very low p-value, P>|t|=0.003).
But should we believe this estimated effect? Without controlling for anything, this is the DAG representation of the effect we are capturing.

There is a spurious path passing through good school that biases our estimated coefficient. Intuitively, being enrolled in a better school improves the students’ math scores and better schools might have smaller class sizes. We need to control for the quality of the school.
smf.ols('math_score ~ class_size + good_school', df).fit().summary().tables[1]
Now the estimate of the effect of class size on math score is unbiased! Indeed, the true coefficient in the data generating process was 0.2.

What would happen if we were to instead control for all variables?
smf.ols('math_score ~ class_size + good_school + math_hours + class_year + hist_score', df).fit().summary().tables[1]
The coefficient is again biased. Why?
We have opened a new spurious path by controlling for hist score. In fact, hist score is a collider and controlling for it has opened a path through hist score and ability that was otherwise closed.

The example was inspired by the following tweet.





