Deeper Data-Science with DoWhy: Causal Modeling in Python

Learn how causal inference with DoWhy goes beyond prediction to answer 'what if we intervene?' questions. This tutorial uses a student attendance example to demonstrate the difference between correlation and causation in data science.


Old ML was About Prediction; New AI is About Action

Old ML was about finding associations in data and making more accurate predictions.

But no matter how much data you collect - it is all history.

New things happen at a higher level - we must intervene, doing something with a particular effect in mind.

Intervention requires a causal model - an understanding of which factors lead to which other factors, which in turn lead to further factors, and so on.

Judea Pearl's Ladder of Causation

The three levels of causal understanding are:

  1. Level 1: Seeing (association)
  2. Level 2: Doing (intervention)
  3. Level 3: Imagining (counterfactuals)

An Example

Consider an example that may feel especially relevant to a CS teacher.

Observation (rung 1):

-> Students who attend more lectures score higher.

This is a correlation. Any standard ML model can learn this very well.

Now we ask a policy/action question (rung 2):

-> If we force students to attend more lectures, will scores improve?

From a purely correlational viewpoint, one might be tempted to say “yes.”

But that inference is unjustified.

Why? Because the original association may be driven by confounders:

  1. Motivation
  2. Prior ability
  3. Discipline
  4. Home environment

Students who attend more lectures may already be the ones inclined to do well. Forcing attendance could:

  1. Have little effect
  2. Backfire by reducing motivation
  3. Work for some groups and harm others

No amount of additional observational data from the same system resolves this ambiguity. You can add more features, more demographics, more regions -- and you are still estimating correlations under the same process.

I think the core point Judea Pearl makes is that more data alone cannot answer the interventional question, “What if we intervene?”

A causal model changes the game because it:

  • Explicitly represents assumptions about how variables influence one another
  • Separates surface correlation from the underlying mechanisms
  • Lets you simulate interventions (“force attendance”) rather than observations (“see attendance”)

In practice, this means combining:

  • Data
  • Domain knowledge
  • Hypotheses about causal structure
  • And real interventions or natural experiments

Only then can we reason about which actions are likely to work, for whom, and under what conditions.

Deep learning is compelling at rung 1.

But moving to rung 2 requires a different kind of knowledge, not just more data or larger models.

From Correlation to Causation: a Gentle Introduction Using DoWhy

Imagine you are teaching a computer science class.

You look at your data and notice something that feels obvious:

Students who attend more lectures tend to score higher on exams.

This feels intuitive.
It feels actionable.
It feels like something we should do something about.

So the natural next thought is:

If we force students to attend more lectures, their scores should improve.

This tutorial explains why that conclusion does not follow, even though the data seems to support it — and how causal inference (using DoWhy) helps us reason correctly.

The difference between observing and acting

Let’s slow down and separate two very different questions.

Observation question:

Among students who attend more lectures, do scores tend to be higher?

This is a question about patterns in the data. Machine learning is extremely good at answering this.

Action (policy) question:

If we force students to attend more lectures, will their scores improve?

This is a question about changing the world. This is not something standard ML is designed to answer.

The entire tutorial exists because these two questions are not the same.

Why the obvious conclusion may be wrong

To understand the problem, we need to think about why some students attend more lectures in the first place.

Common-sense reasons:

  • They are more motivated
  • They have higher prior ability
  • They are more disciplined
  • They have a better home environment

Now notice something important:

These same factors also affect exam scores.

So when we see:

High attendance → high score

we might actually be seeing:

High motivation → high attendance
High motivation → high score

Attendance might just be a signal, not the cause.

Confounders explained in plain language

A confounder is a variable that:

  • influences the thing you want to change, and
  • influences the outcome you care about

In this example:

  • Motivation affects attendance
  • Motivation affects score

That makes motivation a confounder.

Here is what that looks like visually.

motivation → attendance
motivation → score
attendance → score

This diagram is not something we learn from data.
It is something we assume based on domain knowledge.

This explicit statement of assumptions is the core of causal reasoning.

Why “just collecting more data” does not solve this

A very common reaction is:

“Okay, let’s collect more data and control for everything.”

But notice:

  • You are still observing the same system
  • Students are still selecting themselves into attendance
  • Confounders still exist

More data makes correlations more precise.
It does not turn them into causal answers.

This is a key idea emphasized by Judea Pearl:

Observational data alone cannot answer interventional questions.

What causal inference adds

Causal inference does not magically extract causation from data.

Instead, it forces you to:

  • State your assumptions clearly
  • Check whether your question is answerable under those assumptions
  • Separate what you assume from what the data says

The DoWhy library is built around enforcing this discipline.

From this point onward, we are no longer interested in prediction accuracy.
We are interested in answering a single causal question:

What happens to exam scores if we intervene and increase attendance?

Everything that follows exists only to answer this question honestly.

What is DoWhy?

DoWhy is a Python library that enforces a simple rule:

Every causal analysis must go through the same four steps.

  • Model: State your causal assumptions
  • Identify: Check whether the causal effect can be computed
  • Estimate: Compute the effect
  • Refute: Stress-test the result

If you skip the first step, DoWhy will not proceed.

This is intentional.

Installing the required tools

You will need the following Python packages:

pip install dowhy statsmodels graphviz pydot pygraphviz

You also need the Graphviz system binary:

  • Linux: sudo apt install graphviz
  • macOS: brew install graphviz
  • Windows: install from https://graphviz.org

This allows us to render causal graphs.

Defining the problem in terms friendly to code

Before writing any code, let’s name everything clearly.

Variable name   Meaning
attendance      number of lectures attended
score           exam score
motivation      internal drive
ability         prior academic ability
discipline      study habits
home_env        home support environment

These names are intentionally explicit and boring; we prefer clarity because it eases understanding.

Writing down the causal assumptions explicitly

Now we translate our earlier reasoning into a formal causal graph.

Causal graph in Python (pygraphviz)

from pygraphviz import AGraph
from IPython.display import Image, display

def draw_graph(graph):
    graph.layout(prog='dot')
    display(Image(graph.draw(format='png')))

graph = AGraph(directed=True)

edges = [
    ("motivation", "attendance"),
    ("ability", "attendance"),
    ("discipline", "attendance"),
    ("home_env", "attendance"),
    ("motivation", "score"),
    ("ability", "score"),
    ("discipline", "score"),
    ("home_env", "score"),
    ("attendance", "score"),
]

graph.add_edges_from(edges)
draw_graph(graph)

The rendered graph shows each trait (motivation, ability, discipline, home_env) pointing into both attendance and score, plus a direct edge from attendance to score.

This graph encodes what we believe about the world, not what the data tells us.

If these assumptions are wrong, the conclusions will be wrong -- and that is an honest failure mode.

Why we use synthetic data in this tutorial

For beginners, real-world data is confusing because:

  • The true causal effect is unknown
  • Many things happen at once
  • It is hard to tell whether a method worked

Synthetic data solves this:

  • We design the data-generating process
  • We know the true causal effect
  • We can check whether our method recovers it

Generating student characteristics

Each student has latent traits that influence their behavior.

import numpy as np

def generate_student_traits(num_students):
    return {
        "motivation": np.random.normal(0, 1, num_students),
        "ability": np.random.normal(0, 1, num_students),
        "discipline": np.random.normal(0, 1, num_students),
        "home_env": np.random.normal(0, 1, num_students),
    }

Think of these as underlying causes we do not control directly.

Generating attendance from traits

Attendance is not random.
It depends on student traits.

def generate_attendance(traits):
    return (
        0.8 * traits["motivation"]
        + 0.7 * traits["discipline"]
        + 0.6 * traits["ability"]
        + 0.4 * traits["home_env"]
        + np.random.normal(0, 1, len(traits["motivation"]))
    )

This reflects the real-world idea that motivated, disciplined students attend more.

Generating exam scores

Scores depend on both traits and attendance.

def generate_score(traits, attendance):
    return (
        1.2 * traits["ability"]
        + 1.0 * traits["motivation"]
        + 0.8 * traits["discipline"]
        + 0.6 * traits["home_env"]
        + 0.1 * attendance  # true causal effect
        + np.random.normal(0, 1, len(attendance))
    )

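The original notebook gathers these pieces into a single DataFrame called df, which the rest of the tutorial uses. Here is a minimal sketch of that assembly; the sample size and random seed are assumptions for illustration, not values taken from the original run.

import pandas as pd

np.random.seed(42)      # assumed seed, for reproducibility
num_students = 5000     # assumed sample size

traits = generate_student_traits(num_students)
attendance = generate_attendance(traits)
score = generate_score(traits, attendance)

# One row per student: the four latent traits plus attendance and score
df = pd.DataFrame({**traits, "attendance": attendance, "score": score})
print(df.head())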

Important detail:

  • Attendance has a small but real causal effect
  • Most of the signal comes from confounders

This is intentional.

The naive approach: regression on attendance

Most people would start with:

score ~ attendance

This answers:

Among students who attend more, do scores differ?

The answer is yes -- strongly.

But this is still an observational question.

It does not tell us what happens if we force attendance.
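To make the naive approach concrete, here is a minimal sketch using statsmodels (its formula interface; the column names match the df assembled above):

import statsmodels.formula.api as smf

# Naive observational regression: score on attendance, with no adjustment.
# For this data-generating process the attendance coefficient comes out
# around 0.98 -- roughly ten times the true causal effect of 0.1.
naive = smf.ols("score ~ attendance", data=df).fit()
print(naive.params["attendance"])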

Building the DoWhy causal model

Now we tell DoWhy what question we want to answer and under which assumptions.


from dowhy import CausalModel
import networkx as nx

# Convert pygraphviz.AGraph to networkx.DiGraph manually
nx_graph = nx.DiGraph()
for node in graph.nodes():
    nx_graph.add_node(str(node))
for edge in graph.edges():
    source = str(edge[0])
    target = str(edge[1])
    nx_graph.add_edge(source, target)

model = CausalModel(
    data=df,
    treatment="attendance",
    outcome="score",
    graph=nx_graph
)

This is where causal inference differs from ML.

We are not asking the model to discover structure.
We are telling it what structure we assume.

Identification: checking whether the question is answerable

Before estimating anything, DoWhy asks a surprisingly strict question:

Given our assumptions and our data, is the causal question even answerable?

This step is called identification.

At a high level, identification asks:

  • Are there paths that leak spurious association from attendance to score?
  • Do we observe enough variables to block those paths?

In causal language, these unwanted paths are called backdoor paths.

What is a backdoor path (intuitively)?

A backdoor path is any route along which association can flow between attendance and score without going through attendance's own effect on score -- it enters through an arrow pointing into attendance.

For example:

attendance ← motivation → score

This path creates correlation, even if attendance itself has no effect.

To block this path, we must condition on motivation.

What DoWhy does during identification

When we run:

estimand = model.identify_effect()
print(estimand)

We get:

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
      d                                                      
─────────────(E[score|motivation,ability,discipline,homeₑₙᵥ])
d[attendance]                                                
Estimand assumption 1, Unconfoundedness: If U→{attendance} and U→score then P(score|attendance,motivation,ability,discipline,home_env,U) = P(score|attendance,motivation,ability,discipline,home_env)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

### Estimand : 4
Estimand name: general_adjustment
Estimand expression:
      d                                                      
─────────────(E[score|motivation,ability,discipline,homeₑₙᵥ])
d[attendance]                                                
Estimand assumption 1, Unconfoundedness: If U→{attendance} and U→score then P(score|attendance,motivation,ability,discipline,home_env,U) = P(score|attendance,motivation,ability,discipline,home_env)

DoWhy inspects the causal graph and concludes:

  • There are confounding paths
  • All of them go through:
    • motivation
    • ability
    • discipline
    • home_env
  • We observe all of these variables

Therefore:

If we adjust for these variables, the causal effect of attendance is identifiable.

This is what the output is telling us, in mathematical language.

When you see:

E[score | motivation, ability, discipline, home_env]

you should read it as:

"Compare students who are similar in motivation, ability, discipline, and home environment -- and differ only in attendance."

This is the core causal idea.

This step asks:

  • Are all confounding paths blocked?
  • Do we observe the variables we need?

If the answer is no, DoWhy stops.

Estimation: approximating an intervention using data

Now that DoWhy has confirmed the causal effect is identifiable, we can estimate it.

We run:

estimate = model.estimate_effect(
    estimand,
    method_name="backdoor.linear_regression"
)
print(estimate.value)

We get:

0.0927561466288897

This answers the real question:

If we intervene and increase attendance, how much does score change?

This is mathematically different from correlation.

This number answers exactly one question:

If we intervene and increase attendance by one unit,
how much does the expected exam score change?

It does not mean:

  • “Students who attend more score higher”
  • "Attendance is a strong predictor"
  • “Attendance explains most variance in scores”

Those are observational statements.

This number is an interventional effect.

Why this number is so much smaller than the correlation

Earlier, the naive regression showed something like:

attendance coefficient ≈ 0.98

That number mixes together:

  • the true causal effect of attendance
  • the effect of motivation
  • the effect of ability
  • the effect of discipline
  • the effect of home environment

Once we explicitly adjust for those confounders, only the direct causal effect remains.

That direct effect is small — by construction.
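You can see the same collapse with an ordinary adjusted regression (a minimal sketch, reusing df and statsmodels from earlier): once the observed confounders enter the model, the attendance coefficient falls to roughly the true value.

import statsmodels.formula.api as smf

# Adjusted regression: condition on the observed confounders.
# The attendance coefficient should now sit near the true effect (0.1),
# in line with the DoWhy backdoor estimate above.
adjusted = smf.ols(
    "score ~ attendance + motivation + ability + discipline + home_env",
    data=df
).fit()
print(adjusted.params["attendance"])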

This is the “aha” moment.

The “aha” result

The causal estimate tells us:

Forcing students to attend more lectures has only a small effect on scores.

This does not mean:

  • Lectures are useless
  • Attendance does not matter
  • Teachers do not matter

It means something subtler:

Much of the advantage seen among high-attendance students
comes from who they are, not from attendance itself.

This explains why policies like mandatory attendance often:

  • fail to produce large gains
  • work only for specific subgroups
  • sometimes backfire by reducing motivation

Causal inference does not tell us what to do -- but it tells us what will not work as expected.

Refutation: stress-testing the causal claim

After estimating a causal effect, DoWhy encourages skepticism.

The question now becomes:

Could this effect be an artifact of chance, noise, or modeling quirks?

This is the purpose of refutation tests.

Refutation does not prove the result is correct.
It checks whether the result is fragile.

Adding a random confounder

One simple refutation test is:

What happens if we add a completely random variable
and pretend it is a confounder?

If the causal estimate changes dramatically, the result is unstable.

We run:

refutation = model.refute_estimate(
    estimand,
    estimate,
    method_name="random_common_cause"
)
print(refutation)

Typical output looks like:

Estimated effect: 0.0927
New effect:       0.0928
p-value:          0.84

How to read this result

This tells us:

  • Adding a fake confounder does not change the estimate
  • The causal effect is not sensitive to random noise
  • The result is reasonably stable

This increases our confidence -- not certainty, but confidence.

Causal inference is about disciplined doubt, not blind belief.
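If you want to push the doubt further, DoWhy ships additional refuters. One common choice, not shown in the run above and included here only as an illustrative sketch, replaces the real treatment with random noise and checks that the estimated effect collapses toward zero:

# Placebo treatment refuter: swap attendance for a random placebo variable.
# If the pipeline is honest, the reported "new effect" should be near zero.
placebo = model.refute_estimate(
    estimand,
    estimate,
    method_name="placebo_treatment_refuter"
)
print(placebo)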

What this example actually teaches

This example is deliberately simple, but the lesson is deep.

You learned that:

  • Prediction answers: What tends to happen?
  • Causation answers: What will happen if we act?

You learned that:

  • Correlation can be large
  • Causal effects can be small
  • Both can be true at the same time

You learned why:

  • More data is not enough
  • Better models are not enough
  • Explicit assumptions are unavoidable

Most importantly, you learned that:

Every action embeds a causal theory, whether we admit it or not.

DoWhy simply forces us to make that theory explicit -- and then checks whether the data can support it.

Try It Yourself

You can find the full example in the accompanying Google Colab notebook.