
ACE estimations from real observational data

This notebook serves as a demonstration of the CausalEffectEstimation module for estimating causal effects on both generated and real datasets.

import pyagrum as gum
import pyagrum.lib.discretizer as disc
import pyagrum.lib.notebook as gnb
import pyagrum.explain as gexpl
import pyagrum.causal as csl
import pyagrum.causal.notebook as cslnb
import pandas as pd
import matplotlib.pyplot as plt

In this example, we show how to estimate the ACE (Average Causal Effect) using real data that does not adhere to RCT conditions.

The dataset under consideration is the Census Adult Income dataset. The objective of this analysis is to determine whether possessing a graduate degree increases the likelihood of earning an income exceeding $50,000 per year.

df = pd.read_pickle("./res/df_causal_discovery.p")
df = df.rename(columns={"hours-per-week": "hoursPerWeek"})
df.describe()
        age            hoursPerWeek   hasGraduateDegree  inRelationship  isWhite        isFemale       greaterThan50k
count   29170.000000   29170.000000   29170.000000       29170.000000    29170.000000   29170.000000   29170.000000
mean    38.655674      40.447755      0.052348           0.406616        0.878334       0.331916       0.245835
std     13.722408      12.417203      0.222732           0.491211        0.326905       0.470909       0.430588
min     17.000000      1.000000       0.000000           0.000000        0.000000       0.000000       0.000000
25%     28.000000      40.000000      0.000000           0.000000        1.000000       0.000000       0.000000
50%     37.000000      40.000000      0.000000           0.000000        1.000000       0.000000       0.000000
75%     48.000000      45.000000      0.000000           1.000000        1.000000       1.000000       0.000000
max     90.000000      99.000000      1.000000           1.000000        1.000000       1.000000       1.000000

We begin by focusing exclusively on the age covariate to inform our estimations. We hypothesize that age is a causal factor influencing both the hasGraduateDegree variable and the greaterThan50k outcome.

discretizer = disc.Discretizer(defaultDiscretizationMethod="NoDiscretization", defaultNumberOfBins=None)
template = discretizer.discretizedTemplate(df[["age", "hasGraduateDegree", "greaterThan50k"]])
template.addArcs([("age", "hasGraduateDegree"), ("age", "greaterThan50k"), ("hasGraduateDegree", "greaterThan50k")])
causal_model = csl.CausalModel(template)
cslnb.showCausalModel(causal_model, size="50")

(figure: causal DAG with arcs age → hasGraduateDegree, age → greaterThan50k and hasGraduateDegree → greaterThan50k)

T = "hasGraduateDegree"
Y = "greaterThan50k"
X = "age"
cee = csl.CausalEffectEstimation(df, causal_model)
cee.identifyAdjustmentSet(intervention=T, outcome=Y)
Backdoor adjustment found.
Supported estimators include:
- CausalModelEstimator
- SLearner
- TLearner
- XLearner
- PStratification
- IPW
'Backdoor'
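All of the backdoor estimators listed above build on the same adjustment formula, ACE = Σₓ P(x) · (E[Y | T=1, X=x] − E[Y | T=0, X=x]). The following plain-Python illustration (not pyAgrum code, all numbers invented) evaluates this formula for a single binary confounder:

```python
# Plain-Python illustration (not pyAgrum code) of the backdoor adjustment:
#   ACE = sum_x P(x) * ( E[Y | T=1, X=x] - E[Y | T=0, X=x] )
# with a single binary confounder and made-up numbers.
p_x = {"young": 0.6, "old": 0.4}                  # P(X=x)
e_y = {("young", 0): 0.10, ("young", 1): 0.30,    # E[Y | X=x, T=t], keyed (x, t)
       ("old", 0): 0.25, ("old", 1): 0.50}

ace = sum(p * (e_y[(x, 1)] - e_y[(x, 0)]) for x, p in p_x.items())
print(round(ace, 2))  # 0.6 * 0.20 + 0.4 * 0.25 = 0.22
```

The estimators below differ only in how they estimate the stratum quantities from data, not in the target they converge to.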
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 0.2332334047559898
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 0.29760513570330316
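For intuition, here is a minimal stdlib-only sketch of the S-learner idea on simulated confounded data (the data-generating process and numbers are invented, not taken from the census dataset): a single outcome model μ(x, t) is fitted with the treatment as an ordinary input feature (here, plain stratum means), and the ACE is the sample average of μ(x, 1) − μ(x, 0).

```python
import random

random.seed(0)

# Simulated confounded data: age influences both treatment and outcome;
# the true causal effect of t on y is 0.25 by construction
n = 50_000
data = []
for _ in range(n):
    age = random.choice([0, 1])                          # 0 = under 40, 1 = 40+
    t = 1 if random.random() < 0.2 + 0.3 * age else 0    # P(T=1 | age)
    y = 1 if random.random() < 0.1 + 0.2 * age + 0.25 * t else 0
    data.append((age, t, y))

# "S-learner" outcome model mu(x, t): one model with the treatment as an
# ordinary feature -- here simply the stratum means over (age, t)
counts, sums = {}, {}
for age, t, y in data:
    counts[(age, t)] = counts.get((age, t), 0) + 1
    sums[(age, t)] = sums.get((age, t), 0) + y
mu = {k: sums[k] / counts[k] for k in counts}

# ACE = sample average of mu(x, 1) - mu(x, 0); should land near 0.25
ace = sum(mu[(age, 1)] - mu[(age, 0)] for age, _, _ in data) / n
print(round(ace, 3))
```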
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 0.29038491617382495
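The IPW (inverse propensity weighting) estimator instead reweights observed outcomes by the estimated propensity score e(x) = P(T=1 | X=x): ACE ≈ (1/n) Σᵢ [Tᵢ Yᵢ / e(Xᵢ) − (1 − Tᵢ) Yᵢ / (1 − e(Xᵢ))]. A self-contained toy sketch on simulated data (not pyAgrum code, all numbers invented):

```python
import random

random.seed(1)

# Simulated confounded data: X -> T, X -> Y, T -> Y (true effect 0.25)
n = 50_000
rows = []
for _ in range(n):
    x = random.choice([0, 1])
    t = 1 if random.random() < 0.2 + 0.3 * x else 0      # true propensity
    y = 1 if random.random() < 0.1 + 0.2 * x + 0.25 * t else 0
    rows.append((x, t, y))

# Estimate the propensity score e(x) = P(T=1 | X=x) by stratum frequencies
n_x, t_x = {0: 0, 1: 0}, {0: 0, 1: 0}
for x, t, _ in rows:
    n_x[x] += 1
    t_x[x] += t
e_hat = {x: t_x[x] / n_x[x] for x in n_x}

# IPW estimate: reweight outcomes by the inverse of the propensity score
ace = sum(t * y / e_hat[x] - (1 - t) * y / (1 - e_hat[x])
          for x, t, y in rows) / n
print(round(ace, 3))
```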

Incorporating covariates and Unknown Adjustment


Let’s examine whether incorporating all available covariates influences the estimation of the ACE. We will employ structure learning techniques to determine the DAG that the algorithm identifies from the data.

discretizer = disc.Discretizer(defaultNumberOfBins=5, defaultDiscretizationMethod="uniform")
template = discretizer.discretizedTemplate(df)
structure_learner = gum.BNLearner(df, template)
## we help the learning algorithm by giving it some causal constraints
structure_learner.setSliceOrder(
[["isWhite", "isFemale"], ["age"], ["hasGraduateDegree", "inRelationship", "hoursPerWeek"]]
)
structure_learner.addNoParentNode("isWhite")
structure_learner.addNoParentNode("isFemale")
structure_learner.useNMLCorrection()
structure_learner.useSmoothingPrior(1e-6)
learned_bn = structure_learner.learnBN()
causal_model = csl.CausalModel(learned_bn)
gnb.sideBySide(gexpl.getInformation(learned_bn, size="50"))
(figure: learned DAG with arcs age → hasGraduateDegree, age → inRelationship, age → greaterThan50k, age → hoursPerWeek, isWhite → inRelationship, isWhite → greaterThan50k, isWhite → hoursPerWeek, isFemale → inRelationship, isFemale → hoursPerWeek, inRelationship → greaterThan50k, inRelationship → hoursPerWeek, hasGraduateDegree → greaterThan50k, greaterThan50k → hoursPerWeek)
cee = csl.CausalEffectEstimation(df, causal_model)
cee.identifyAdjustmentSet(intervention=T, outcome=Y)

In this scenario, no adjustment set is available for causal inference:

  • This is not a randomized controlled trial (RCT): backdoor paths from hasGraduateDegree to greaterThan50k pass through age and inRelationship.
  • The backdoor criterion cannot be satisfied by any subset of the observed covariates.
  • The (generalized) frontdoor criterion is not applicable, due to the absence of mediator variables between hasGraduateDegree and greaterThan50k.
  • There are no (generalized) instrumental variables for hasGraduateDegree in this graph.

Consequently, the causal effect can only be estimated using the CausalBNEstimator, provided that the causal effect of the intervention on the outcome is identifiable through do-calculus.
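Conceptually, the CausalBNEstimator evaluates interventional quantities by truncated factorization: the CPT of the intervention node is dropped and the node is clamped to its intervention value, i.e. inference is carried out in the mutilated graph where all arcs into the intervention node are removed. A hand-rolled sketch on a three-node binary network with made-up CPTs (not pyAgrum code):

```python
# A tiny discrete causal BN over binary variables X -> T, X -> Y, T -> Y,
# with invented CPTs.
p_x = {0: 0.5, 1: 0.5}                        # P(X=x)
p_t = {0: 0.2, 1: 0.6}                        # P(T=1 | X=x): unused under do()
p_y = {(0, 0): 0.10, (0, 1): 0.35,            # P(Y=1 | T=t, X=x), keyed (t, x)
       (1, 0): 0.30, (1, 1): 0.55}

def p_y1_do(t):
    # Truncated factorization: the factor P(T | X) is dropped and T is
    # clamped to t -- inference in the mutilated graph with no arcs into T
    return sum(p_x[x] * p_y[(t, x)] for x in p_x)

ace = p_y1_do(1) - p_y1_do(0)
print(ace)
```

Note that the observational factor P(T | X) plays no role in the interventional distribution; that is exactly what distinguishes P(Y | do(T=t)) from P(Y | T=t).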

cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 0.24151595674049395

The estimated causal effect under this learned structure is slightly stronger than the one obtained with the simpler, age-only model defined earlier (0.2415 versus 0.2332).

Alternatively, it is possible to manually specify an adjustment set. However, it is important to note that this approach does not guarantee an asymptotically unbiased estimator.

cee.useBackdoorAdjustment(T, Y, {"age", "inRelationship"})
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 0.2860493646590411
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 0.2749304594008821
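Unlike the S-learner, the T-learner fits one outcome model per treatment arm and averages the predicted differences over the whole sample. A stdlib-only sketch on simulated data, using a hand-rolled one-covariate OLS as the outcome model (the data-generating process and all numbers are invented):

```python
import random

random.seed(2)

# Simulated data with a continuous, age-like confounder x: x raises both
# the treatment probability and the outcome; the true effect of t is 0.25
n = 20_000
rows = []
for _ in range(n):
    x = random.uniform(20, 60)
    t = 1 if random.random() < x / 100 else 0
    y = 0.01 * x + 0.25 * t + random.gauss(0, 0.1)
    rows.append((x, t, y))

def fit_line(pts):
    # One-covariate ordinary least squares: returns (intercept, slope)
    m = len(pts)
    mx = sum(x for x, _ in pts) / m
    my = sum(y for _, y in pts) / m
    b = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    return my - b * mx, b

# T-learner: a separate outcome model per treatment arm
a1, b1 = fit_line([(x, y) for x, t, y in rows if t == 1])
a0, b0 = fit_line([(x, y) for x, t, y in rows if t == 0])

# ACE = average predicted difference mu1(x) - mu0(x) over all units
ace = sum((a1 + b1 * x) - (a0 + b0 * x) for x, _, _ in rows) / n
print(round(ace, 3))
```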
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 0.2713082028700332

The three estimators yield more consistent results (all between 0.27 and 0.29) when the backdoor adjustment is specified with this selected subset of covariates.

To estimate the Conditional Average Causal Effect (CACE) using CausalEffectEstimation, pandas query strings can be utilized to specify the relevant conditions. Additionally, the Individual Causal Effect (ICE) can be estimated by providing a pandas.DataFrame containing the features of the individual units.
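The CACE is simply the causal effect averaged over the subgroup matching the condition, i.e. the mean of the per-unit effect estimates tau(x) over the selected rows. A trivial sketch of that idea (the per-unit estimates below are invented):

```python
# Sketch of the CACE idea: average the per-unit effect estimates tau(x)
# only over the units that satisfy the condition (all numbers invented).
units = [(25, 0.18), (35, 0.21), (45, 0.27), (55, 0.31)]   # (age, tau_hat)

selected = [tau for age, tau in units if age >= 40]        # condition "age >= 40"
cace = sum(selected) / len(selected)
print(round(cace, 2))  # (0.27 + 0.31) / 2 = 0.29
```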

cee.fitTLearner()
tau_hat0 = cee.estimateCausalEffect(conditional="inRelationship == 0")
tau_hat1 = cee.estimateCausalEffect(conditional="inRelationship == 1")
print(f"CACE (inRelationship == 0) = {tau_hat0}")
print(f"CACE (inRelationship == 1) = {tau_hat1}")
CACE (inRelationship == 0) = 0.22769492525672883
CACE (inRelationship == 1) = 0.34386224091181283
cee.fitTLearner()
tau_hat0 = cee.estimateCausalEffect(conditional="isFemale == 0")
tau_hat1 = cee.estimateCausalEffect(conditional="isFemale == 1")
print(f"CACE (isFemale == 0) = {tau_hat0}")
print(f"CACE (isFemale == 1) = {tau_hat1}")
CACE (isFemale == 0) = 0.2983895058122265
CACE (isFemale == 1) = 0.22771192020812456
cee.fitTLearner()
changes = sorted(df["age"].unique())
booking = list()
for i in changes:
    tau_hat = cee.estimateCausalEffect(conditional=f"age <= {i}")
    booking.append(tau_hat)
plt.plot(changes, booking)
plt.title("CACE with \n(age <= x) as Conditional")
plt.xlabel("age")
plt.ylabel("CACE")
plt.show()

(plot: CACE with (age <= x) as conditional; CACE against age)

cee.fitTLearner()
changes = sorted(df["hoursPerWeek"].unique())
booking = list()
for i in changes:
    tau_hat = cee.estimateCausalEffect(conditional=f"hoursPerWeek <= {i}")
    booking.append(tau_hat)
plt.plot(changes, booking)
plt.title("CACE with \n(hoursPerWeek <= x) as Conditional")
plt.xlabel("hoursPerWeek")
plt.ylabel("CACE")
plt.show()

(plot: CACE with (hoursPerWeek <= x) as conditional; CACE against hoursPerWeek)