ACE estimations from real observational data
This notebook serves as a demonstration of the CausalEffectEstimation module for estimating causal effects on both generated and real datasets.
```python
import pyagrum as gum
import pyagrum.lib.discretizer as disc
import pyagrum.lib.notebook as gnb
import pyagrum.explain as gexpl
import pyagrum.causal as csl
import pyagrum.causal.notebook as cslnb

import pandas as pd
import matplotlib.pyplot as plt
```

In this example, we show how to estimate the ACE (Average Causal Effect) using real data that does not adhere to RCT conditions.
Dataset
The dataset under consideration is the Census Adult Income dataset. The objective of this analysis is to determine whether possessing a graduate degree increases the likelihood of earning an income exceeding $50,000 per year.
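The pickled file used below already ships the binary covariates precomputed. As a purely illustrative sketch, indicators of this kind could be derived from the raw Adult census columns along the following lines (the column names, category choices, and the tiny inline data here are assumptions for illustration, not the notebook's actual preprocessing):

```python
import pandas as pd

# Hypothetical raw Adult-style records (two rows for illustration only)
raw = pd.DataFrame({
    "age": [39, 50],
    "education": ["Masters", "HS-grad"],
    "sex": ["Female", "Male"],
    "income": [">50K", "<=50K"],
})

# Turn categorical columns into the 0/1 covariates used in this notebook
df = pd.DataFrame({
    "age": raw["age"],
    "hasGraduateDegree": raw["education"].isin(["Masters", "Doctorate"]).astype(int),
    "isFemale": (raw["sex"] == "Female").astype(int),
    "greaterThan50k": (raw["income"] == ">50K").astype(int),
})
```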
```python
df = pd.read_pickle("./res/df_causal_discovery.p")
df = df.rename(columns={"hours-per-week": "hoursPerWeek"})
df.describe()
```

| | age | hoursPerWeek | hasGraduateDegree | inRelationship | isWhite | isFemale | greaterThan50k |
|---|---|---|---|---|---|---|---|
| count | 29170.000000 | 29170.000000 | 29170.000000 | 29170.000000 | 29170.000000 | 29170.000000 | 29170.000000 |
| mean | 38.655674 | 40.447755 | 0.052348 | 0.406616 | 0.878334 | 0.331916 | 0.245835 |
| std | 13.722408 | 12.417203 | 0.222732 | 0.491211 | 0.326905 | 0.470909 | 0.430588 |
| min | 17.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 28.000000 | 40.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 50% | 37.000000 | 40.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 75% | 48.000000 | 45.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
| max | 90.000000 | 99.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
We begin by focusing exclusively on the age covariate to inform our estimations. We hypothesize that age is a causal factor influencing both the hasGraduateDegree variable and the greaterThan50k outcome.
```python
discretizer = disc.Discretizer(defaultDiscretizationMethod="NoDiscretization", defaultNumberOfBins=None)
template = discretizer.discretizedTemplate(df[["age", "hasGraduateDegree", "greaterThan50k"]])
template.addArcs([("age", "hasGraduateDegree"), ("age", "greaterThan50k"), ("hasGraduateDegree", "greaterThan50k")])

causal_model = csl.CausalModel(template)
cslnb.showCausalModel(causal_model, size="50")

T = "hasGraduateDegree"
Y = "greaterThan50k"
X = "age"

cee = csl.CausalEffectEstimation(df, causal_model)
```

Causal Identification
```python
cee.identifyAdjustmentSet(intervention=T, outcome=Y)
```

```
Backdoor adjustment found.

Supported estimators include:
- CausalModelEstimator
- SLearner
- TLearner
- XLearner
- PStratification
- IPW
```

'Backdoor'

Causal Estimation
```python
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 0.2332334047559898
```

```python
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 0.29760513570330316
```

```python
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 0.29038491617382495
```

Incorporating covariates and Unknown Adjustment
Let’s examine whether incorporating all available covariates influences the estimation of the ACE. We will employ structure learning techniques to determine the DAG that the algorithm identifies from the data.
```python
discretizer = disc.Discretizer(defaultNumberOfBins=5, defaultDiscretizationMethod="uniform")
template = discretizer.discretizedTemplate(df)
structure_learner = gum.BNLearner(df, template)

## we help the learning algorithm by giving it some causal constraints
structure_learner.setSliceOrder(
    [["isWhite", "isFemale"], ["age"], ["hasGraduateDegree", "inRelationship", "hoursPerWeek"]]
)
structure_learner.addNoParentNode("isWhite")
structure_learner.addNoParentNode("isFemale")
structure_learner.useNMLCorrection()
structure_learner.useSmoothingPrior(1e-6)

learned_bn = structure_learner.learnBN()
causal_model = csl.CausalModel(learned_bn)
gnb.sideBySide(gexpl.getInformation(learned_bn, size="50"))

cee = csl.CausalEffectEstimation(df, causal_model)
cee.identifyAdjustmentSet(intervention=T, outcome=Y)
```

In this scenario, no adjustment set is available for causal inference:
- This is not a randomized controlled trial (RCT) due to the presence of a backdoor path from hasGraduateDegree to greaterThan50k, which traverses through age and inRelationship.
- The backdoor criterion is not met, as every node is a descendant of the intervention variable hasGraduateDegree.
- The (Generalized) Frontdoor criterion is not applicable due to the absence of mediator variables.
- There are no (Generalized) Instrumental Variables, as hasGraduateDegree lacks any ancestors in the causal graph.
Consequently, the causal effect can only be estimated using the CausalBNEstimator, provided that the causal effect of the intervention on the outcome is identifiable through do-calculus.
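To make the do-calculus reduction concrete: when a backdoor set X exists, the adjusted effect is the sum over x of P(x)·[P(y | T=1, x) − P(y | T=0, x)]. Below is a minimal illustration of that adjustment formula on synthetic discrete data with a single binary confounder (a toy sketch of the principle, not the CausalBNEstimator's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.binomial(1, 0.4, size=n)                  # binary confounder
t = rng.binomial(1, np.where(x == 1, 0.7, 0.2))   # treatment depends on x
y = rng.binomial(1, 0.2 + 0.3 * t + 0.2 * x)      # outcome; true ACE = 0.3

# Backdoor adjustment: ACE = sum_x P(x) * (P(y=1 | t=1, x) - P(y=1 | t=0, x))
ace = 0.0
for xv in (0, 1):
    px = np.mean(x == xv)
    p1 = y[(t == 1) & (x == xv)].mean()
    p0 = y[(t == 0) & (x == xv)].mean()
    ace += px * (p1 - p0)
```

Note that the naive difference of means, `y[t == 1].mean() - y[t == 0].mean()`, would be biased upward here because x raises both the treatment probability and the outcome.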
```python
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 0.24151595674049395
```

We observe a stronger causal effect than the one obtained with the previously defined causal structure.
User-specified adjustment
Alternatively, it is possible to manually specify an adjustment set. However, it is important to note that this approach does not guarantee an asymptotically unbiased estimator.
```python
cee.useBackdoorAdjustment(T, Y, {"age", "inRelationship"})
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 0.2860493646590411
```

```python
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 0.2749304594008821
```

```python
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 0.2713082028700332
```

We obtain more consistent results when specifying the backdoor adjustment using a selected subset of covariates.
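For reference, the inverse propensity weighting (IPW) idea behind the last estimate can be reproduced by hand: weight each treated outcome by 1/e(X) and each control outcome by 1/(1 − e(X)), where e(X) = P(T=1 | X) is the propensity score. A minimal sketch on synthetic data where the propensity is known by construction (in practice e(X) must be estimated; this is not the module's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(size=n)            # confounder
e = 0.2 + 0.6 * x                  # true propensity P(T=1 | X), bounded away from 0 and 1
t = rng.binomial(1, e)
y = 0.3 * t + x + rng.normal(scale=0.1, size=n)   # true ACE = 0.3

# Horvitz-Thompson style IPW estimate of the ACE
ace_ipw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))
```

Keeping e(X) bounded away from 0 and 1 (overlap) matters: near-zero propensities blow up the weights and the variance of the estimator.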
CACE estimations
To estimate the Conditional Average Causal Effect (CACE) using CausalEffectEstimation, pandas query strings can be utilized to specify the relevant conditions. Additionally, the Individual Causal Effect (ICE) can be estimated by providing a pandas.DataFrame containing the features of the individual units.
```python
cee.fitTLearner()
tau_hat0 = cee.estimateCausalEffect(conditional="inRelationship == 0")
tau_hat1 = cee.estimateCausalEffect(conditional="inRelationship == 1")
print(f"CACE (inRelationship == 0) = {tau_hat0}")
print(f"CACE (inRelationship == 1) = {tau_hat1}")
```

```
CACE (inRelationship == 0) = 0.22769492525672883
CACE (inRelationship == 1) = 0.34386224091181283
```

```python
cee.fitTLearner()
tau_hat0 = cee.estimateCausalEffect(conditional="isFemale == 0")
tau_hat1 = cee.estimateCausalEffect(conditional="isFemale == 1")
print(f"CACE (isFemale == 0) = {tau_hat0}")
print(f"CACE (isFemale == 1) = {tau_hat1}")
```

```
CACE (isFemale == 0) = 0.2983895058122265
CACE (isFemale == 1) = 0.22771192020812456
```

```python
cee.fitTLearner()
changes = sorted(df["age"].unique())
booking = list()
for i in changes:
    tau_hat = cee.estimateCausalEffect(conditional=f"age <= {i}")
    booking.append(tau_hat)

plt.plot(changes, booking)
plt.title("CACE with \n(age <= x) as Conditional")
plt.xlabel("age")
plt.ylabel("CACE")
plt.show()
```

```python
cee.fitTLearner()
changes = sorted(df["hoursPerWeek"].unique())
booking = list()
for i in changes:
    tau_hat = cee.estimateCausalEffect(conditional=f"hoursPerWeek <= {i}")
    booking.append(tau_hat)

plt.plot(changes, booking)
plt.title("CACE with \n(hoursPerWeek <= x) as Conditional")
plt.xlabel("hoursPerWeek")
plt.ylabel("CACE")
plt.show()
```
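The T-Learner logic behind these CACE estimates can be sketched on synthetic data: fit one outcome model per treatment arm, take their difference per unit, and average over the conditioning subgroup. In this toy setup (an illustrative assumption, not CausalEffectEstimation's internals) the "models" are simply conditional means over a single binary covariate:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.binomial(1, 0.5, size=n)          # binary covariate
t = rng.binomial(1, 0.5, size=n)          # randomized treatment
tau = np.where(x == 1, 0.4, 0.1)          # heterogeneous true effect
y = 0.2 * x + tau * t + rng.normal(scale=0.1, size=n)

# T-Learner: one outcome model per arm (here: the conditional mean per x cell)
mu1 = {xv: y[(t == 1) & (x == xv)].mean() for xv in (0, 1)}
mu0 = {xv: y[(t == 0) & (x == xv)].mean() for xv in (0, 1)}

# Per-unit effect estimates, then averaged over the conditioning subgroup
cate = np.where(x == 1, mu1[1] - mu0[1], mu1[0] - mu0[0])
cace_x1 = cate[x == 1].mean()   # should recover roughly 0.4
cace_x0 = cate[x == 0].mean()   # should recover roughly 0.1
```

Averaging `cate` over the whole sample instead of a subgroup recovers the unconditional ACE, mirroring how `estimateCausalEffect` behaves with and without the `conditional` argument.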
