Causal Effect Estimation in datasets
This notebook serves as a demonstration of the CausalEffectEstimation module for estimating causal effects on both generated and real datasets.
```python
import pyagrum as gum
import pyagrum.lib.discretizer as disc
import pyagrum.lib.notebook as gnb
import pyagrum.explain as gexpl
import pyagrum.causal as csl
import pyagrum.causal.notebook as cslnb

import numpy as np
import pandas as pd
```

1. ACE estimations from generated RCT data

To estimate the Average Causal Effect (ACE) from data generated under conditions resembling a Randomized Controlled Trial (RCT), we define two generative models.
1.1 Dataset
Consider two generative models:
- A linear generative model described by the equation:

$$Y = 3X_1 + 2X_2 - 2X_3 - 0.8X_4 + T\,(2X_1 + 5X_3 + 3X_4) + \varepsilon$$

- And a non-linear generative model described by the equation:

$$Y = 3X_1 + 2X_2^2 - 2X_3 - 0.8X_4 + 10\,T + \varepsilon$$

where $X_i \sim \mathcal{N}(1, 1)$ for $i = 1, \dots, 4$, $T \sim \mathcal{B}(0.5)$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ are jointly independent in both of the models.
Data from the models can be obtained using the functions given below.
```python
def linear_simulation(n: int = 100000, sigma: float = 1) -> pd.DataFrame:
  """
  Returns n observations from the linear model with normally distributed
  noise with expected value 0 and standard deviation sigma.
  """
  X1 = np.random.normal(1, 1, n)
  X2 = np.random.normal(1, 1, n)
  X3 = np.random.normal(1, 1, n)
  X4 = np.random.normal(1, 1, n)
  epsilon = np.random.normal(0, sigma, n)
  T = np.random.binomial(1, 0.5, n)
  Y = 3 * X1 + 2 * X2 - 2 * X3 - 0.8 * X4 + T * (2 * X1 + 5 * X3 + 3 * X4) + epsilon

  d = np.array([T, X1, X2, X3, X4, Y])
  df_data = pd.DataFrame(data=d.T, columns=["T", "X1", "X2", "X3", "X4", "Y"])
  df_data["T"] = df_data["T"].astype(int)

  return df_data


def non_linear_simulation(n: int = 100000, sigma: float = 1) -> pd.DataFrame:
  """
  Returns n observations from the non-linear model with normally distributed
  noise with expected value 0 and standard deviation sigma.
  """
  X1 = np.random.normal(1, 1, n)
  X2 = np.random.normal(1, 1, n)
  X3 = np.random.normal(1, 1, n)
  X4 = np.random.normal(1, 1, n)
  epsilon = np.random.normal(0, sigma, n)
  T = np.random.binomial(1, 0.5, n)
  Y = 3 * X1 + 2 * X2**2 - 2 * X3 - 0.8 * X4 + 10 * T + epsilon

  d = np.array([T, X1, X2, X3, X4, Y])
  df_data = pd.DataFrame(data=d.T, columns=["T", "X1", "X2", "X3", "X4", "Y"])
  df_data["T"] = df_data["T"].astype(int)

  return df_data
```

Furthermore, the expected values of $Y$ under $do(T=1)$ and $do(T=0)$ can be explicitly calculated, providing the theoretical ACE of $\tau = 10$, which will serve as a point of reference for the estimations.
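As a quick sanity check, linearity of expectation yields the same value for both models:

$$\tau_{\text{lin}} = \mathbb{E}[2X_1 + 5X_3 + 3X_4] = 2 + 5 + 3 = 10, \qquad \tau_{\text{non-lin}} = 10,$$

the latter because $T$ enters the non-linear model only through the additive term $10\,T$.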
We will explore how the CausalEffectEstimation module can estimate the causal effect of $T$ on $Y$ in both of the generated datasets.
1.2 Setup
First, generate the data using the previously defined functions, and specify the causal graph of the variables. A single graph will be applicable to both datasets, as they share the same variables and causal structure.
```python
linear_data = linear_simulation()
non_linear_data = non_linear_simulation()

bn = gum.fastBN("Y<-T; Y<-X1; Y<-X2; Y<-X3; Y<-X4")
causal_model = csl.CausalModel(bn)

cslnb.showCausalModel(causal_model, size="10")
```

If the CausalBNEstimator method is not employed for estimation, the model generated using gum.fastBN will suffice, as only the graph structure of the Causal Bayesian Network is used for causal identification.
However, if the Causal Bayesian Network estimator is utilized, it will be necessary to provide a csl.CausalModel object with an appropriate discretization, as the Conditional Probability Tables of the model will be used for estimation. Here we use the discretizer module to perform this task; the arcs are added manually.
Selecting an optimal discretization is crucial: a coarse discretization may lead to poor estimation due to its inability to capture fine variations in the distribution, while an overly fine discretization may result in too many parameters, making it difficult for the parameter learning algorithm to accurately estimate the distribution. Therefore, the discretization should strike a balance, being neither too coarse nor too fine.
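For instance, the bin count can be tuned per variable rather than globally; a small sketch (the 40-bin value for the outcome is only an illustration, not the setting used below):

```python
# sketch: tune the coarse/fine trade-off per variable
discretizer = disc.Discretizer(defaultDiscretizationMethod="uniform", defaultNumberOfBins=20)
discretizer.setDiscretizationParameters("Y", "uniform", 40)  # finer grid for the outcome only
```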
```python
discretizer = disc.Discretizer(defaultDiscretizationMethod="uniform", defaultNumberOfBins=20)
disc_bn = discretizer.discretizedTemplate(linear_data)

disc_bn.beginTopologyTransformation()
for node in {"T", "X1", "X2", "X3", "X4"}:
  disc_bn.addArc(node, "Y")
disc_bn.endTopologyTransformation()

disc_causal_model = csl.CausalModel(disc_bn)
```
```python
for id, _ in disc_bn:
  print(disc_bn.variable(id))
```

```
T:Range([0,1])
X1:Discretized(<(-3.238087879615067;-2.7978695727616936[, ..., [5.12605995059903;5.566278257452405)>)
X2:Discretized(<(-3.3841273691061735;-2.9593058280365927[, ..., [4.687481911215859;5.112303452285439)>)
X3:Discretized(<(-3.3705253359633858;-2.8982361076803134[, ..., [5.602970001414986;6.075259229698059)>)
X4:Discretized(<(-3.596251977481372;-3.1515115564779235[, ..., [4.853816021584152;5.298556442587602)>)
Y:Discretized(<(-15.86540786470282;-12.92263197338152[, ..., [40.04733407040189;42.990109961723185)>)
```

(Each continuous variable is discretized into 20 uniform bins; the full lists of bin edges are abbreviated here.)

We are now prepared to instantiate the CausalEffectEstimation object for both datasets.
```python
linear_cee = csl.CausalEffectEstimation(linear_data, disc_causal_model)
print(linear_cee)
```

```
<pyagrum.causal.causalEffectEstimation._CausalEffectEstimation.CausalEffectEstimation object at 0x1174957f0>
Dataframe    : <pandas.core.frame.DataFrame object at 0x10fce97f0>
  - shape        : (100000, 6)
  - columns      : Index(['T', 'X1', 'X2', 'X3', 'X4', 'Y'], dtype='object')
  - memory usage : 4.800132 MB
Causal Model : <pyagrum.causal._CausalModel.CausalModel object at 0x1174b1950>
  - names     : {0: 'T', 1: 'X1', 2: 'X2', 3: 'X3', 4: 'X4', 5: 'Y'}
  - causal BN : BN{nodes: 6, arcs: 5, domainSize: 10^6.80618, dim: 6080077, mem: 48Mo 848Ko 656o}
  - observ. BN: BN{nodes: 6, arcs: 5, domainSize: 10^6.80618, dim: 6080077, mem: 48Mo 848Ko 656o}
```

```python
non_linear_cee = csl.CausalEffectEstimation(non_linear_data, disc_causal_model)
print(non_linear_cee)
```

```
<pyagrum.causal.causalEffectEstimation._CausalEffectEstimation.CausalEffectEstimation object at 0x1174b1810>
Dataframe    : <pandas.core.frame.DataFrame object at 0x1166565d0>
  - shape        : (100000, 6)
  - columns      : Index(['T', 'X1', 'X2', 'X3', 'X4', 'Y'], dtype='object')
  - memory usage : 4.800132 MB
Causal Model : <pyagrum.causal._CausalModel.CausalModel object at 0x1174b1950>
  - names     : {0: 'T', 1: 'X1', 2: 'X2', 3: 'X3', 4: 'X4', 5: 'Y'}
  - causal BN : BN{nodes: 6, arcs: 5, domainSize: 10^6.80618, dim: 6080077, mem: 48Mo 848Ko 656o}
  - observ. BN: BN{nodes: 6, arcs: 5, domainSize: 10^6.80618, dim: 6080077, mem: 48Mo 848Ko 656o}
```

1.3 Causal Identification

The subsequent step involves identifying the causal criterion to be used for estimation. This is crucial, as most estimators rely on strong assumptions regarding the underlying causal structure of the data-generating process. Incorrect specification of the adjustment may compromise the guarantee of asymptotic normality.
```python
linear_cee.identifyAdjustmentSet(intervention="T", outcome="Y")
non_linear_cee.identifyAdjustmentSet(intervention="T", outcome="Y", verbose=False)
```

```
Randomized Controlled Trial adjustment found.
Supported estimators include:
- CausalModelEstimator
- DM
If the outcome variable is a cause of other covariates in the causal graph,
Backdoor estimators may also be used.

'Randomized Controlled Trial'
```

Consistent with the data generation, the adjustment identified is the Randomized Controlled Trial adjustment. This yields a list of the various estimators supported by this adjustment.
1.4 Causal Effect Estimation
Once the adjustment is identified, we can proceed to estimation using the supported estimators. First, the estimator must be fitted to the data.
```python
linear_cee.fitDM()
non_linear_cee.fitDM()
```

The estimation is obtained by calling the estimateCausalEffect method.
```python
linear_tau_hat = linear_cee.estimateCausalEffect()
non_linear_tau_hat = non_linear_cee.estimateCausalEffect()

print(f"ACE linear = {linear_tau_hat}, MAPE = {abs((linear_tau_hat - 10) * 10)} %")
print(f"ACE non linear = {non_linear_tau_hat}, MAPE = {abs((non_linear_tau_hat - 10) * 10)} %")
```

```
ACE linear = 10.044051643889613, MAPE = 0.44051643889613246 %
ACE non linear = 10.100614352555777, MAPE = 1.0061435255577678 %
```

The difference-in-means estimator, the simplest estimator of the ACE, yields fairly accurate results. This is expected in an RCT environment, where the treatment is independent of the confounders, making intervention equivalent to observation.
1.5 User specified adjustment
If the user wishes to apply an alternative adjustment, they may specify their own set of variables for each component of the adjustment. However, note that such custom adjustments do not guarantee unbiased estimation and may not ensure an error-free estimation process.
```python
linear_cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"X1", "X2", "X3", "X4"})
non_linear_cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"X1", "X2", "X3", "X4"})

linear_cee.fitSLearner()
non_linear_cee.fitSLearner()

linear_tau_hat = linear_cee.estimateCausalEffect()
non_linear_tau_hat = non_linear_cee.estimateCausalEffect()

print(f"ACE linear = {linear_tau_hat}, MAPE = {abs((linear_tau_hat - 10) * 10)} %")
print(f"ACE non linear = {non_linear_tau_hat}, MAPE = {abs((non_linear_tau_hat - 10) * 10)} %")
```

```
ACE linear = 10.02838981756783, MAPE = 0.2838981756782921 %
ACE non linear = 10.016633052461158, MAPE = 0.1663305246115776 %
```

In this case, the RCT setting also supports backdoor adjustment, so we again obtain fairly accurate estimations. Let's see how the estimation behaves if we specify an incompatible adjustment.
1.6 Incorrectly specified adjustment
We will use the frontdoor adjustment to illustrate the behaviour of an incorrectly specified adjustment.
```python
linear_cee.useFrontdoorAdjustment(intervention="T", outcome="Y", mediators={"X1", "X2", "X3", "X4"})
non_linear_cee.useFrontdoorAdjustment(intervention="T", outcome="Y", mediators={"X1", "X2", "X3", "X4"})

linear_cee.fitSimplePlugIn()
non_linear_cee.fitSimplePlugIn()

linear_tau_hat = linear_cee.estimateCausalEffect()
non_linear_tau_hat = non_linear_cee.estimateCausalEffect()

print(f"ACE linear = {linear_tau_hat}, MAPE = {abs((linear_tau_hat - 10) * 10)} %")
print(f"ACE non linear = {non_linear_tau_hat}, MAPE = {abs((non_linear_tau_hat - 10) * 10)} %")
```

```
ACE linear = 0.015629403824496978, MAPE = 99.84370596175503 %
ACE non linear = 0.08405011654987375, MAPE = 99.15949883450126 %
```

As anticipated, using an incorrect adjustment set results in a heavily biased estimation. In this case, the ACE is close to zero, indicating that the estimator incorrectly predicts no causal effect of $T$ on $Y$.
1.7 Causal Bayesian Network Estimation
To fully utilize the causal model within the estimation process, we will now use the Conditional Probability Tables of the Causal Bayesian Network via the CausalBNEstimator, rather than relying solely on the underlying causal graph. The procedure will follow the same methodology as previously applied.
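For intuition, here is a minimal sketch (not the module's implementation) of how one could approximate $\mathbb{E}[Y \mid do(T=t)]$ directly from the causal BN with csl.causalImpact; it assumes causalImpact returns the adjusted distribution over Y as its second element, that the discretized Y exposes its bin edges via ticks(), and that the CPTs have already been learned from the data (fitCausalBNEstimator takes care of that):

```python
# sketch: expectation of discretized Y under do(T=t), using bin midpoints
def expected_Y_do(cm, bn, t):
  _, adj, _ = csl.causalImpact(cm, on="Y", doing="T", values={"T": t})
  ticks = bn.variableFromName("Y").toDiscretizedVar().ticks()  # bin edges
  mids = [(a + b) / 2 for a, b in zip(ticks[:-1], ticks[1:])]
  return float(np.dot(adj.toarray(), mids))

tau_bn = expected_Y_do(disc_causal_model, disc_bn, 1) - expected_Y_do(disc_causal_model, disc_bn, 0)
```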
```python
linear_cee.identifyAdjustmentSet(intervention="T", outcome="Y", verbose=False)
non_linear_cee.identifyAdjustmentSet(intervention="T", outcome="Y", verbose=False)
```

```
'Randomized Controlled Trial'
```

```python
linear_cee.fitCausalBNEstimator()
non_linear_cee.fitCausalBNEstimator()

linear_tau_hat = linear_cee.estimateCausalEffect()
non_linear_tau_hat = non_linear_cee.estimateCausalEffect()

print(f"ACE linear = {linear_tau_hat}, MAPE = {abs((linear_tau_hat - 10) * 10)} %")
print(f"ACE non linear = {non_linear_tau_hat}, MAPE = {abs((non_linear_tau_hat - 10) * 10)} %")
```

```
ACE linear = 9.126934706670248, MAPE = 8.73065293329752 %
ACE non linear = 9.0883428003703, MAPE = 9.116571996296994 %
```

2. ACE estimations from real RCT data

In this section, we will estimate the ACE under real-world RCT conditions.
2.1 Dataset
The data used in this notebook come from the Tennessee Student/Teacher Achievement Ratio (STAR) trial. This randomized controlled trial was designed to assess the effects of smaller class sizes in primary schools (T) on students' academic performance (Y).
The covariates in this study include:
- gender
- age
- g1freelunch: the number of lunches provided to the child per day
- g1surban: the location of the school (inner city, suburban, rural or urban)
- ethnicity
```python
## Preprocessing

## Load data - read everything as a string and then cast
star_df = pd.read_csv("./res/STAR_data.csv", sep=",", dtype=str)
star_df = star_df.rename(columns={"race": "ethnicity"})

## Fill na
star_df = star_df.fillna({"g1freelunch": 0, "g1surban": 0})
drop_star_l = [
  "g1tlistss",
  "g1treadss",
  "g1tmathss",
  "g1classtype",
  "birthyear",
  "birthmonth",
  "birthday",
  "gender",
  "ethnicity",
  "g1freelunch",
  "g1surban",
]
star_df = star_df.dropna(subset=drop_star_l, how="any")

## Cast value types before processing
star_df["gender"] = star_df["gender"].astype(int)
star_df["ethnicity"] = star_df["ethnicity"].astype(int)
star_df["g1freelunch"] = star_df["g1freelunch"].astype(int)
star_df["g1surban"] = star_df["g1surban"].astype(int)
star_df["g1classtype"] = star_df["g1classtype"].astype(int)

## Keep only class type 1 and 2 (in the initial trial,
## 3 class types were attributed and the third one was big classes
## but with a teaching assistant)
star_df = star_df[~(star_df["g1classtype"] == 3)].reset_index(drop=True)

## Compute the outcome
star_df["Y"] = (
  star_df["g1tlistss"].astype(int)
  + star_df["g1treadss"].astype(int)
  + star_df["g1tmathss"].astype(int)
) / 3

## Compute the treatment
star_df["T"] = star_df["g1classtype"].apply(lambda x: 0 if x == 2 else 1)

## Transform date to obtain age (Notice: if na --> date is NaT)
star_df["date"] = pd.to_datetime(
  star_df["birthyear"] + "/" + star_df["birthmonth"] + "/" + star_df["birthday"],
  yearfirst=True,
  errors="coerce",
)
star_df["age"] = np.datetime64("1985-01-01") - star_df["date"]
star_df["age"] = star_df["age"].dt.days / 365.25

## Keep only covariates we consider predictive of the outcome
star_covariates_l = ["gender", "ethnicity", "age", "g1freelunch", "g1surban"]
star_df = star_df[["Y", "T"] + star_covariates_l]

## Map numerical to categorical
star_df["gender"] = star_df["gender"].apply(lambda x: "Girl" if x == 2 else "Boy").astype("category")
star_df["ethnicity"] = (
  star_df["ethnicity"]
  .map({1: "White", 2: "Black", 3: "Asian", 4: "Hispanic", 5: "Nat_American", 6: "Other"})
  .astype("category")
)
star_df["g1surban"] = (
  star_df["g1surban"].map({1: "Inner_city", 2: "Suburban", 3: "Rural", 4: "Urban"}).astype("category")
)
```

```python
star_df.describe()
```

|  | Y | T | age | g1freelunch |
|---|---|---|---|---|
| count | 4215.000000 | 4215.000000 | 4215.000000 | 4215.000000 |
| mean | 540.095848 | 0.428233 | 4.879872 | 1.471886 |
| std | 39.267221 | 0.494881 | 0.465104 | 0.534171 |
| min | 439.333333 | 0.000000 | 3.129363 | 0.000000 |
| 25% | 511.333333 | 0.000000 | 4.525667 | 1.000000 |
| 50% | 537.333333 | 0.000000 | 4.818617 | 1.000000 |
| 75% | 566.000000 | 1.000000 | 5.111567 | 2.000000 |
| max | 670.666667 | 1.000000 | 7.225188 | 2.000000 |
It appears that there are more units in the control group. However, the control and treatment groups appear to be similar in distribution, indicating that the ignorability assumption is likely satisfied.
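This can be checked quickly with a balance diagnostic (a sketch, not part of the original pipeline), comparing covariate summaries across the two arms:

```python
# numeric covariates: compare means across treatment arms
print(star_df.groupby("T")[["age", "g1freelunch"]].mean())

# categorical covariates: compare proportions within each arm
print(pd.crosstab(star_df["T"], star_df["gender"], normalize="index"))
print(pd.crosstab(star_df["T"], star_df["g1surban"], normalize="index"))
```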
We will explore how the CausalEffectEstimation module can estimate the causal effect of T on Y in this dataset.
2.2 Structure Learning and Setup
In the absence of a predefined causal structure, structure learning is utilized to uncover the underlying relationships between the variables in the dataset. To facilitate this process, a slice order will be imposed on the variables. This approach will serve as the foundation for deriving the necessary causal structure for subsequent analysis.
To enable the application of structure learning algorithms, the variables will first be discretized using the discretizer module. Following this, the causal structure will be derived using gum.BNLearner.
```python
discretizer = disc.Discretizer(defaultDiscretizationMethod="uniform")
discretizer.setDiscretizationParameters("age", "uniform", 24)
discretizer.setDiscretizationParameters("Y", "uniform", 30)

template = discretizer.discretizedTemplate(star_df)

learner = gum.BNLearner(star_df, template)
learner.useNMLCorrection()
learner.useSmoothingPrior(1e-6)
learner.setSliceOrder(
  [
    ["T", "ethnicity", "gender", "age"],
    ["g1surban", "g1freelunch"],
    ["Y"],
  ]
)
bn = learner.learnBN()

print(learner)
```

```
Filename       : /var/folders/r1/pj4vdx_n4_d_xpsb04kzf97r0000gp/T/tmpmh8v271m.csv
Size           : (4215,7)
Variables      : Y[30], T[2], gender[2], ethnicity[6], age[24], g1freelunch[3], g1surban[4]
Induced types  : False
Missing values : False
Algorithm      : MIIC
Correction     : NML
Prior          : Smoothing
Prior weight   : 0.000001
Constraint Slice Order : {ethnicity:0, T:0, g1surban:1, age:0, gender:0, g1freelunch:1, Y:2}
```

```python
gnb.sideBySide(gexpl.getInformation(bn, size="50"), gnb.getInference(bn, size="50"))
```

This initial approach appears promising, as the inferred causal relationships are somewhat consistent with what might be expected from a non-expert perspective.
Now, given the causal structure, we are set to instantiate the CausalEffectEstimation class to perform estimation.

```python
causal_model = csl.CausalModel(bn)
cee = csl.CausalEffectEstimation(star_df, causal_model)
```

2.3 Causal Identification
The next step involves formal causal identification. As expected, we identify the RCT adjustment, consistent with the experimental design.
```python
cee.identifyAdjustmentSet(intervention="T", outcome="Y")
```

```
Randomized Controlled Trial adjustment found.
Supported estimators include:
- CausalModelEstimator
- DM
If the outcome variable is a cause of other covariates in the causal graph,
Backdoor estimators may also be used.

'Randomized Controlled Trial'
```

2.4 Causal Effect Estimation
Once the adjustment is identified, we can use the appropriate estimators for estimation.
```python
cee.fitDM()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 12.814738911047016
```

```python
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 11.51523574877784
```

Let's evaluate how the backdoor adjustment estimators compare to the previously obtained estimates. In this analysis, we control for the g1freelunch variable.

```python
cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"g1freelunch"})

cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 11.616979201549668
```

```python
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 11.61670551692447
```

```python
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 11.77382443916544
```

```python
cee.fitPStratification()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
```

```
ACE = 1.0559124078516446
```

3. ACE estimations from generated observational data
Next, we will estimate the ACE using generated data that does not adhere to RCT conditions. The generative model follows the specification from Lunceford & Davidian (2004).
3.1 Dataset
Consider the following generative model:
- The outcome variable is generated according to the following equation:

$$Y = \nu_0 + \nu_1 X_1 + \nu_2 X_2 + \nu_3 X_3 + \xi_1 V_1 + \xi_2 V_2 + \xi_3 V_3 + \nu_4 T + \varepsilon$$

Where $\nu = (\nu_0, \nu_1, \nu_2, \nu_3, \nu_4) = (0, -1, 1, -1, 2)$, $\xi = (\xi_1, \xi_2, \xi_3) = (-1, 1, 1)$, and $\varepsilon \sim \mathcal{N}(0, 1)$.

- The covariate $X_3$ is distributed as $\mathcal{B}(0.2)$, and $V_3 \mid X_3 \sim \mathcal{B}(0.75\,X_3 + 0.25\,(1 - X_3))$. Conditionally on $X_3$, the distribution of the other variables is defined as:

If $X_3 = 0$, $(X_1, V_1, X_2, V_2)^\top \sim \mathcal{N}(\tau_0, \Sigma)$ with $\tau_0 = (-1, -1, 1, 1)^\top$, and

If $X_3 = 1$, $(X_1, V_1, X_2, V_2)^\top \sim \mathcal{N}(\tau_1, \Sigma)$ with $\tau_1 = (1, 1, -1, -1)^\top$, where

$$\Sigma = \begin{pmatrix} 1 & 0.5 & -0.5 & -0.5 \\ 0.5 & 1 & -0.5 & -0.5 \\ -0.5 & -0.5 & 1 & 0.5 \\ -0.5 & -0.5 & 0.5 & 1 \end{pmatrix}$$

- The treatment is generated as a Bernoulli of the propensity score:

$$T \sim \mathcal{B}(e(X)), \qquad e(X) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \beta_2 X_2 - \beta_3 X_3)}$$

With $\beta = (\beta_0, \beta_1, \beta_2, \beta_3) = (0, 0.6, -0.6, 0.6)$.
```python
## Model parameters
XI = np.array([-1, 1, 1])
NU = np.array([0, -1, 1, -1, 2])
BETA = np.array([0, 0.6, -0.6, 0.6])
TAU_0 = np.array([-1, -1, 1, 1])
TAU_1 = TAU_0 * -1
SIGMA = np.array(
  [[1, 0.5, -0.5, -0.5],
   [0.5, 1, -0.5, -0.5],
   [-0.5, -0.5, 1, 0.5],
   [-0.5, -0.5, 0.5, 1]],
  dtype=float,
)


def generate_lunceford(n=100000):
  # Generate data
  x3 = np.random.binomial(1, 0.2, n)
  v3 = np.random.binomial(1, (0.75 * x3 + (0.25 * (1 - x3))), n)

  # If x3=0 you have a model, if x3=1 you have another one
  x1v1x2v2_x3_0_matrix = np.random.multivariate_normal(TAU_0, SIGMA, size=n, check_valid="warn", tol=1e-8)
  x1v1x2v2_x3_1_matrix = np.random.multivariate_normal(TAU_1, SIGMA, size=n, check_valid="warn", tol=1e-8)
  x1v1x2v2_x3 = np.where(
    np.repeat(x3[:, np.newaxis], 4, axis=1) == 0, x1v1x2v2_x3_0_matrix, x1v1x2v2_x3_1_matrix
  )

  # Concatenate values
  xv = np.concatenate([x1v1x2v2_x3, np.expand_dims(x3, axis=1), np.expand_dims(v3, axis=1)], axis=1)

  # Compute e, a, and y
  x = xv[:, [0, 2, 4]]
  v = xv[:, [1, 3, 5]]
  e = np.power(1 + np.exp(-BETA[0] - x.dot(BETA[1:])), -1)
  a = np.random.binomial(1, e, n)
  y = x.dot(NU[1:-1]) + v.dot(XI) + a * NU[-1] + np.random.binomial(1, e, n) + np.random.normal(0, 1, n)

  # Create the final df
  synthetic_data_df = pd.DataFrame(
    np.concatenate([x, np.expand_dims(a, axis=1), v, np.expand_dims(y, axis=1)], axis=1),
    columns=["X1", "X2", "X3", "T", "V1", "V2", "V3", "Y"],
  )
  synthetic_data_df["X3"] = synthetic_data_df["X3"].astype(int)
  synthetic_data_df["V3"] = synthetic_data_df["V3"].astype(int)
  synthetic_data_df["T"] = synthetic_data_df["T"].astype(int)

  return synthetic_data_df


df = generate_lunceford()

df.head()
```

|  | X1 | X2 | X3 | T | V1 | V2 | V3 | Y |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.590248 | 1.503957 | 0 | 1 | -2.037939 | 2.351779 | 0 | 11.303030 |
| 1 | -0.843005 | -0.614757 | 0 | 0 | -0.114829 | -1.676236 | 0 | -0.813697 |
| 2 | 0.128913 | -0.662779 | 0 | 1 | 0.812858 | 0.094382 | 0 | 3.847014 |
| 3 | -2.068048 | -1.775442 | 0 | 1 | -0.301771 | 1.368765 | 0 | 6.962143 |
| 4 | -0.300439 | 1.610470 | 0 | 0 | -0.411213 | 1.610434 | 0 | 4.060268 |
Here, the exact ACE can be explicitly calculated using the previously defined assumptions.
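Indeed, $T$ enters the outcome equation only through the additive term $\nu_4 T$ (the extra Bernoulli term in the simulation code depends on the propensity score $e(X)$ alone), so:

$$\tau = \mathbb{E}[Y \mid do(T=1)] - \mathbb{E}[Y \mid do(T=0)] = \nu_4 = 2$$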
3.2 Setup
Given that we know the data-generating mechanism, we can directly specify the causal structure by constructing a Causal Bayesian Network. To utilize the CausalBNEstimator, we will first employ the discretizer module to determine the domains of the variables.
```python
discretizer = disc.Discretizer(defaultDiscretizationMethod="uniform", defaultNumberOfBins=10)
discretizer.setDiscretizationParameters("Y", "uniform", 60)

bn = discretizer.discretizedTemplate(df)

bn.beginTopologyTransformation()
for _, name in bn:
  if name != "Y":
    bn.addArc(name, "Y")
for X in ["X1", "X2", "X3"]:
  bn.addArc(X, "T")
for XV in ["X1", "V1", "X2", "V2"]:
  bn.addArc("X3", XV)
bn.addArc("X3", "V3")
bn.endTopologyTransformation()

causal_model = csl.CausalModel(bn)

cslnb.showCausalModel(causal_model, size="10")

cee = csl.CausalEffectEstimation(df, causal_model)
```

3.3 Causal Identification
```python
cee.identifyAdjustmentSet(intervention="T", outcome="Y")
```

```
Backdoor adjustment found.
Supported estimators include:
- CausalModelEstimator
- SLearner
- TLearner
- XLearner
- PStratification
- IPW

'Backdoor'
```

3.4 Causal Estimation
```python
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2 * 100)} %")
```

```
ACE linear = 1.4314941416651175, MAPE = 28.425292916744127 %
```

```python
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 2.0089990077766053, MAPE = 0.44995038883026695 %
```

```python
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 2.004685694505701, MAPE = 0.23428472528506106 %
```

```python
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 1.9587788486366562, MAPE = 2.0610575681671905 %
```

3.5 Structure Learning with imposed slice order
We will assess the ability of the BNLearner module to uncover the causal structure of the dataset. A slice order will be imposed on the variables to guide the learning process. Following the identification of the structure via the structure learning algorithm, we will proceed to estimate the causal effect based on the inferred structure.
```python
template = discretizer.discretizedTemplate(df)

structure_learner = gum.BNLearner(df, template)
structure_learner.useNMLCorrection()
structure_learner.useSmoothingPrior(1e-9)
structure_learner.setSliceOrder([["X3", "X1", "X2", "V1", "V2", "V3"], ["T"], ["Y"]])

learned_bn = structure_learner.learnBN()
learned_causal_model = csl.CausalModel(learned_bn)

print(structure_learner)
```

```
Filename       : /var/folders/r1/pj4vdx_n4_d_xpsb04kzf97r0000gp/T/tmp0c981ewm.csv
Size           : (100000,8)
Variables      : X1[10], X2[10], X3[2], T[2], V1[10], V2[10], V3[2], Y[60]
Induced types  : False
Missing values : False
Algorithm      : MIIC
Correction     : NML
Prior          : Smoothing
Prior weight   : 0.000000
Constraint Slice Order : {T:1, X2:0, V3:0, V1:0, Y:2, X3:0, X1:0, V2:0}
```

```python
gnb.sideBySide(gexpl.getInformation(learned_bn, size="50"), gnb.getInference(learned_bn, size="50"))
```

Subsequently, we will reuse the previously established pipeline to estimate the causal effect based on the identified structure.
```python
cee = csl.CausalEffectEstimation(df, learned_causal_model)

cee.identifyAdjustmentSet(intervention="T", outcome="Y")
```

```
Backdoor adjustment found.
Supported estimators include:
- CausalModelEstimator
- SLearner
- TLearner
- XLearner
- PStratification
- IPW

'Backdoor'
```

```python
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2 * 100)} %")
```

```
ACE linear = 1.3734458578000632, MAPE = 31.327707109996837 %
```

```python
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 2.0089990077766053, MAPE = 0.44995038883026695 %
```

```python
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 2.004685694505701, MAPE = 0.23428472528506106 %
```

```python
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 1.9587788486366562, MAPE = 2.0610575681671905 %
```

We observe that the results remain largely consistent when imposing a slice order on the variables. However, the CausalBNEstimator appears to be the most sensitive to structural changes, as it relies on the entire graph structure for its estimations.
3.6 Structure Learning without imposed slice order

Next, we will evaluate the estimations produced when allowing the algorithm to perform structure learning autonomously, without imposing any slice order.
```python
template = discretizer.discretizedTemplate(df)

structure_learner = gum.BNLearner(df, template)
structure_learner.useNMLCorrection()
structure_learner.useSmoothingPrior(1e-9)

learned_bn = structure_learner.learnBN()
learned_causal_model = csl.CausalModel(learned_bn)

print(structure_learner)
```

```
Filename       : /var/folders/r1/pj4vdx_n4_d_xpsb04kzf97r0000gp/T/tmptmbzmfhd.csv
Size           : (100000,8)
Variables      : X1[10], X2[10], X3[2], T[2], V1[10], V2[10], V3[2], Y[60]
Induced types  : False
Missing values : False
Algorithm      : MIIC
Correction     : NML
Prior          : Smoothing
Prior weight   : 0.000000
```

```python
gnb.sideBySide(gexpl.getInformation(learned_bn, size="50"), gnb.getInference(learned_bn, size="50"))
```

We are encountering challenges at the outset, as numerous causal relationships are misspecified, with multiple covariates incorrectly identified as causes of the outcome. Despite this, the intervention variable is correctly identified as a cause of the outcome. With limited expectations, let us now examine the estimation results.
```python
cee = csl.CausalEffectEstimation(df, learned_causal_model)

cee.identifyAdjustmentSet(intervention="T", outcome="Y")
```

```
Randomized Controlled Trial adjustment found.
Supported estimators include:
- CausalModelEstimator
- DM
If the outcome variable is a cause of other covariates in the causal graph,
Backdoor estimators may also be used.

'Randomized Controlled Trial'
```

```python
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2 * 100)} %")
```

```
ACE linear = 0.8832334169386906, MAPE = 55.838329153065466 %
```

```python
cee.fitDM()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = -2.805204329818507, MAPE = 240.26021649092533 %
```

The results are exceptionally poor. In an attempt to address this issue, we will investigate whether controlling for covariates that are identified as causes of the outcome in the causal graph improves the estimation.
```python
cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"V1", "V2"})

cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 1.014594903253707, MAPE = 49.27025483731465 %
```

```python
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 1.0488205198589302, MAPE = 47.55897400705349 %
```

One might assume that the estimation methods are not sufficiently sophisticated, given that the default meta-learners utilize Linear Regression as the base model. To address this, we will assess whether employing XGBRegressor as the base learner within the XLearner framework improves the estimation results.
```python
cee.fitXLearner(learner="XGBRegressor")
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat - 2) / 2) * 100} %")
```

```
ACE linear = 1.0538501050550138, MAPE = 47.307494747249315 %
```

The results show only marginal improvement, insufficient to draw reliable conclusions about the true causal effect.
This underscores the critical role of domain knowledge in accurately specifying the underlying causal structure. Even the most sophisticated estimation methods can falter when the causal structure is incorrectly specified.
