
Learning and causality

import pyagrum as gum
import pyagrum.lib.notebook as gnb

Let's assume a process $X_1 \rightarrow Y_1$, with a control $C_a$ acting on $X_1$ and a parameter $P_b$ acting on $Y_1$.

bn = gum.fastBN("Ca->X1->Y1<-Pb")
bn.cpt("Ca").fillWith([0.8, 0.2])
bn.cpt("Pb").fillWith([0.3, 0.7])
bn.cpt("X1")[:] = [[0.9, 0.1], [0.1, 0.9]]
bn.cpt("Y1")[{"X1": 0, "Pb": 0}] = [0.8, 0.2]
bn.cpt("Y1")[{"X1": 1, "Pb": 0}] = [0.2, 0.8]
bn.cpt("Y1")[{"X1": 0, "Pb": 1}] = [0.6, 0.4]
bn.cpt("Y1")[{"X1": 1, "Pb": 1}] = [0.4, 0.6]
gnb.flow.row(bn, *[bn.cpt(x) for x in bn.nodes()])
Output: the BN Ca -> X1 -> Y1 <- Pb together with its CPTs (columns are the values 0 and 1 of the child variable):

P(Ca)         : [0.8000, 0.2000]
P(Pb)         : [0.3000, 0.7000]
P(X1 | Ca)    : Ca=0: [0.9000, 0.1000] ; Ca=1: [0.1000, 0.9000]
P(Y1 | Pb,X1) : Pb=0,X1=0: [0.8000, 0.2000] ; Pb=0,X1=1: [0.2000, 0.8000] ; Pb=1,X1=0: [0.6000, 0.4000] ; Pb=1,X1=1: [0.4000, 0.6000]
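
As a quick check of this construction, we can run exact inference on the original model and query the marginal of $Y_1$; this distribution will later serve as the reference value for the causal impact of $X_2$ on $Y_1$. A minimal sketch (the name ie_check is introduced here only for this check):

ie_check = gum.LazyPropagation(bn)
ie_check.makeInference()
print(ie_check.posterior("Y1"))  # marginal P(Y1) in the original model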

The process is actually duplicated in the system, but the control $C_a$ and the parameter $P_b$ are shared.

bn.add("X2", 2)
bn.add("Y2", 2)
bn.addArc("X2", "Y2")
bn.addArc("Ca", "X2")
bn.addArc("Pb", "Y2")
bn.cpt("X2").fillWith(bn.cpt("X1"), ["X1", "Ca"]) # copy cpt(X1) with the translation X2<-X1,Ca<-Ca
bn.cpt("Y2").fillWith(bn.cpt("Y1"), ["Y1", "X1", "Pb"]) # copy cpt(Y1) with translation Y2<-Y1,X2<-X1,Pb<-Pb
gnb.flow.row(bn, bn.cpt("X2"), bn.cpt("Y2"))
Output: the extended BN (Ca -> X1, Ca -> X2, X1 -> Y1 <- Pb, X2 -> Y2 <- Pb) and the copied CPTs:

P(X2 | Ca)    : Ca=0: [0.9000, 0.1000] ; Ca=1: [0.1000, 0.9000]
P(Y2 | Pb,X2) : Pb=0,X2=0: [0.8000, 0.2000] ; Pb=0,X2=1: [0.2000, 0.8000] ; Pb=1,X2=0: [0.6000, 0.4000] ; Pb=1,X2=1: [0.4000, 0.6000]
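
To make sure the name translation given to fillWith did what we expect, a small illustrative check is to compare a slice of a copied CPT with the original one (values are read with the same dictionary indexing used above):

print(bn.cpt("X2")[{"Ca": 0}])           # expected [0.9, 0.1], as in bn.cpt("X1")[{"Ca": 0}]
print(bn.cpt("Y2")[{"X2": 1, "Pb": 1}])  # expected [0.4, 0.6], as in bn.cpt("Y1")[{"X1": 1, "Pb": 1}]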

The process is only partially observed: the control is recorded, but the parameter has not been identified and therefore is not collected.

## the databases will be saved as completeData="out/complete_data.csv" and observedData="out/observed_data.csv"
completeData = "out/complete_data.csv"
observedData = "out/observed_data.csv"
## generating complete data with pyAgrum
size = 35000
## gum.generateSample(bn,5000,"data.csv",random_order=True)
generator = gum.BNDatabaseGenerator(bn)
generator.setRandomVarOrder()
generator.drawSamples(size)
generator.toCSV(completeData)
## selecting some variables using pandas
import pandas as pd
f = pd.read_csv(completeData)
keep_col = ["X1", "Y1", "X2", "Y2", "Ca"] # Pb is removed
new_f = f[keep_col]
new_f.to_csv(observedData, index=False)
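
Before going further, we can verify that the parameter Pb is indeed absent from the observed database (a small check with pandas on the file written above):

print(pd.read_csv(observedData).columns.tolist())  # expected: ['X1', 'Y1', 'X2', 'Y2', 'Ca'] -- no 'Pb'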

We will now use the database fixed_observed_data.csv. While both databases originate from the same process (the cell above), using fixed_observed_data.csv instead of observed_data.csv guarantees a deterministic and stable behavior for the rest of the notebook.

fixedObsData = "res/fixed_observed_data.csv"

Using a classical statistical learning method, one can approximate a model from the observed data.

learner = gum.BNLearner(fixedObsData)
learner.useGreedyHillClimbing()
bn2 = learner.learnBN()
bn2
Output: the learned structure bn2, with arcs Ca -> X1, Ca -> X2, X1 -> Y1, X2 -> Y2 and a spurious arc Y2 -> Y1.
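
The learned structure can also be inspected programmatically by listing its arcs by name (a short sketch using the model bn2 learned above). Note the arc between $Y_2$ and $Y_1$: it does not exist in the ground truth and is induced by the hidden parameter $P_b$:

print(sorted((bn2.variable(i).name(), bn2.variable(j).name()) for i, j in bn2.arcs()))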

Using the database, a question for the user is to evaluate the impact of the value of $X_2$ on $Y_1$.

target = "Y1"
evs = "X2"
ie = gum.LazyPropagation(bn)
ie2 = gum.LazyPropagation(bn2)
p1 = ie.evidenceImpact(target, [evs])
p2 = gum.Tensor(p1).fillWith(ie2.evidenceImpact(target, [evs]), [target, evs])
errs = (p1 - p2) / p1
quaderr1 = (errs * errs).sum()
gnb.flow.row(
p1,
p2,
errs,
rf"$${100 * quaderr1:3.5f}\%$$",
captions=["in original model", "in learned model", "relative errors", "quadratic relative error"],
)
Output (columns are Y1=0, Y1=1):

P(Y1 | X2) in original model : X2=0: [0.6211, 0.3789] ; X2=1: [0.4508, 0.5492]
P(Y1 | X2) in learned model  : X2=0: [0.6183, 0.3817] ; X2=1: [0.4415, 0.5585]
relative errors              : X2=0: [0.0044, -0.0072] ; X2=1: [0.0205, -0.0168]
quadratic relative error     : 0.07722%
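
Note that evidenceImpact computes a purely observational quantity: each row of the tables above is the posterior of $Y_1$ given hard evidence on $X_2$. A hedged sketch of the equivalent query for $X_2=1$ in the original model (the name ie_obs is introduced only for this check):

ie_obs = gum.LazyPropagation(bn)
ie_obs.setEvidence({"X2": 1})
ie_obs.makeInference()
print(ie_obs.posterior("Y1"))  # should match the X2=1 row above: [0.4508, 0.5492]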

Evaluating the causal impact of $X_2$ on $Y_1$ with the learned model


The statistician notes that the change the user wants to apply to $X_2$ is not an observation but rather an intervention.

import pyagrum.causal as csl
import pyagrum.causal.notebook as cslnb
model = csl.CausalModel(bn)
model2 = csl.CausalModel(bn2)
cslnb.showCausalModel(model)

Output: the graph of the causal model built from the original BN (no latent variable declared).

gum.config["notebook", "graph_format"] = "svg"
cslnb.showCausalImpact(model, on=target, doing={evs})
cslnb.showCausalImpact(model2, on=target, doing={evs})
Output for the original model:

Causal model : Ca -> X1 -> Y1 <- Pb, Ca -> X2 -> Y2 <- Pb
$$P( Y1 \mid \text{do}(X2)) = \sum_{Ca}{P\left(Y1\mid Ca\right) \cdot P\left(Ca\right)}$$
Explanation: backdoor ['Ca'] found.
Impact P(Y1 | do(X2)) : [0.5768, 0.4232] (the same for both values of X2)

Output for the learned model:

Causal model : Ca -> X1 -> Y1, Ca -> X2 -> Y2 -> Y1
$$P( Y1 \mid \text{do}(X2)) = \sum_{X1}{P\left(Y1\mid X1,X2\right) \cdot P\left(X1\right)}$$
Explanation: backdoor ['X1'] found.
Impact P(Y1 | do(X2)) : X2=0: [0.5743, 0.4257] ; X2=1: [0.5628, 0.4372]
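
The adjustment formula obtained for the original model can be checked by hand: $P(Y_1 \mid \text{do}(X_2)) = \sum_{C_a} P(Y_1 \mid C_a)\,P(C_a)$. A hedged sketch using evidenceImpact and tensor algebra (margSumOut is assumed to sum the listed variable out of the result, as in pyAgrum's Potential/Tensor API):

p_ca = ie.evidenceImpact("Ca", [])               # P(Ca)
p_y1_given_ca = ie.evidenceImpact("Y1", ["Ca"])  # P(Y1 | Ca)
print((p_y1_given_ca * p_ca).margSumOut(["Ca"])) # should match the impact above: [0.5768, 0.4232]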

Unfortunately, because $P_b$ has not been learned, the computation of the causal impact is still imprecise.

_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact2orig, _ = csl.causalImpact(model2, on=target, doing={evs})
impact2 = gum.Tensor(p2).fillWith(impact2orig, ["Y1", "X2"])
errs = (impact1 - impact2) / impact1
quaderr2 = (errs * errs).sum()
gnb.flow.row(
impact1,
impact2,
errs,
rf"$${100 * quaderr2:3.5f}\%$$",
captions=[
r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in original model",
r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in learned model",
" <br/>relative errors",
" <br/>quadratic relative error",
],
)
Output (columns are Y1=0, Y1=1):

$P( Y_1 \mid \hookrightarrow X_2)$ in original model : [0.5768, 0.4232]
$P( Y_1 \mid \hookrightarrow X_2)$ in learned model  : X2=0: [0.5743, 0.4257] ; X2=1: [0.5628, 0.4372]
relative errors                                      : X2=0: [0.0044, -0.0060] ; X2=1: [0.0243, -0.0331]
quadratic relative error                             : 0.17362%

Just to be certain, we can verify that, in the original model, $P(Y_1 \mid \hookrightarrow X_2) = P(Y_1)$: since $X_2$ is not a cause of $Y_1$, intervening on it cannot change the distribution of $Y_1$.

gnb.flow.row(
impact1,
ie.evidenceImpact(target, []),
captions=[r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in the original model", "$P(Y_1)$ <br/>in the original model"],
)
Output (columns are Y1=0, Y1=1):

$P( Y_1 \mid \hookrightarrow X_2)$ in the original model : [0.5768, 0.4232]
$P(Y_1)$ in the original model                           : [0.5768, 0.4232]

Some learning algorithms, such as MIIC (Verny et al., 2017), aim to find traces of latent variables in the data!

learner = gum.BNLearner(fixedObsData)
learner.useMIIC()
bn3 = learner.learnBN()
gnb.flow.row(
bn,
bn3,
f"$${[(bn3.variable(i).name(), bn3.variable(j).name()) for (i, j) in learner.latentVariables()]}$$",
captions=["original model", "learned model", "Latent variables found"],
)
Output:

original model : Ca -> X1 -> Y1 <- Pb, Ca -> X2 -> Y2 <- Pb
learned model  : X1 -> Y1, Y2 -> Y1, X2 -> Y2, X2 -> Ca, Ca -> X1
Latent variables found : [('Y2', 'Y1')]

A latent variable (a common cause) has been found in the data between $Y_1$ and $Y_2$!

Therefore, we can build a causal model that takes this latent variable found by MIIC into account.

model3 = csl.CausalModel(bn2, [("L1", ("Y1", "Y2"))])
cslnb.showCausalImpact(model3, target, {evs})
Output:

Causal model : Ca -> X1 -> Y1, Ca -> X2 -> Y2, L1 -> Y1, L1 -> Y2
$$P( Y1 \mid \text{do}(X2)) = \sum_{X1}{P\left(Y1\mid X1\right) \cdot P\left(X1\right)}$$
Explanation: backdoor ['X1'] found.
Impact P(Y1 | do(X2)) : [0.5725, 0.4275]

With this causal model, the statistician can at least conclude from the data that $X_2$ has no causal impact on $Y_1$: the adjustment formula above does not depend on $X_2$. The remaining error is only due to the estimation of the parameters from the finite database.

_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact3orig, _ = csl.causalImpact(model3, on=target, doing={evs})
impact3 = gum.Tensor(impact1).fillWith(impact3orig, ["Y1"])
errs = (impact1 - impact3) / impact1
quaderr3 = (errs * errs).sum()
gnb.flow.row(
impact1,
impact3,
errs,
rf"$${100 * quaderr3:3.5f}\%$$",
captions=["in original model", "in learned model", "relative errors", "quadratic relative error"],
)
Output (columns are Y1=0, Y1=1):

P(Y1 | do(X2)) in original model : [0.5768, 0.4232]
P(Y1 | do(X2)) in learned model  : [0.5725, 0.4275]
relative errors                  : [0.0075, -0.0102]
quadratic relative error         : 0.01588%

print("In conclusion :")
print(rf"- Error with spurious structure and classical inference : {100 * quaderr1:3.5f}%")
print(rf"- Error with spurious structure and do-calculus : {100 * quaderr2:3.5f}%")
print(rf"- Error with correct causal structure and do-calculus : {100 * quaderr3:3.5f}%")
In conclusion :
- Error with spurious structure and classical inference : 0.07722%
- Error with spurious structure and do-calculus : 0.17362%
- Error with correct causal structure and do-calculus : 0.01588%