import pyagrum as gum
import pyagrum.lib.notebook as gnb
import pandas as pd
Let’s assume a process $X_1 \rightarrow Y_1$ with a control on $X_1$ by $C_a$ and a parameter $P_b$ on $Y_1$.
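Written out directly from these arcs, the joint distribution of this small model factorizes as

$$P(C_a, P_b, X_1, Y_1) = P(C_a)\, P(P_b)\, P(X_1 \mid C_a)\, P(Y_1 \mid X_1, P_b).$$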
bn = gum.fastBN("Ca->X1->Y1<-Pb")

bn.cpt("Ca").fillWith([0.8, 0.2])
bn.cpt("Pb").fillWith([0.3, 0.7])
bn.cpt("X1")[:] = [[0.9, 0.1], [0.1, 0.9]]
bn.cpt("Y1")[{"X1": 0, "Pb": 0}] = [0.8, 0.2]
bn.cpt("Y1")[{"X1": 1, "Pb": 0}] = [0.2, 0.8]
bn.cpt("Y1")[{"X1": 0, "Pb": 1}] = [0.6, 0.4]
bn.cpt("Y1")[{"X1": 1, "Pb": 1}] = [0.4, 0.6]

gnb.flow.row(bn, *[bn.cpt(x) for x in bn.nodes()])
[Figure: the Bayesian network, with arcs Ca→X1, X1→Y1 and Pb→Y1]
P(X1 | Ca):
  Ca |   X1=0   X1=1
   0 | 0.9000 0.1000
   1 | 0.1000 0.9000

P(Y1 | Pb, X1):
  Pb X1 |   Y1=0   Y1=1
   0  0 | 0.8000 0.2000
   0  1 | 0.2000 0.8000
   1  0 | 0.6000 0.4000
   1  1 | 0.4000 0.6000
The process is actually duplicated in the system, but the control $C_a$ and the parameter $P_b$ are shared.
# X2, Y2 and their arcs must exist in the BN before their CPTs can be filled
bn.add("X2", 2); bn.add("Y2", 2)
bn.addArc("Ca", "X2"); bn.addArc("X2", "Y2"); bn.addArc("Pb", "Y2")
bn.cpt("X2").fillWith(bn.cpt("X1"), ["X1", "Ca"])        # copy cpt(X1) with the translation X2<-X1, Ca<-Ca
bn.cpt("Y2").fillWith(bn.cpt("Y1"), ["Y1", "X1", "Pb"])  # copy cpt(Y1) with the translation Y2<-Y1, X2<-X1, Pb<-Pb
gnb.flow.row(bn, bn.cpt("X2"), bn.cpt("Y2"))
[Figure: the extended Bayesian network, with arcs Ca→X1, Ca→X2, X1→Y1, X2→Y2, Pb→Y1 and Pb→Y2]
P(X2 | Ca):
  Ca |   X2=0   X2=1
   0 | 0.9000 0.1000
   1 | 0.1000 0.9000

P(Y2 | Pb, X2):
  Pb X2 |   Y2=0   Y2=1
   0  0 | 0.8000 0.2000
   0  1 | 0.2000 0.8000
   1  0 | 0.6000 0.4000
   1  1 | 0.4000 0.6000
The process is only partially observed: the control $C_a$ has been taken into account, but the parameter $P_b$ has not been identified and therefore is not collected.
## the database will be saved in completeData="out/complete_data.csv", observedData="out/observed_data.csv"
completeData = "out/complete_data.csv"
observedData = "out/observed_data.csv"
size = 5000

## generating complete data with pyAgrum
## gum.generateSample(bn, size, "data.csv", random_order=True)
generator = gum.BNDatabaseGenerator(bn)
generator.setRandomVarOrder()
generator.drawSamples(size)
generator.toCSV(completeData)

## selecting some variables using pandas
f = pd.read_csv(completeData)
keep_col = ["X1", "Y1", "X2", "Y2", "Ca"]  # Pb is removed
new_f = f[keep_col]
new_f.to_csv(observedData, index=False)
We will now use the database fixed_observed_data.csv. While both databases originate from the same process (the cell above), fixed_observed_data.csv is used instead of observed_data.csv in order to guarantee a deterministic and stable behavior for the rest of the notebook.
fixedObsData = "res/fixed_observed_data.csv"
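As an aside, if one prefers to regenerate the data rather than ship a fixed CSV, a similar determinism can be obtained by fixing pyAgrum's random seed before sampling. A minimal sketch, with an arbitrary seed value and an illustrative output file name:

import pyagrum as gum

gum.initRandom(42)  # arbitrary seed: makes the sampling below reproducible across runs
generator = gum.BNDatabaseGenerator(bn)
generator.setRandomVarOrder()
generator.drawSamples(5000)
generator.toCSV("out/seeded_complete_data.csv")  # illustrative file name; Pb would still have to be dropped as above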
Using a classical statistical learning method, one can approximate a model from the observed data.
learner = gum.BNLearner(fixedObsData)
learner.useGreedyHillClimbing()
bn2 = learner.learnBN()
gnb.showBN(bn2)
[Figure: the structure learned from the observed data, with arcs Ca→X1, Ca→X2, X1→Y1, X2→Y2 and Y2→Y1]
Using the database, a question for the user is to evaluate the impact of the value of $X_2$ on $Y_1$.
target = "Y1"
evs = "X2"

ie = gum.LazyPropagation(bn)
ie2 = gum.LazyPropagation(bn2)
p1 = ie.evidenceImpact(target, [evs])
p2 = gum.Tensor(p1).fillWith(ie2.evidenceImpact(target, [evs]), [target, evs])
errs = (p1 - p2) / p1
quaderr1 = (errs * errs).sum()
gnb.flow.row(p1, p2, errs, rf"$${100 * quaderr1:3.5f}\%$$",
             captions=["in original model", "in learned model", "relative errors", "quadratic relative error"])
in original model (P(Y1 | X2)):
  X2 |   Y1=0   Y1=1
   0 | 0.6211 0.3789
   1 | 0.4508 0.5492

in learned model:
  X2 |   Y1=0   Y1=1
   0 | 0.6183 0.3817
   1 | 0.4415 0.5585

relative errors:
  X2 |    Y1=0    Y1=1
   0 |  0.0044 -0.0072
   1 |  0.0205 -0.0168

quadratic relative error: 0.07722%
The statistician notes that the change the user wants to apply to $X_2$ is not an observation but rather an intervention.
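In the true model, $X_2$ influences our belief about $Y_1$ only through the shared control $C_a$: observing $X_2$ propagates along this common-cause path, whereas intervening on $X_2$ cuts the arc $C_a \rightarrow X_2$ and should leave $Y_1$ untouched:

$$P(Y_1 \mid X_2) \neq P(Y_1 \mid \text{do}(X_2)) = P(Y_1).$$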
import pyagrum.causal as csl
import pyagrum.causal.notebook as cslnb

model = csl.CausalModel(bn)
model2 = csl.CausalModel(bn2)
cslnb.showCausalModel(model)

gum.config["notebook", "graph_format"] = "svg"
cslnb.showCausalImpact(model, on=target, doing={evs})
cslnb.showCausalImpact(model2, on=target, doing={evs})
Causal Model
[Figure: the causal model of the original BN, with arcs Ca→X1, Ca→X2, X1→Y1, X2→Y2, Pb→Y1 and Pb→Y2]

$$P( Y1 \mid \text{do}(X2)) = \sum_{Ca}{P\left(Y1\mid Ca\right) \cdot P\left(Ca\right)}$$

Explanation: backdoor ['Ca'] found.
Causal Model
[Figure: the causal model of the learned BN, with arcs Ca→X1, Ca→X2, X1→Y1, X2→Y2 and Y2→Y1]

$$P( Y1 \mid \text{do}(X2)) = \sum_{X1}{P\left(Y1\mid X1,X2\right) \cdot P\left(X1\right)}$$

Explanation: backdoor ['X1'] found.
Impact:
  X2 |   Y1=0   Y1=1
   0 | 0.5743 0.4257
   1 | 0.5628 0.4372
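The adjustment sets quoted in the explanations above can also be queried directly from the causal models. A small sketch, assuming the backDoor helper of pyagrum.causal's CausalModel, queried here with variable names:

# sketch: ask each causal model for a backdoor set between X2 and Y1
# (backDoor(cause, effect, withNames=True) is assumed from the pyagrum.causal API)
print(model.backDoor("X2", "Y1", withNames=True))   # expected, from the explanation above: {'Ca'}
print(model2.backDoor("X2", "Y1", withNames=True))  # expected, from the explanation above: {'X1'}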
Unfortunately, since $P_b$ is not learned, the computation of the causal impact is still imprecise.
_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact2orig, _ = csl.causalImpact(model2, on=target, doing={evs})
impact2 = gum.Tensor(p2).fillWith(impact2orig, ["Y1", "X2"])
errs = (impact1 - impact2) / impact1
quaderr2 = (errs * errs).sum()
gnb.flow.row(impact1, impact2, errs, rf"$${100 * quaderr2:3.5f}\%$$",
             captions=[r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in original model",
                       r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in learned model",
                       "relative errors",
                       "<br/>quadratic relative error"])
$P( Y_1 \mid \hookrightarrow X_2)$ in original model:
  X2 |   Y1=0   Y1=1
   0 | 0.5768 0.4232
   1 | 0.5768 0.4232

$P( Y_1 \mid \hookrightarrow X_2)$ in learned model:
  X2 |   Y1=0   Y1=1
   0 | 0.5743 0.4257
   1 | 0.5628 0.4372

relative errors:
  X2 |    Y1=0    Y1=1
   0 |  0.0044 -0.0060
   1 |  0.0243 -0.0331

quadratic relative error: 0.17362%
Just to be certain, we can verify that in the original model, $P(Y_1 \mid \hookrightarrow X_2) = P(Y_1)$.
gnb.flow.row(impact1, ie.evidenceImpact(target, []),
             captions=[r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in the original model",
                       "$P(Y_1)$ <br/>in the original model"])
[Output: $P( Y_1 \mid \hookrightarrow X_2)$ and $P(Y_1)$ in the original model, displayed side by side; the two tables coincide]
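For reference, this marginal can be recomputed by hand from the CPTs given at the beginning of the notebook (exact values, so they may differ slightly from tables estimated on the sampled database): with $P(X_1=0) = 0.8 \times 0.9 + 0.2 \times 0.1 = 0.74$,

$$P(Y_1=0) = \sum_{X_1,P_b} P(Y_1=0 \mid X_1,P_b)\,P(X_1)\,P(P_b) = 0.3\,(0.74 \cdot 0.8 + 0.26 \cdot 0.2) + 0.7\,(0.74 \cdot 0.6 + 0.26 \cdot 0.4) = 0.5768,$$

so $P(Y_1) \approx (0.5768,\ 0.4232)$, whatever value is forced on $X_2$.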
Some learning algorithms, such as MIIC (Verny et al., 2017), aim to find traces of latent variables in the data!
learner = gum.BNLearner(fixedObsData)
learner.useMIIC()
bn3 = learner.learnBN()
gnb.flow.row(bn, bn3,
             f"$${[(bn3.variable(i).name(), bn3.variable(j).name()) for (i, j) in learner.latentVariables()]}$$",
             captions=["original model", "learned model", "Latent variables found"])
original model:
[Figure: arcs Ca→X1, Ca→X2, X1→Y1, X2→Y2, Pb→Y1 and Pb→Y2]

learned model:
[Figure: arcs Ca→X1, X2→Ca, X1→Y1, X2→Y2 and Y2→Y1]

Latent variables found: [('Y2', 'Y1')]
A latent variable (common cause) has been found in the data between $Y_1$ and $Y_2$! Therefore, we can build a causal model taking this latent variable found by MIIC into account.
# the latent variable L1 is declared as a common (unobserved) cause of Y1 and Y2
model3 = csl.CausalModel(bn2, [("L1", ("Y1", "Y2"))])
cslnb.showCausalImpact(model3, target, {evs})
Causal Model
[Figure: the causal model with the latent variable L1, with arcs L1→Y1, L1→Y2, Ca→X1, Ca→X2, X1→Y1 and X2→Y2]

$$P( Y1 \mid \text{do}(X2)) = \sum_{X1}{P\left(Y1\mid X1\right) \cdot P\left(X1\right)}$$

Explanation: backdoor ['X1'] found.
At least, the statistician can now say from the data that $X_2$ has no impact on $Y_1$. The remaining error is only due to the approximation of the parameters from the database.
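Indeed, the adjustment formula obtained with the corrected structure no longer involves $X_2$ and collapses, by the law of total probability, to the prior on $Y_1$:

$$\sum_{X_1} P(Y_1 \mid X_1)\,P(X_1) = P(Y_1),$$

which is exactly the behavior of the original model under $\text{do}(X_2)$.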
_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact3orig, _ = csl.causalImpact(model3, on=target, doing={evs})
impact3 = gum.Tensor(impact1).fillWith(impact3orig, ["Y1"])
errs = (impact1 - impact3) / impact1
quaderr3 = (errs * errs).sum()
gnb.flow.row(impact1, impact3, errs, rf"$${100 * quaderr3:3.5f}\%$$",
             captions=["in original model", "in learned model", "relative errors", "quadratic relative error"])
quadratic relative error: 0.01588%
print(rf"- Error with spurious structure and classical inference : {100 * quaderr1:3.5f}%")
print(rf"- Error with spurious structure and do-calculus : {100 * quaderr2:3.5f}%")
print(rf"- Error with correct causal structure and do-calculus : {100 * quaderr3:3.5f}%")
- Error with spurious structure and classical inference : 0.07722%
- Error with spurious structure and do-calculus : 0.17362%
- Error with correct causal structure and do-calculus : 0.01588%