
Learning and causality

import pyagrum as gum
import pyagrum.lib.notebook as gnb

Let's assume a process $X_1 \rightarrow Y_1$, with a control $C_a$ acting on $X_1$ and a parameter $P_b$ acting on $Y_1$.

bn = gum.fastBN("Ca->X1->Y1<-Pb")
bn.cpt("Ca").fillWith([0.8, 0.2])
bn.cpt("Pb").fillWith([0.3, 0.7])
bn.cpt("X1")[:] = [[0.9, 0.1], [0.1, 0.9]]
bn.cpt("Y1")[{"X1": 0, "Pb": 0}] = [0.8, 0.2]
bn.cpt("Y1")[{"X1": 1, "Pb": 0}] = [0.2, 0.8]
bn.cpt("Y1")[{"X1": 0, "Pb": 1}] = [0.6, 0.4]
bn.cpt("Y1")[{"X1": 1, "Pb": 1}] = [0.4, 0.6]
gnb.flow.row(bn, *[bn.cpt(x) for x in bn.nodes()])
Output: the BN Ca -> X1 -> Y1 <- Pb together with its CPTs (columns are the values 0 and 1 of the child variable):

P(Ca)         : [0.8000, 0.2000]
P(Pb)         : [0.3000, 0.7000]
P(X1 | Ca)    : Ca=0: [0.9000, 0.1000] ; Ca=1: [0.1000, 0.9000]
P(Y1 | Pb,X1) : Pb=0,X1=0: [0.8000, 0.2000] ; Pb=0,X1=1: [0.2000, 0.8000] ; Pb=1,X1=0: [0.6000, 0.4000] ; Pb=1,X1=1: [0.4000, 0.6000]
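
As a quick check of this construction, we can run exact inference on the original model and query the marginal of $Y_1$; this distribution will later serve as the reference value for the causal impact of $X_2$ on $Y_1$. A minimal sketch (the name ie_check is introduced here only for this check):

ie_check = gum.LazyPropagation(bn)
ie_check.makeInference()
print(ie_check.posterior("Y1"))  # marginal P(Y1) in the original model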

The process is actually duplicated in the system, but the control $C_a$ and the parameter $P_b$ are shared.

bn.add("X2", 2)
bn.add("Y2", 2)
bn.addArc("X2", "Y2")
bn.addArc("Ca", "X2")
bn.addArc("Pb", "Y2")
bn.cpt("X2").fillWith(bn.cpt("X1"), ["X1", "Ca"]) # copy cpt(X1) with the translation X2<-X1,Ca<-Ca
bn.cpt("Y2").fillWith(bn.cpt("Y1"), ["Y1", "X1", "Pb"]) # copy cpt(Y1) with translation Y2<-Y1,X2<-X1,Pb<-Pb
gnb.flow.row(bn, bn.cpt("X2"), bn.cpt("Y2"))
Output: the extended BN (Ca -> X1, Ca -> X2, X1 -> Y1 <- Pb, X2 -> Y2 <- Pb) and the copied CPTs:

P(X2 | Ca)    : Ca=0: [0.9000, 0.1000] ; Ca=1: [0.1000, 0.9000]
P(Y2 | Pb,X2) : Pb=0,X2=0: [0.8000, 0.2000] ; Pb=0,X2=1: [0.2000, 0.8000] ; Pb=1,X2=0: [0.6000, 0.4000] ; Pb=1,X2=1: [0.4000, 0.6000]
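
To make sure the name translation given to fillWith did what we expect, a small illustrative check is to compare a slice of a copied CPT with the original one (values are read with the same dictionary indexing used above):

print(bn.cpt("X2")[{"Ca": 0}])           # expected [0.9, 0.1], as in bn.cpt("X1")[{"Ca": 0}]
print(bn.cpt("Y2")[{"X2": 1, "Pb": 1}])  # expected [0.4, 0.6], as in bn.cpt("Y1")[{"X1": 1, "Pb": 1}]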

The process is only partially observed: the control is recorded, but the parameter has not been identified and therefore is not collected.

## the databases will be saved as completeData="out/complete_data.csv" and observedData="out/observed_data.csv"
completeData = "out/complete_data.csv"
observedData = "out/observed_data.csv"
## generating complete data with pyAgrum
size = 35000
## gum.generateSample(bn,5000,"data.csv",random_order=True)
generator = gum.BNDatabaseGenerator(bn)
generator.setRandomVarOrder()
generator.drawSamples(size)
generator.toCSV(completeData)
## selecting some variables using pandas
import pandas as pd
f = pd.read_csv(completeData)
keep_col = ["X1", "Y1", "X2", "Y2", "Ca"] # Pb is removed
new_f = f[keep_col]
new_f.to_csv(observedData, index=False)
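
Before going further, we can verify that the parameter Pb is indeed absent from the observed database (a small check with pandas on the file written above):

print(pd.read_csv(observedData).columns.tolist())  # expected: ['X1', 'Y1', 'X2', 'Y2', 'Ca'] -- no 'Pb'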

We will now use the database fixed_observed_data.csv. While both databases originate from the same process (the cell above), using fixed_observed_data.csv instead of observed_data.csv guarantees a deterministic and stable behavior for the rest of the notebook.

fixedObsData = "res/fixed_observed_data.csv"

Using a classical statistical learning method, one can approximate a model from the observed data.

learner = gum.BNLearner(fixedObsData)
learner.useGreedyHillClimbing()
bn2 = learner.learnBN()
bn2
Output: the learned structure bn2, with arcs Ca -> X1, Ca -> X2, X1 -> Y1, X2 -> Y2 and a spurious arc Y2 -> Y1.
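
The learned structure can also be inspected programmatically by listing its arcs by name (a short sketch using the model bn2 learned above). Note the arc between $Y_2$ and $Y_1$: it does not exist in the ground truth and is induced by the hidden parameter $P_b$:

print(sorted((bn2.variable(i).name(), bn2.variable(j).name()) for i, j in bn2.arcs()))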

Using the database, a question for the user is to evaluate the impact of the value of $X_2$ on $Y_1$.

target = "Y1"
evs = "X2"
ie = gum.LazyPropagation(bn)
ie2 = gum.LazyPropagation(bn2)
p1 = ie.evidenceImpact(target, [evs])
p2 = gum.Tensor(p1).fillWith(ie2.evidenceImpact(target, [evs]), [target, evs])
errs = (p1 - p2) / p1
quaderr1 = (errs * errs).sum()
gnb.flow.row(
p1,
p2,
errs,
rf"$${100 * quaderr1:3.5f}\%$$",
captions=["in original model", "in learned model", "relative errors", "quadratic relative error"],
)
Output (columns are Y1=0, Y1=1):

P(Y1 | X2) in original model : X2=0: [0.6211, 0.3789] ; X2=1: [0.4508, 0.5492]
P(Y1 | X2) in learned model  : X2=0: [0.6183, 0.3817] ; X2=1: [0.4415, 0.5585]
relative errors              : X2=0: [0.0044, -0.0072] ; X2=1: [0.0205, -0.0168]
quadratic relative error     : 0.07722%
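
Note that evidenceImpact computes a purely observational quantity: each row of the tables above is the posterior of $Y_1$ given hard evidence on $X_2$. A hedged sketch of the equivalent query for $X_2=1$ in the original model (the name ie_obs is introduced only for this check):

ie_obs = gum.LazyPropagation(bn)
ie_obs.setEvidence({"X2": 1})
ie_obs.makeInference()
print(ie_obs.posterior("Y1"))  # should match the X2=1 row above: [0.4508, 0.5492]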

Evaluating the causal impact of $X_2$ on $Y_1$ with the learned model


The statistician notes that the change the user wants to apply to $X_2$ is not an observation but rather an intervention.

import pyagrum.causal as csl
import pyagrum.causal.notebook as cslnb
model = csl.CausalModel(bn)
model2 = csl.CausalModel(bn2)
cslnb.showCausalModel(model)

Output: the graph of the causal model built from the original BN (no latent variable declared).

gum.config["notebook", "graph_format"] = "svg"
cslnb.showCausalImpact(model, on=target, doing={evs})
cslnb.showCausalImpact(model2, on=target, doing={evs})
Output for the original model:

Causal model : Ca -> X1 -> Y1 <- Pb, Ca -> X2 -> Y2 <- Pb
$$P( Y1 \mid \text{do}(X2)) = \sum_{Ca}{P\left(Y1\mid Ca\right) \cdot P\left(Ca\right)}$$
Explanation: backdoor ['Ca'] found.
Impact P(Y1 | do(X2)) : [0.5768, 0.4232] (the same for both values of X2)

Output for the learned model:

Causal model : Ca -> X1 -> Y1, Ca -> X2 -> Y2 -> Y1
$$P( Y1 \mid \text{do}(X2)) = \sum_{X1}{P\left(Y1\mid X1,X2\right) \cdot P\left(X1\right)}$$
Explanation: backdoor ['X1'] found.
Impact P(Y1 | do(X2)) : X2=0: [0.5743, 0.4257] ; X2=1: [0.5628, 0.4372]
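
The adjustment formula obtained for the original model can be checked by hand: $P(Y_1 \mid \text{do}(X_2)) = \sum_{C_a} P(Y_1 \mid C_a)\,P(C_a)$. A hedged sketch using evidenceImpact and tensor algebra (margSumOut is assumed to sum the listed variable out of the result, as in pyAgrum's Potential/Tensor API):

p_ca = ie.evidenceImpact("Ca", [])               # P(Ca)
p_y1_given_ca = ie.evidenceImpact("Y1", ["Ca"])  # P(Y1 | Ca)
print((p_y1_given_ca * p_ca).margSumOut(["Ca"])) # should match the impact above: [0.5768, 0.4232]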

Unfortunately, because $P_b$ has not been learned, the computation of the causal impact is still imprecise.

_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact2orig, _ = csl.causalImpact(model2, on=target, doing={evs})
impact2 = gum.Tensor(p2).fillWith(impact2orig, ["Y1", "X2"])
errs = (impact1 - impact2) / impact1
quaderr2 = (errs * errs).sum()
gnb.flow.row(
impact1,
impact2,
errs,
rf"$${100 * quaderr2:3.5f}\%$$",
captions=[
r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in original model",
r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in learned model",
" <br/>relative errors",
" <br/>quadratic relative error",
],
)
Output (columns are Y1=0, Y1=1):

$P( Y_1 \mid \hookrightarrow X_2)$ in original model : [0.5768, 0.4232]
$P( Y_1 \mid \hookrightarrow X_2)$ in learned model  : X2=0: [0.5743, 0.4257] ; X2=1: [0.5628, 0.4372]
relative errors                                      : X2=0: [0.0044, -0.0060] ; X2=1: [0.0243, -0.0331]
quadratic relative error                             : 0.17362%

Just to be certain, we can verify that, in the original model, $P(Y_1 \mid \hookrightarrow X_2) = P(Y_1)$: since $X_2$ is not a cause of $Y_1$, intervening on it cannot change the distribution of $Y_1$.

gnb.flow.row(
impact1,
ie.evidenceImpact(target, []),
captions=[r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in the original model", "$P(Y_1)$ <br/>in the original model"],
)
Output (columns are Y1=0, Y1=1):

$P( Y_1 \mid \hookrightarrow X_2)$ in the original model : [0.5768, 0.4232]
$P(Y_1)$ in the original model                           : [0.5768, 0.4232]

Some learning algorithms, such as MIIC (Verny et al., 2017), aim to find traces of latent variables in the data!

learner = gum.BNLearner(fixedObsData)
learner.useMIIC()
bn3 = learner.learnBN()
gnb.flow.row(
bn,
bn3,
f"$${[(bn3.variable(i).name(), bn3.variable(j).name()) for (i, j) in learner.latentVariables()]}$$",
captions=["original model", "learned model", "Latent variables found"],
)
Output:

original model : Ca -> X1 -> Y1 <- Pb, Ca -> X2 -> Y2 <- Pb
learned model  : X1 -> Y1, Y2 -> Y1, X2 -> Y2, X2 -> Ca, Ca -> X1
Latent variables found : [('Y2', 'Y1')]

A latent variable (a common cause) has been found in the data between $Y_1$ and $Y_2$!

Therefore, we can build a causal model that takes this latent variable found by MIIC into account.

model3 = csl.CausalModel(bn2, [("L1", ("Y1", "Y2"))])
cslnb.showCausalImpact(model3, target, {evs})
Output:

Causal model : Ca -> X1 -> Y1, Ca -> X2 -> Y2, L1 -> Y1, L1 -> Y2
$$P( Y1 \mid \text{do}(X2)) = \sum_{X1}{P\left(Y1\mid X1\right) \cdot P\left(X1\right)}$$
Explanation: backdoor ['X1'] found.
Impact P(Y1 | do(X2)) : [0.5725, 0.4275]

With this causal model, the statistician can at least conclude from the data that $X_2$ has no causal impact on $Y_1$: the adjustment formula above does not depend on $X_2$. The remaining error is only due to the estimation of the parameters from the finite database.

_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact3orig, _ = csl.causalImpact(model3, on=target, doing={evs})
impact3 = gum.Tensor(impact1).fillWith(impact3orig, ["Y1"])
errs = (impact1 - impact3) / impact1
quaderr3 = (errs * errs).sum()
gnb.flow.row(
impact1,
impact3,
errs,
rf"$${100 * quaderr3:3.5f}\%$$",
captions=["in original model", "in learned model", "relative errors", "quadratic relative error"],
)
Output (columns are Y1=0, Y1=1):

P(Y1 | do(X2)) in original model : [0.5768, 0.4232]
P(Y1 | do(X2)) in learned model  : [0.5725, 0.4275]
relative errors                  : [0.0075, -0.0102]
quadratic relative error         : 0.01588%

print("In conclusion :")
print(rf"- Error with spurious structure and classical inference : {100 * quaderr1:3.5f}%")
print(rf"- Error with spurious structure and do-calculus : {100 * quaderr2:3.5f}%")
print(rf"- Error with correct causal structure and do-calculus : {100 * quaderr3:3.5f}%")
In conclusion :
- Error with spurious structure and classical inference : 0.07722%
- Error with spurious structure and do-calculus : 0.17362%
- Error with correct causal structure and do-calculus : 0.01588%