Smoking, Cancer and causality

This notebook follows the famous example from Causality (Pearl, 2009).

A correlation has been observed between Smoking and Cancer. It is represented by this Bayesian network:

import pyagrum as gum
import pyagrum.lib.notebook as gnb
import pyagrum.causal as csl
import pyagrum.causal.notebook as cslnb


obs1 = gum.fastBN("Smoking->Cancer")

obs1.cpt("Smoking")[:] = [0.6, 0.4]
obs1.cpt("Cancer")[{"Smoking": 0}] = [0.9, 0.1]
obs1.cpt("Cancer")[{"Smoking": 1}] = [0.7, 0.3]

gnb.flow.row(
  obs1,
  obs1.cpt("Smoking") * obs1.cpt("Cancer"),
  obs1.cpt("Smoking"),
  obs1.cpt("Cancer"),
  captions=["the BN", "the joint distribution", "the marginal for $smoking$", "the CPT for $cancer$"],
)

the BN

	Smoking
Cancer	0	1
0	0.5400	0.2800
1	0.0600	0.1200

the joint distribution

Smoking
0	1
0.6000	0.4000

the marginal for $smoking$

	Cancer
Smoking	0	1
0	0.9000	0.1000
1	0.7000	0.3000

the CPT for $cancer$
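The joint table above is the product of the marginal for Smoking and the CPT for Cancer. As a sanity check that does not rely on pyAgrum, the same numbers can be recomputed with plain dictionaries. This is only a sketch: the names `p_smoking`, `p_cancer_given_smoking` and `joint` are illustrative and do not appear in the notebook.

```python
# Plain-Python check of the joint distribution (illustrative names, not pyAgrum API).
p_smoking = {0: 0.6, 1: 0.4}

# p_cancer_given_smoking[s][c] = P(Cancer=c | Smoking=s), copied from the CPT above
p_cancer_given_smoking = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.7, 1: 0.3},
}

# Chain rule: P(Smoking=s, Cancer=c) = P(Smoking=s) * P(Cancer=c | Smoking=s)
joint = {
    (s, c): p_smoking[s] * p_cancer_given_smoking[s][c]
    for s in (0, 1)
    for c in (0, 1)
}

for (s, c), p in sorted(joint.items()):
    print(f"P(Smoking={s}, Cancer={c}) = {p:.4f}")
```

The four printed values should match the cells of the joint table, and they sum to 1.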

Direct causality between Smoking and Cancer

The very strong observed correlation between smoking and lung cancer suggests a causal relationship, as the Surgeon General asserted in 1964. The proposed model is therefore as follows:

## the Bayesian network is causal
modele1 = csl.CausalModel(obs1)

cslnb.showCausalImpact(modele1, "Cancer", "Smoking", values={"Smoking": 1})

Causal Model

\begin{equation*}P( Cancer \mid \text{do}(Smoking)) = P\left(Cancer\mid Smoking\right)\end{equation*}

Explanation : Do-calculus computations

Cancer
0	1
0.7000	0.3000

Impact

Latent confounder between Smoking and Cancer

This model is strongly contested by the tobacco industry, which responds by proposing a different model in which Smoking and Cancer are both caused by a common factor, the Genotype (or some other latent variable):

## a latent variable exists between Smoking and Cancer in the causal model
modele2 = csl.CausalModel(obs1, [("Genotype", ["Smoking", "Cancer"])])

cslnb.showCausalImpact(modele2, "Cancer", "Smoking", values={"Smoking": 1})

Causal Model

\begin{equation*}P( Cancer \mid \text{do}(Smoking)) = P\left(Cancer\right)\end{equation*}

Explanation : No causal effect of X on Y, because they are d-separated (conditioning on the observed variables if any).

Cancer
0	1
0.8200	0.1800

Impact

## just check P(Cancer) in the bn `obs1`
(obs1.cpt("Smoking") * obs1.cpt("Cancer")).sumIn(["Cancer"])

Cancer
0	1
0.8200	0.1800
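To make the contrast between the first two causal models concrete, the sketch below recomputes both answers by hand from the same CPTs. Under the direct-causality model the effect of do(Smoking=1) is P(Cancer | Smoking=1); under the latent-confounder model it collapses to the marginal P(Cancer). The variable names are illustrative assumptions and are not part of the notebook.

```python
# Plain-Python comparison of the two causal readings (illustrative names).
p_smoking = {0: 0.6, 1: 0.4}
p_cancer_given_smoking = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.7, 1: 0.3},
}

# Marginal P(Cancer=c) = sum over s of P(Smoking=s) * P(Cancer=c | Smoking=s)
p_cancer = {
    c: sum(p_smoking[s] * p_cancer_given_smoking[s][c] for s in (0, 1))
    for c in (0, 1)
}

# Direct-causality model: P(Cancer | do(Smoking=1)) = P(Cancer | Smoking=1)
effect_if_causal = p_cancer_given_smoking[1]

# Latent-confounder model: P(Cancer | do(Smoking=1)) = P(Cancer)
effect_if_confounded = p_cancer

print(f"causal model:     P(Cancer=1 | do(Smoking=1)) = {effect_if_causal[1]:.4f}")
print(f"confounded model: P(Cancer=1 | do(Smoking=1)) = {effect_if_confounded[1]:.4f}")
```

The same observational data therefore yield 0.30 or 0.18 depending on which causal structure is assumed.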

Confounder and direct causality

In a diplomatic effort, both parties agree that there must be some truth in both models:

## a latent variable exists between Smoking and Cancer but the direct causal relation exists also
modele3 = csl.CausalModel(obs1, [("Genotype", ["Smoking", "Cancer"])], True)

cslnb.showCausalImpact(modele3, "Cancer", "Smoking", values={"Smoking": 1})

Causal Model

Hedge Error: G={'Smoking', 'Cancer'}, G[S]={'Cancer'}
Impossible

No result
Impact

In such a model, the causal effect of Smoking on Cancer can no longer be computed from observational data: the observations alone cannot separate the direct effect of Smoking from the influence of the latent Genotype.

An intermediary observed variable

We introduce an auxiliary factor between Smoking and Cancer: tobacco causes cancer because of the tar deposits in the lungs.

obs2 = gum.fastBN("Smoking->Tar->Cancer;Smoking->Cancer")

obs2.cpt("Smoking")[:] = [0.6, 0.4]
obs2.cpt("Tar")[{"Smoking": 0}] = [0.9, 0.1]
obs2.cpt("Tar")[{"Smoking": 1}] = [0.7, 0.3]
obs2.cpt("Cancer")[{"Tar": 0, "Smoking": 0}] = [0.9, 0.1]
obs2.cpt("Cancer")[{"Tar": 1, "Smoking": 0}] = [0.8, 0.2]
obs2.cpt("Cancer")[{"Tar": 0, "Smoking": 1}] = [0.7, 0.3]
obs2.cpt("Cancer")[{"Tar": 1, "Smoking": 1}] = [0.6, 0.4]

gnb.flow.row(
  obs2,
  obs2.cpt("Smoking"),
  obs2.cpt("Tar"),
  obs2.cpt("Cancer"),
  captions=["", "$P(Smoking)$", "$P(Tar|Smoking)$", "$P(Cancer|Tar,Smoking)$"],
)

Smoking
0	1
0.6000	0.4000

$P(Smoking)$

	Tar
Smoking	0	1
0	0.9000	0.1000
1	0.7000	0.3000

$P(Tar|Smoking)$

		Cancer
Smoking	Tar	0	1
0	0	0.9000	0.1000
0	1	0.8000	0.2000
1	0	0.7000	0.3000
1	1	0.6000	0.4000

$P(Cancer|Tar,Smoking)$

modele4 = csl.CausalModel(obs2, [("Genotype", ["Smoking", "Cancer"])])

cslnb.showCausalModel(modele4)


cslnb.showCausalImpact(modele4, "Cancer", "Smoking", values={"Smoking": 1})

Causal Model

\begin{equation*}P( Cancer \mid \text{do}(Smoking)) = \sum_{Tar}{P\left(Tar\mid Smoking\right) \cdot \left(\sum_{Smoking'}{P\left(Cancer\mid Smoking',Tar\right) \cdot P\left(Smoking'\right)}\right)}\end{equation*}

Explanation : frontdoor [‘Tar’] found.

Cancer
0	1
0.7900	0.2100

Impact

In this model, we are again able to compute the causal impact of Smoking on Cancer, because Tar satisfies the front-door criterion relative to the pair (Smoking, Cancer).

## just check P(Cancer|do(smoking)) in the bn `obs2`
((obs2.cpt("Cancer") * obs2.cpt("Smoking")).sumOut(["Smoking"]) * obs2.cpt("Tar")).sumOut(["Tar"]).putFirst("Cancer")

	Cancer
Smoking	0	1
0	0.8100	0.1900
1	0.7900	0.2100
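The front-door formula reported for this model can also be evaluated with explicit loops, which makes the two nested sums easier to follow. The following is a minimal sketch using the CPT values of `obs2`; the dictionaries and the helper `front_door` are hypothetical names introduced here for illustration, not pyAgrum functions.

```python
# Hand evaluation of the front-door adjustment (illustrative names, not pyAgrum API):
#   P(Cancer | do(Smoking=s)) = sum_t P(Tar=t | Smoking=s) * sum_s' P(Cancer | s', t) * P(s')
p_smoking = {0: 0.6, 1: 0.4}

# p_tar_given_smoking[s][t] = P(Tar=t | Smoking=s)
p_tar_given_smoking = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.7, 1: 0.3},
}

# p_cancer_given[(s, t)][c] = P(Cancer=c | Smoking=s, Tar=t)
p_cancer_given = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.8, 1: 0.2},
    (1, 0): {0: 0.7, 1: 0.3},
    (1, 1): {0: 0.6, 1: 0.4},
}


def front_door(s_do):
    """Return {c: P(Cancer=c | do(Smoking=s_do))} using the front-door formula."""
    result = {}
    for c in (0, 1):
        total = 0.0
        for t in (0, 1):
            # Inner sum: average P(Cancer=c | s', t) over the prior of Smoking
            inner = sum(p_cancer_given[(s2, t)][c] * p_smoking[s2] for s2 in (0, 1))
            total += p_tar_given_smoking[s_do][t] * inner
        result[c] = total
    return result


for s in (0, 1):
    effect = front_door(s)
    print(f"do(Smoking={s}): P(Cancer=0)={effect[0]:.4f}  P(Cancer=1)={effect[1]:.4f}")
```

The two printed rows should agree with the table obtained from the tensor expression above.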

Other causal impacts for this last model

cslnb.showCausalImpact(modele4, "Smoking", doing="Cancer", knowing={"Tar"}, values={"Cancer": 1, "Tar": 1})

Causal Model

\begin{equation*}P( Smoking \mid \text{do}(Cancer), Tar) = P\left(Smoking\mid Tar\right)\end{equation*}

Explanation : No causal effect of X on Y, because they are d-separated (conditioning on the observed variables if any).

Smoking
0	1
0.3333	0.6667

Impact
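Since the result above reduces to the purely observational quantity P(Smoking | Tar), it can be checked with Bayes' rule on the CPTs of `obs2`. The sketch below does this for Tar=1; the names `unnormalised` and `posterior` are illustrative assumptions.

```python
# Bayes' rule check of P(Smoking | Tar=1) (illustrative names, not pyAgrum API).
p_smoking = {0: 0.6, 1: 0.4}

# p_tar_given_smoking[s][t] = P(Tar=t | Smoking=s)
p_tar_given_smoking = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.7, 1: 0.3},
}

tar = 1  # observed value of Tar

# Numerator of Bayes' rule: P(Smoking=s) * P(Tar=tar | Smoking=s)
unnormalised = {s: p_smoking[s] * p_tar_given_smoking[s][tar] for s in (0, 1)}

# Denominator: P(Tar=tar)
evidence = sum(unnormalised.values())

posterior = {s: value / evidence for s, value in unnormalised.items()}

for s in (0, 1):
    print(f"P(Smoking={s} | Tar={tar}) = {posterior[s]:.4f}")
```

Intervening on Cancer does not change this distribution, which is why the causal query and the observational posterior coincide here.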

cslnb.showCausalImpact(modele4, "Smoking", doing="Cancer", values={"Cancer": 1})

Causal Model

\begin{equation*}P( Smoking \mid \text{do}(Cancer)) = P\left(Smoking\right)\end{equation*}

Explanation : Do-calculus computations

Smoking
0	1
0.6000	0.4000

Impact

cslnb.showCausalImpact(modele4, "Smoking", doing={"Cancer", "Tar"}, values={"Cancer": 1, "Tar": 1})

Causal Model

\begin{equation*}P( Smoking \mid \text{do}(Tar),\text{do}(Cancer)) = P\left(Smoking\right)\end{equation*}

Explanation : Do-calculus computations

Smoking
0	1
0.6000	0.4000

Impact

cslnb.showCausalImpact(modele4, "Tar", doing={"Cancer", "Smoking"}, values={"Cancer": 1, "Smoking": 1})

Causal Model

\begin{equation*}P( Tar \mid \text{do}(Cancer),\text{do}(Smoking)) = P\left(Tar\mid Smoking\right)\end{equation*}

Explanation : Do-calculus computations

Tar
0	1
0.7000	0.3000

Impact

Four causal models for the same observational data

gnb.sideBySide(
  modele1,
  csl.causalImpact(modele1, on="Cancer", doing="Smoking")[0],
  modele2,
  csl.causalImpact(modele2, on="Cancer", doing="Smoking")[0],
  modele3,
  csl.causalImpact(modele3, on="Cancer", doing="Smoking")[0],
  modele4,
  csl.causalImpact(modele4, on="Cancer", doing="Smoking")[0],
  ncols=2,
)

	$P( Cancer \mid \text{do}(Smoking)) = P\left(Cancer\mid Smoking\right)$
	$P( Cancer \mid \text{do}(Smoking)) = P\left(Cancer\right)$
	None
	$P( Cancer \mid \text{do}(Smoking)) = \sum_{Tar}{P\left(Tar\mid Smoking\right) \cdot \left(\sum_{Smoking'}{P\left(Cancer\mid Smoking',Tar\right) \cdot P\left(Smoking'\right)}\right)}$