Simpson's Paradox

This notebook follows the famous example from Causality (Pearl, 2009).

import pyagrum as gum
import pyagrum.lib.notebook as gnb
import pyagrum.causal as csl
import pyagrum.causal.notebook as cslnb

In a statistical study of a drug, we want to evaluate its efficacy in a population of men and women. We use the following variables:

$Drug$ : drug taking
$Patient$ : cured patient
$Gender$ : patient’s gender

The model built from the observed data is as follows:

m1 = gum.fastBN("Gender{F|M}->Drug{Without|With}->Patient{Sick|Healed}<-Gender")

m1.cpt("Gender")[:] = [0.5, 0.5]
m1.cpt("Drug")[:] = [
  [0.25, 0.75],  # Gender=F
  [0.75, 0.25],  # Gender=M
]

m1.cpt("Patient")[{"Drug": "Without", "Gender": "F"}] = [0.2, 0.8]  # No drug, female -> healed in 0.8 of cases
m1.cpt("Patient")[{"Drug": "Without", "Gender": "M"}] = [0.6, 0.4]  # No drug, male -> healed in 0.4 of cases
m1.cpt("Patient")[{"Drug": "With", "Gender": "F"}] = [0.3, 0.7]  # Drug, female -> healed in 0.7 of cases
m1.cpt("Patient")[{"Drug": "With", "Gender": "M"}] = [0.8, 0.2]  # Drug, male -> healed in 0.2 of cases
gnb.flow.row(m1, m1.cpt("Gender"), m1.cpt("Drug"), m1.cpt("Patient"))

Gender
F	M
0.5000	0.5000

	Drug
Gender	Without	With
F	0.2500	0.7500
M	0.7500	0.2500

		Patient
Gender	Drug	Sick	Healed
F	Without	0.2000	0.8000
F	With	0.3000	0.7000
M	Without	0.6000	0.4000
M	With	0.8000	0.2000

def getCuredObservedProba(m1, evs):
  evs0 = dict(evs)
  evs1 = dict(evs)
  evs0["Drug"] = "Without"
  evs1["Drug"] = "With"

  return (
    gum.Tensor()
    .add(m1["Drug"])
    .fillWith(
      [gum.getPosterior(m1, target="Patient", evs=evs0)[1], gum.getPosterior(m1, target="Patient", evs=evs1)[1]]
    )
  )


gnb.sideBySide(
  getCuredObservedProba(m1, {}),
  getCuredObservedProba(m1, {"Gender": "F"}),
  getCuredObservedProba(m1, {"Gender": "M"}),
  captions=[
    r"$P(Patient = Healed \mid Drug )$<br/>Taking $Drug$ is observed as efficient to cure",
    r"$P(Patient = Healed \mid Gender=F,Drug)$<br/>except if the $gender$ of the patient is female",
    r"$P(Patient = Healed \mid Gender=M,Drug)$<br/>... or male.",
  ],
)

Drug
Without	With
0.5000	0.5750

$P(Patient = Healed \mid Drug )$
Taking $Drug$ is observed as efficient to cure

Drug
Without	With
0.8000	0.7000

$P(Patient = Healed \mid Gender=F,Drug)$
except if the $gender$ of the patient is female

Drug
Without	With
0.4000	0.2000

$P(Patient = Healed \mid Gender=M,Drug)$
... or male.

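These results form a paradox known as Simpson's paradox: over the whole population the drug looks beneficial, yet it looks harmful within each gender.

$P(Patient = Healed \mid Drug = Without) = 0.5 < P(Patient = Healed \mid Drug = With) = 0.575$

$P(Patient = Healed \mid Drug = Without, Gender = F) = 0.8 > P(Patient = Healed \mid Drug = With, Gender = F) = 0.7$

$P(Patient = Healed \mid Drug = Without, Gender = M) = 0.4 > P(Patient = Healed \mid Drug = With, Gender = M) = 0.2$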
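The aggregate observational figures can be reproduced by hand, which makes it visible where the reversal comes from. The following is a minimal sketch in plain Python (no pyAgrum), using only the numbers from the CPTs of `m1`; the names `p_gender`, `p_with_given_gender`, `p_healed` and `observed_healed` are illustrative and not part of any library.

```python
# Plain-Python check of P(Patient=Healed | Drug), using the CPT values of m1.
p_gender = {"F": 0.5, "M": 0.5}               # P(Gender)
p_with_given_gender = {"F": 0.75, "M": 0.25}  # P(Drug=With | Gender)
p_healed = {                                  # P(Patient=Healed | Drug, Gender)
    ("Without", "F"): 0.8, ("With", "F"): 0.7,
    ("Without", "M"): 0.4, ("With", "M"): 0.2,
}


def p_drug_given_gender(drug, gender):
    """P(Drug=drug | Gender=gender) for the two-valued Drug variable."""
    p_with = p_with_given_gender[gender]
    return p_with if drug == "With" else 1.0 - p_with


def observed_healed(drug):
    """P(Healed | Drug) = sum_g P(Healed | Drug, g) * P(g | Drug)."""
    # Joint P(Drug=drug, Gender=g), then normalise to get P(g | Drug=drug).
    joint = {g: p_drug_given_gender(drug, g) * p_gender[g] for g in p_gender}
    p_drug = sum(joint.values())
    return sum(p_healed[(drug, g)] * joint[g] / p_drug for g in p_gender)


observed = {drug: observed_healed(drug) for drug in ("Without", "With")}
print({drug: round(p, 4) for drug, p in observed.items()})
# prints {'Without': 0.5, 'With': 0.575}
```

The weights $P(Gender \mid Drug)$ are the key: among patients who take the drug, 75% are women, who heal more often regardless of treatment, so the treated group looks better overall.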

Actually, giving the drug is not an observation in our model but rather an intervention. What happens if we reason with interventions instead of observations?

How to compute causal impacts on the patient’s health?

We propose this causal model.

d1 = csl.CausalModel(m1)
cslnb.showCausalModel(d1)


Computing $P (Patient = Healed \mid \text{do}(Drug = Without))$

cslnb.showCausalImpact(d1, "Patient", doing="Drug", values={"Drug": "Without"})

Causal Model

\begin{equation*}P( Patient \mid \text{do}(Drug)) = \sum_{Gender}{P\left(Patient\mid Drug,Gender\right) \cdot P\left(Gender\right)}\end{equation*}

Explanation : backdoor [‘Gender’] found.

Patient
Sick	Healed
0.4000	0.6000

Impact

We have $P(Patient = Healed \mid \text{do}(Drug = Without)) = 0.6$.

Computing $P (Patient = Healed \mid \text{do}(Drug = With))$

d1 = csl.CausalModel(m1)
cslnb.showCausalImpact(d1, "Patient", "Drug", values={"Drug": "With"})

Causal Model

\begin{equation*}P( Patient \mid \text{do}(Drug)) = \sum_{Gender}{P\left(Patient\mid Drug,Gender\right) \cdot P\left(Gender\right)}\end{equation*}

Explanation : backdoor [‘Gender’] found.

Patient
Sick	Healed
0.5500	0.4500

Impact

Similarly, $P(Patient = Healed \mid \text{do}(Drug = With)) = 0.45$.

Therefore: $P(Patient = Healed\mid \text{do}(Drug = Without)) = 0.6 > P(Patient = Healed\mid \text{do}(Drug = With)) = 0.45$

This means that taking the drug does not improve the patient’s chance of healing, so it is better not to prescribe it.
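The backdoor adjustment formula reported by `showCausalImpact` can also be evaluated by hand as a sanity check. This is a minimal sketch in plain Python, assuming only the CPT values of `m1`; `interventional_healed` and `do_effect` are illustrative names, not pyAgrum functions.

```python
# Backdoor adjustment over Gender, using the CPT values of m1:
# P(Healed | do(Drug)) = sum_g P(Healed | Drug, g) * P(g)
p_gender = {"F": 0.5, "M": 0.5}  # P(Gender)
p_healed = {                     # P(Patient=Healed | Drug, Gender)
    ("Without", "F"): 0.8, ("With", "F"): 0.7,
    ("Without", "M"): 0.4, ("With", "M"): 0.2,
}


def interventional_healed(drug):
    """Average the gender-specific healing rates with the population weights P(g)."""
    return sum(p_healed[(drug, g)] * p_gender[g] for g in p_gender)


do_effect = {drug: interventional_healed(drug) for drug in ("Without", "With")}
print({drug: round(p, 4) for drug, p in do_effect.items()})
# prints {'Without': 0.6, 'With': 0.45}
```

Unlike the observational computation, the weights here are $P(Gender)$ rather than $P(Gender \mid Drug)$: intervening on $Drug$ removes the influence of $Gender$ on who gets treated.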

Simpson's paradox solved by interventions

To summarize, the paradox appears when $Drug$ is wrongly treated as an observation:

gnb.sideBySide(
  getCuredObservedProba(m1, {}),
  getCuredObservedProba(m1, {"Gender": "F"}),
  getCuredObservedProba(m1, {"Gender": "M"}),
  captions=[
    r"$P(Patient = Healed \mid Drug )$<br/>Taking $Drug$ is observed as efficient to cure",
    r"$P(Patient = Healed \mid Gender=F,Drug)$<br/>except if the $gender$ of the patient is female",
    r"$P(Patient = Healed \mid Gender=M,Drug)$<br/>... or male.",
  ],
)

Drug
Without	With
0.5000	0.5750

$P(Patient = Healed \mid Drug )$
Taking $Drug$ is observed as efficient to cure

Drug
Without	With
0.8000	0.7000

$P(Patient = Healed \mid Gender=F,Drug)$
except if the $gender$ of the patient is female

Drug
Without	With
0.4000	0.2000

$P(Patient = Healed \mid Gender=M,Drug)$
... or male.

... and disappears when $Drug$ is treated as an intervention:

gnb.sideBySide(
  csl.causalImpact(d1, on="Patient", doing="Drug", values={"Patient": "Healed"})[1],
  csl.causalImpact(d1, on="Patient", doing="Drug", knowing={"Gender"}, values={"Patient": "Healed", "Gender": "F"})[1],
  csl.causalImpact(d1, on="Patient", doing="Drug", knowing={"Gender"}, values={"Patient": "Healed", "Gender": "M"})[1],
  captions=[
    r"$P(Patient = 1 \mid \text{do}(Drug) )$<br/>Effectively $Drug$ taking is not efficient to cure",
    r"$P(Patient = 1 \mid \text{do}(Drug), gender=F )$<br/>, the $gender$ of the patient being female",
    r"$P(Patient = 1 \mid \text{do}(Drug), gender=M )$<br/>, ... or male.",
  ],
)

Drug
Without	With
0.6000	0.4500

$P(Patient = 1 \mid \text{do}(Drug) )$
Effectively $Drug$ taking is not efficient to cure

Drug
Without	With
0.8000	0.7000

$P(Patient = 1 \mid \text{do}(Drug), gender=F )$
, the $gender$ of the patient being female

Drug
Without	With
0.4000	0.2000

$P(Patient = 1 \mid \text{do}(Drug), gender=M )$
, ... or male.
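The gender-specific tables are identical under observation and intervention, and only the aggregate changes. A short derivation sketch, assuming the causal graph of `d1` (where $Gender$ is the only common cause of $Drug$ and $Patient$), shows why: once $Gender$ is fixed, the backdoor path is blocked, so the two aggregates differ only in how the gender-specific rates are weighted.

```latex
\begin{align*}
P(Patient \mid \text{do}(Drug), Gender) &= P(Patient \mid Drug, Gender) \\
P(Patient \mid Drug) &= \sum_{Gender} P(Patient \mid Drug, Gender) \cdot P(Gender \mid Drug) \\
P(Patient \mid \text{do}(Drug)) &= \sum_{Gender} P(Patient \mid Drug, Gender) \cdot P(Gender)
\end{align*}
```

In this model $P(Gender = F \mid Drug = With) = 0.75$ while $P(Gender = F) = 0.5$, which is what turns the observed 0.575 into the interventional 0.45.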
