Learning the structure of a Bayesian network
![]() | ![]() |
# %matplotlib inline## from pylab import *import mathimport matplotlib.pyplot as plt
import pyagrum as gumimport pyagrum.lib.notebook as gnbimport pyagrum.explain as explainimport pyagrum.lib.bn_vs_bn as bnvsbn
gum.about()gnb.configuration()pyAgrum 2.3.0.9(c) 2015-2024 Pierre-Henri Wuillemin, Christophe Gonzales
This is free software; see the source code for copying conditions. There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. For details, see 'pyagrum.warranty'.| Library | Version |
|---|---|
| OS | posix [darwin] |
| Python | 3.14.0 (main, Oct 7 2025, 09:34:52) [Clang 17.0.0 (clang-1700.3.19.1)] |
| IPython | 9.6.0 |
| Matplotlib | 3.10.7 |
| Numpy | 2.3.4 |
| pyDot | 4.0.1 |
| pyAgrum | 2.3.0.9 |
Generating the database from a BN
Section titled “Generating the database from a BN”bn = gum.loadBN("res/asia.bif")bngum.generateSample(bn, 50000, "out/sample_asia.csv", True);out/sample_asia.csv: 0%| |
out/sample_asia.csv: 100%|█████████████████████████████████|
Log2-Likelihood : -161108.69919616362with open("out/sample_asia.csv", "r") as src: for _ in range(10): print(src.readline(), end="")bronchitis,tuberculos_or_cancer,tuberculosis,dyspnoea,positive_XraY,visit_to_Asia,smoking,lung_cancer1,1,1,1,1,1,0,11,1,1,1,1,1,1,11,1,1,1,1,1,0,11,1,1,1,1,1,1,10,1,1,0,1,1,1,11,1,1,1,1,1,1,11,1,1,1,1,1,1,11,1,1,1,1,1,1,10,1,1,0,1,1,1,1learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variablesprint(learner)Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : MIICCorrection : MDLPrior : -print(f"Row of visit_to_Asia : {learner.idFromName('visit_to_Asia')}") # first row is 0Row of visit_to_Asia : 0print(f"Variable in row 4 : {learner.nameFromId(4)}")Variable in row 4 : lung_cancerThe BNLearner is capable of recognizing missing values in databases. For this purpose, just indicate as a last argument the list of the strings that represent missing values.
## it is possible to add as a last argument a list of the symbols that represent missing values:## whenever a cell of the database is equal to one of these strings, it is considered as a## missing valuelearner = gum.BNLearner("res/asia_missing.csv", bn, ["?", "N/A"])print(f"Are there missing values in the database ? {learner.state()['Missing values'][0]}")Are there missing values in the database ? Truetype induction
Section titled “type induction”When reading a csv file, BNLearner can try to find the correct type for discrete variable. Especially for numeric values.
%%writefile out/testTypeInduction.csvA,B,C,D1,2,0,hot0,3,-2,cold0,1,2,hot1,2,2,warmOverwriting out/testTypeInduction.csvprint("* by default, type induction is on (True) :")learner = gum.BNLearner("out/testTypeInduction.csv")bn3 = learner.learnBN()for v in sorted(bn3.names()): print(f" - {bn3.variable(v)}")
print("")print("* but you can disable it :")learner = gum.BNLearner("out/testTypeInduction.csv", ["?"], False)bn3 = learner.learnBN()for v in sorted(bn3.names()): print(f" - {bn3.variable(v)}")
print("")print("Note that when a Labelized variable is found, the labesl are alphabetically sorted.")* by default, type induction is on (True) : - A:Range([0,1]) - B:Range([1,3]) - C:Integer({-2|0|2}) - D:Labelized({cold|hot|warm})
* but you can disable it : - A:Labelized({0|1}) - B:Labelized({1|2|3}) - C:Labelized({-2|0|2}) - D:Labelized({cold|hot|warm})
Note that when a Labelized variable is found, the labesl are alphabetically sorted.Parameters learning from the database
Section titled “Parameters learning from the database”We give the as a parameter for the learner in order to have the variables and the order of the labels for each variables. Please try to remove the argument in the first line below to see the difference …
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variables and labelsbn2 = learner.learnParameters(bn.dag())gnb.showBN(bn2)gnb.sideBySide( "<H3>Original BN</H3>", "<H3>Learned NB</H3>", bn.cpt("visit_to_Asia"), bn2.cpt("visit_to_Asia"), bn.cpt("tuberculosis"), bn2.cpt("tuberculosis"), ncols=2,)Structural learning a BN from the database
Section titled “Structural learning a BN from the database”Note that, currently, the BNLearner is not yet able to learn in the presence of missing values. This is the reason why, when it discovers that there exist such values, it raises a gum.MissingValueInDatabase exception.
with open("res/asia_missing.csv", "r") as asiafile: for _ in range(10): print(asiafile.readline(), end="")try: learner = gum.BNLearner("res/asia_missing.csv", bn, ["?", "N/A"]) bn2 = learner.learnBN()except gum.MissingValueInDatabase: print("exception raised: there are missing values in the database")smoking,lung_cancer,bronchitis,visit_to_Asia,tuberculosis,tuberculos_or_cancer,dyspnoea,positive_XraY0,0,0,1,1,0,0,01,1,0,1,1,1,0,11,1,1,1,1,1,1,11,1,0,1,1,1,0,N/A0,1,0,1,1,1,1,11,1,1,1,1,1,1,11,1,1,1,1,1,0,11,1,0,1,1,1,0,11,1,1,1,1,1,1,1exception raised: there are missing values in the databaseDifferent learning algorithms
Section titled “Different learning algorithms”For now, there are three scored-based algorithms that are wrapped in pyAgrum : LocalSearchWithTabuList, GreedyHillClimbing and K2
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useLocalSearchWithTabuList()print(learner)bn2 = learner.learnBN()print("Learned in {0}ms".format(1000 * learner.currentTime()))gnb.flow.row(bn, bn2, explain.getInformation(bn2), captions=["Original BN", "Learned BN", "information"])Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : Local Search with Tabu ListTabu list size : 2Score : BDeuPrior : -
Learned in 15.564msTo apprehend the distance between the original and the learned BN, we have several tools :
- Compute the KL divergence (and other distance) between original and learned joint distribution
kl = gum.ExactBNdistance(bn, bn2)kl.compute(){'klPQ': 0.0002454972790806624, 'errorPQ': 0, 'klQP': 0.0002170115306462794, 'errorQP': 128, 'hellinger': 0.009492813820926288, 'bhattacharya': 4.5052130085963663e-05, 'jensen-shannon': 6.276601520482401e-05}- Compute some scores on the BNs (as binary classifiers) abd show the graphical diff between the two graphs
gnb.flow.row(bn, bn2, captions=["bn", "bn2"])gnb.flow.row( bnvsbn.graphDiff(bn, bn2), bnvsbn.graphDiff(bn2, bn), bnvsbn.graphDiffLegend(), captions=["bn versus bn2", "bn2 versus bn", ""],)
gcmp = bnvsbn.GraphicalBNComparator(bn, bn2)gnb.flow.add_html( "<br/>".join([f"{k} : {v:.2f}" for k, v in gcmp.skeletonScores().items() if k != "count"]), "Skeleton scores")gnb.flow.add_html("<br/>".join([f"{k} : {v:.2f}" for k, v in gcmp.scores().items() if k != "count"]), "Scores")
gnb.flow.display()precision : 0.73
fscore : 0.84
dist2opt : 0.27
precision : 0.27
fscore : 0.32
dist2opt : 0.96
A greedy Hill Climbing algorithm (with insert, remove and change arc as atomic operations).
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useGreedyHillClimbing()print(learner)bn2 = learner.learnBN()print("Learned in {0}ms".format(1000 * learner.currentTime()))gnb.sideBySide( bn, bn2, gnb.getBNDiff(bn, bn2), explain.getInformation(bn2), captions=["Original BN", "Learned BN", "Graphical diff", "information"],)Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : Greedy Hill ClimbingScore : BDeuPrior : -
Learned in 13.482msAnd a K2 for those who likes it :)
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useK2([0, 1, 2, 3, 4, 5, 6, 7])print(learner)bn2 = learner.learnBN()print("Learned in {0}ms".format(1000 * learner.currentTime()))gnb.sideBySide( bn, bn2, gnb.getBNDiff(bn, bn2), explain.getInformation(bn2), captions=["Original BN", "Learned BN", "Graphical diff", "information"],)Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : K2K2 order : visit_to_Asia, tuberculosis, tuberculos_or_cancer, positive_XraY, lung_cancer, smoking, bronchitis, dyspnoeaScore : BDeuPrior : -
Learned in 6.264msK2 can be very good if the order is the good one (a topological order of nodes in the reference)
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useK2([7, 6, 5, 4, 3, 2, 1, 0])print(learner)bn2 = learner.learnBN()print("Learned in {0}s".format(learner.currentTime()))gnb.sideBySide( bn, bn2, gnb.getBNDiff(bn, bn2), explain.getInformation(bn2), captions=["Original BN", "Learned BN", "Graphical diff", "information"],)Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : K2K2 order : dyspnoea, bronchitis, smoking, lung_cancer, positive_XraY, tuberculos_or_cancer, tuberculosis, visit_to_AsiaScore : BDeuPrior : -
Learned in 0.007757sFollowing the learning curve
Section titled “Following the learning curve”import numpy as np
%matplotlib inline
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useLocalSearchWithTabuList()
## we could prefere a log2likelihood score## learner.useScoreLog2Likelihood()learner.setMaxTime(10)
## representation of the error as a pseudo log (negative values really represents negative epsilon@np.vectorizedef pseudolog(x): res = np.log(x) # np.log(y)
return res if x > 0 else -res
## in order to control the complexity, we limit the number of parentslearner.setMaxIndegree(7) # no more than 3 parent by nodelearner.setEpsilon(1e-10)gnb.animApproximationScheme(learner, scale=pseudolog) # scale by default is np.log10
bn2 = learner.learnBN()Customizing the learning algorithms
Section titled “Customizing the learning algorithms”1. Learn a tree ?
Section titled “1. Learn a tree ?”learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useGreedyHillClimbing()
learner.setMaxIndegree(1) # no more than 1 parent by nodeprint(learner)bntree = learner.learnBN()gnb.sideBySide( bn, bntree, gnb.getBNDiff(bn, bntree), explain.getInformation(bntree), captions=["Original BN", "Learned BN", "Graphical diff", "information"],)Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : Greedy Hill ClimbingScore : BDeuPrior : -Constraint Max InDegree : 12. with prior structural knowledge
Section titled “2. with prior structural knowledge”learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useGreedyHillClimbing()
## I know that smoking causes cancerlearner.addMandatoryArc("smoking", "lung_cancer") # smoking->lung_cancer## I know that visit to Asia may change the risk of tuberculosislearner.addMandatoryArc("visit_to_Asia", "tuberculosis") # visit_to_Asia->tuberculosisprint(learner)bn2 = learner.learnBN()gnb.sideBySide( bn, bn2, gnb.getBNDiff(bn, bn2), explain.getInformation(bn2), captions=["Original BN", "Learned BN", "Graphical diff", "information"],)Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : Greedy Hill ClimbingScore : BDeuPrior : -Constraint Mandatory Arcs : {visit_to_Asia->tuberculosis, smoking->lung_cancer}3. changing the scores
Section titled “3. changing the scores”By default, a BDEU score is used. But it can be changed.
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useGreedyHillClimbing()
## I know that smoking causes cancerlearner.addMandatoryArc(0, 1)
## we prefere a log2likelihood scorelearner.useScoreLog2Likelihood()
## in order to control the complexity, we limit the number of parentslearner.setMaxIndegree(1) # no more than 1 parent by nodeprint(learner)bn2 = learner.learnBN()kl = gum.ExactBNdistance(bn, bn2)gnb.sideBySide( bn, bn2, gnb.getBNDiff(bn, bn2), "<br/>".join(["<b>" + k + "</b> :" + str(v) for k, v in kl.compute().items()]), captions=["original", "learned BN", "diff", "distances"],)Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : Greedy Hill ClimbingScore : Log2LikelihoodPrior : -Constraint Max InDegree : 1Constraint Mandatory Arcs : {visit_to_Asia->tuberculosis}4. comparing BNs
Section titled “4. comparing BNs”There are multiple ways to compare Bayes net…
help(gnb.getBNDiff)Help on function getBNDiff in module pyagrum.lib.notebook:
getBNDiff(bn1, bn2, size=None, noStyle=False) get a HTML string representation of a graphical diff between the arcs of _bn1 (reference) with those of _bn2.
if `noStyle` is False use 4 styles (fixed in pyagrum.config) : - the arc is common for both - the arc is common but inverted in `bn2` - the arc is added in `bn2` - the arc is removed in `bn2`
Parameters ---------- bn1: pyagrum.BayesNet the reference bn2: pyagrum.BayesNet the compared one size: float|str size (for graphviz) of the rendered graph noStyle: bool with style or not.
Returns ------- str the HTML representation of the comparisongnb.showBNDiff(bn, bn2)import pyagrum.lib.bn_vs_bn as gbnbn
help(gbnbn.graphDiff)Help on function graphDiff in module pyagrum.lib.bn_vs_bn:
graphDiff(bnref, bncmp, noStyle=False) Return a pydot graph that compares the arcs of bnref to bncmp. graphDiff allows bncmp to have less nodes than bnref. (this is not the case in GraphicalBNComparator.dotDiff())
if noStyle is False use 4 styles (fixed in pyagrum.config) : - the arc is common for both - the arc is common but inverted in _bn2 - the arc is added in _bn2 - the arc is removed in _bn2
See graphDiffLegend() to add a legend to the graph. Warning ------- if pydot is not installed, this function just returns None
Returns ------- pydot.Dot the result dot graph or None if pydot can not be importedgbnbn.GraphicalBNComparator?gcmp = gbnbn.GraphicalBNComparator(bn, bn2)gnb.sideBySide( bn, bn2, gcmp.dotDiff(), gbnbn.graphDiffLegend(), bn2, bn, gbnbn.graphDiff(bn2, bn), gbnbn.graphDiffLegend(), ncols=4)print("But also gives access to different scores :")print(gcmp.scores())print(gcmp.skeletonScores())print(gcmp.hamming())But also gives access to different scores :{'count': {'tp': 3, 'tn': 44, 'fp': 4, 'fn': 5}, 'recall': 0.375, 'precision': 0.42857142857142855, 'fscore': 0.39999999999999997, 'dist2opt': 0.8468504072413839}{'count': {'tp': 6, 'tn': 19, 'fp': 1, 'fn': 2}, 'recall': 0.75, 'precision': 0.8571428571428571, 'fscore': 0.7999999999999999, 'dist2opt': 0.2879377767249482}{'hamming': 3, 'structural hamming': 7}print("KL divergence can be computed")kl = gum.ExactBNdistance(bn, bn2)kl.compute()KL divergence can be computed
{'klPQ': 0.122659583144537, 'errorPQ': 0, 'klQP': 0.035362312624309146, 'errorQP': 64, 'hellinger': 0.2008652891479617, 'bhattacharya': 0.020379688845562502, 'jensen-shannon': 0.0233148500701501}5. Mixing algorithms
Section titled “5. Mixing algorithms”First we learn a structure with HillClimbing (faster ?)
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useGreedyHillClimbing()learner.addMandatoryArc(0, 1)bn2 = learner.learnBN()kl = gum.ExactBNdistance(bn, bn2)gnb.sideBySide( bn, bn2, gnb.getBNDiff(bn, bn2), "<br/>".join(["<b>" + k + "</b> :" + str(v) for k, v in kl.compute().items()]), captions=["original", "learned BN", "diff", "distances"],)And then we refine with tabuList
learner = gum.BNLearner("out/sample_asia.csv", bn) # using bn as template for variableslearner.useLocalSearchWithTabuList()
learner.setInitialDAG(bn2.dag())print(learner)bn3 = learner.learnBN()kl = gum.ExactBNdistance(bn, bn3)gnb.sideBySide( bn, bn2, gnb.getBNDiff(bn, bn2), "<br/>".join(["<b>" + k + "</b> :" + str(v) for k, v in kl.compute().items()]), captions=["original", "learned BN", "diff", "distances"],)Filename : out/sample_asia.csvSize : (50000,8)Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2]Induced types : FalseMissing values : FalseAlgorithm : Local Search with Tabu ListTabu list size : 2Score : BDeuPrior : -Initial DAG : True (digraph { 0; 1; 2; 3; 4; 5; 6; 7;
0 -> 1; 6 -> 7; 2 -> 3; 6 -> 5; 2 -> 7; 4 -> 5; 4 -> 2; 1 -> 2; 6 -> 4;}
)Impact of the size of the database for the learning
Section titled “Impact of the size of the database for the learning”rows = 3sizes = [400, 500, 700, 1000, 2000, 5000, 10000, 50000, 75000, 100000, 150000, 175000, 200000, 300000, 500000]
def extract_asia(n): """ extract n line from asia.csv to extract.csv """ with open("out/sample_asia.csv", "r") as src: with open("out/extract_asia.csv", "w") as dst: for _ in range(n + 1): print(src.readline(), end="", file=dst)gnb.flow.clear()nbr = 0l = []for i in sizes: extract_asia(i) learner = gum.BNLearner("out/extract_asia.csv", bn) # using bn as template for variables learner.useGreedyHillClimbing() print(learner.state()["Size"][0]) bn2 = learner.learnBN()
kl = gum.ExactBNdistance(bn, bn2) r = kl.compute() l.append(math.log(r["klPQ"]))
gnb.flow.add(gnb.getBNDiff(bn, bn2, size="3!"), f"size={i}")
gnb.flow.display()plt.plot(sizes, l)print(f"final value computed : {l[-1]}")(400,8)(500,8)
(700,8)(1000,8)
(2000,8)(5000,8)
(10000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)final value computed : -8.33048744688638gnb.flow.clear()nbr = 0l = []for i in sizes: extract_asia(i) learner = gum.BNLearner("out/extract_asia.csv", bn) # using bn as template for variables learner.useMIIC() print(learner.state()["Size"][0]) bn2 = learner.learnBN()
kl = gum.ExactBNdistance(bn, bn2) r = kl.compute() l.append(math.log(r["klPQ"]))
gnb.flow.add(gnb.getBNDiff(bn, bn2, size="3!"), f"size={i}")
gnb.flow.display()plt.plot(sizes, l)print(l[-1])(400,8)
(500,8)
(700,8)
(1000,8)
(2000,8)
(5000,8)
(10000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)
(50000,8)-8.42789925936379
