Discretization using pyAgrum's DiscreteTypeProcessor
Most of the functionality of pyAgrum works only on discrete data. However, data in the real world can often be continuous. This class can be used to create discretized variables from continuous data. Since this class was made for the purposes of the class BNClassifier, it accepts data in the form of ndarrays. To transform data from a csv file to an ndarray we can use the BNClassifier method XYfromCSV.
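The idea behind such a split is simple: separate the target column from the feature columns. Here is a rough, hypothetical equivalent using only pandas (the helper name `xy_from_csv` is an assumption for illustration, not pyAgrum's API):

```python
# Hypothetical helper (xy_from_csv is NOT pyAgrum's API): splitting a CSV
# into a feature matrix X and a label vector y, in the spirit of XYfromCSV.
import pandas as pd

def xy_from_csv(source, target):
    """Read a CSV (path or file-like) and split off the `target` column."""
    df = pd.read_csv(source)
    y = df[target]                  # the class labels
    X = df.drop(columns=[target])   # all remaining columns are features
    return X, y
```

With pyAgrum itself you would instead call the XYfromCSV method of a BNClassifier instance.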
Creation of an instance and setting parameters
To create an instance of this class we need to specify the default parameters (the discretization method and the number of bins) for discretizing data. We create a type_processor which uses the EWD (Equal Width Discretization) method with 5 bins. The threshold is used for determining whether a variable is already discretized: in this case, a variable with more than 10 unique values is treated as continuous. We can use the setDiscretizationParameters method to set the discretization parameters for a specific variable.
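As an aside, the "uniform" (EWD) method itself is easy to picture: the observed range is cut into k bins of identical width. A minimal numpy sketch (not pyAgrum's implementation):

```python
# A minimal numpy sketch of Equal Width Discretization (not pyAgrum's code):
# the observed range [min, max] is split into k bins of identical width.
import numpy as np

def equal_width_bins(values, k=5):
    """Return the k+1 bin edges and the bin index of each value."""
    lo, hi = min(values), max(values)
    edges = np.linspace(lo, hi, k + 1)  # k bins need k+1 edges
    # np.digitize maps each value to a bin; clip so the maximum lands in the last bin
    idx = np.clip(np.digitize(values, edges) - 1, 0, k - 1)
    return edges, idx

edges, idx = equal_width_bins([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], k=5)
# edges -> [1., 3., 5., 7., 9., 11.]
```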
```python
%load_ext autoreload
%autoreload 2
```

```python
import pyagrum.skbn as skbn
from pyagrum.lib.discreteTypeProcessor import DiscreteTypeProcessor
```

```python
type_processor = DiscreteTypeProcessor(
    defaultDiscretizationMethod="uniform",
    defaultNumberOfBins=5,
    discretizationThreshold=10,
)
```

Auditing data
To see how certain data will be treated by the type_processor we can use the audit method.
```python
import pandas

X = pandas.DataFrame.from_dict(
    {
        "var1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3],
        "var2": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n"],
        "var3": [1, 2, 5, 1, 2, 5, 1, 2, 5, 1, 2, 5, 1, 2],
        "var4": [1.11, 2.213, 3.33, 4.23, 5.42, 6.6, 7.5, 8.9, 9.19, 10.11, 11.12, 12.21, 13.3, 14.5],
        "var5": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1],
    }
)
print(X)

auditDict = type_processor.audit(X)

print()
print("** audit **")
for var in auditDict:
    print(f"- {var} : ")
    for k, v in auditDict[var].items():
        print(f"  + {k} : {v}")
```

```
    var1 var2  var3    var4  var5
0      1    a     1   1.110     1
1      2    b     2   2.213     2
2      3    c     5   3.330     3
3      4    d     1   4.230     4
4      5    e     2   5.420     5
5      6    f     5   6.600     6
6      7    g     1   7.500     7
7      8    h     2   8.900     8
8      9    i     5   9.190     9
9     10    j     1  10.110    10
10    11    k     2  11.120    11
11     1    l     5  12.210    12
12     2    m     1  13.300    13
13     3    n     2  14.500     1

** audit **
- var1 : 
  + method : uniform
  + nbBins : 5
  + type : Continuous
  + minInData : 1
  + maxInData : 11
- var2 : 
  + method : NoDiscretization
  + values : ['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n']
  + type : Discrete
- var3 : 
  + method : NoDiscretization
  + values : [1 2 5]
  + type : Discrete
- var4 : 
  + method : uniform
  + nbBins : 5
  + type : Continuous
  + minInData : 1.11
  + maxInData : 14.5
- var5 : 
  + method : uniform
  + nbBins : 5
  + type : Continuous
  + minInData : 1
  + maxInData : 13
```

We can see that even though var2 has more unique values than the threshold (10), it is treated as a discrete variable. This is because its values are strings and therefore cannot be discretized.
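The decision rule suggested by this audit can be sketched as follows (an assumption inferred from the behaviour above, not pyAgrum's actual code): a column is a candidate for discretization only if it is numeric and has more unique values than the threshold.

```python
# Sketch of the audit's type rule (an assumption inferred from the output
# above, not pyAgrum's actual code): only numeric columns with more unique
# values than the threshold are treated as continuous.
import pandas as pd

def looks_continuous(col: pd.Series, threshold: int = 10) -> bool:
    if not pd.api.types.is_numeric_dtype(col):
        return False                  # strings stay discrete, like var2
    return col.nunique() > threshold  # few distinct values -> discrete, like var3
```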
Now we would like to discretize var1 using k-means and var4 using deciles (10 quantile bins), and we would like var3 to stay undiscretized but with all the values from 1 to 5…
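For intuition, "quantile" discretization with 10 bins (deciles) places the edges at the 0%, 10%, …, 100% percentiles, so each bin receives roughly the same number of points. A small numpy sketch (not pyAgrum's implementation):

```python
# A numpy sketch of quantile discretization (not pyAgrum's code): for k bins,
# the edges are the 0%, 100/k %, ..., 100% percentiles, so every bin holds
# roughly len(values)/k data points.
import numpy as np

def quantile_edges(values, k=10):
    """Bin edges such that each of the k bins contains ~len(values)/k points."""
    return np.percentile(values, np.linspace(0, 100, k + 1))

data = np.arange(1, 101)           # the integers 1..100
edges = quantile_edges(data, k=10)
# each decile bin [edges[i], edges[i+1]) holds about 10 of the 100 points
```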
```python
type_processor = DiscreteTypeProcessor(
    defaultDiscretizationMethod="uniform",
    defaultNumberOfBins=5,
    discretizationThreshold=10,
)

type_processor.setDiscretizationParameters("var1", "kmeans")
type_processor.setDiscretizationParameters("var4", "quantile", 10)
type_processor.setDiscretizationParameters(
    "var3", "NoDiscretization", "[1,5]"
)  # same format for type as pyagrum.fastVar

auditDict = type_processor.audit(X)

print()
print("** audit **")
for var in auditDict:
    print(f"- {var} : ")
    for k, v in auditDict[var].items():
        print(f"  + {k} : {v}")
```

```
** audit **
- var1 : 
  + method : kmeans
  + param : 5
  + type : Continuous
  + minInData : 1
  + maxInData : 11
- var2 : 
  + method : NoDiscretization
  + values : ['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n']
  + type : Discrete
- var3 : 
  + method : NoDiscretization
  + param : [1,5]
  + type : Discrete
- var4 : 
  + method : quantile
  + param : 10
  + type : Continuous
  + minInData : 1.11
  + maxInData : 14.5
- var5 : 
  + method : uniform
  + nbBins : 5
  + type : Continuous
  + minInData : 1
  + maxInData : 13
```

Creating template BN from data
To create a template BN (and its variables) from data we can use the createVariable method for each column, or directly the discretizedTemplate method on the whole data matrix. This uses the parameters that we have already set to create discrete (or discretized) variables from our data.
```python
template_bn = type_processor.discretizedTemplate(X)

print(template_bn)
print(template_bn["var1"])
print(template_bn["var2"])
print(template_bn["var3"])
print(template_bn["var4"])
print(template_bn["var5"])
```

```
BN{nodes: 5, arcs: 0, domainSize: 17500, dim: 34, mem: 312o}
var1:Discretized(<(1;2.625[,[2.625;5.125[,[5.125;7.5[,[7.5;9.5[,[9.5;11)>)
var2:Labelized({a|b|c|d|e|f|g|h|i|j|k|l|m|n})
var3:Range([1,5])
var4:Discretized(<(1.11;2.213[,[2.213;3.33[,[3.33;5.42[,[5.42;6.6[,[6.6;8.2[,[8.2;9.19[,[9.19;10.11[,[10.11;12.21[,[12.21;13.3[,[13.3;14.5)>)
var5:Discretized(<(1;3.4[,[3.4;5.8[,[5.8;8.2[,[8.2;10.6[,[10.6;13)>)
```

For supervised discretization algorithms (MDLP and CAIM), the list of class labels for each data point is also needed.
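To see why the labels matter, here is a toy illustration of the MDLP idea: choose the cut point that minimizes the class entropy of the two resulting bins. This is a simplified single-split sketch, not pyAgrum's implementation:

```python
# Toy sketch of the supervised (MDLP-style) idea: pick the cut point that
# minimizes the weighted class entropy of the two resulting bins.
# Simplified single split, NOT pyAgrum's implementation.
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)) if c)

def best_cut(values, labels):
    """Return the cut point minimizing the weighted entropy of the two halves."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    n = len(pairs)
    # evaluate every split position and keep the one with the lowest entropy
    i, _ = min(
        ((i, (i * entropy(ys[:i]) + (n - i) * entropy(ys[i:])) / n)
         for i in range(1, n)),
        key=lambda t: t[1],
    )
    return (xs[i - 1] + xs[i]) / 2  # midpoint between the two sides

# the labels separate cleanly around 5, so the cut lands between 4 and 6
print(best_cut([1, 2, 3, 4, 6, 7, 8, 9], [0, 0, 0, 0, 1, 1, 1, 1]))  # -> 5.0
```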
```python
y = [True, False, False, True, False, False, True, True, False, False, True, True, False, True]

type_processor.setDiscretizationParameters("var4", "CAIM")
template_bn = type_processor.discretizedTemplate(X, y)
print(template_bn["var4"])

type_processor.setDiscretizationParameters("var4", "MDLP")
template_bn = type_processor.discretizedTemplate(X, y)
print(template_bn["var4"])
```

```
var4:Discretized(<(1.11;10.614999999999998[,[10.614999999999998;14.5)>)
var4:Discretized(<(1.11;1.6615000000000002[,[1.6615000000000002;14.5)>)
```

The type_processor keeps track of the number of discretized variables it has created and the number of bins used to discretize them. To reset these two counters to 0 we can use the clear method. We can also use it to clear the specific parameters we have set for each variable.
```python
print(f"numberOfContinuous : {type_processor.numberOfContinuous}")
print(f"totalNumberOfBins : {type_processor.totalNumberOfBins}")

type_processor.clear()
print("\n")

print(f"numberOfContinuous : {type_processor.numberOfContinuous}")
print(f"totalNumberOfBins : {type_processor.totalNumberOfBins}")

type_processor.audit(X)
```

```
numberOfContinuous : 9
totalNumberOfBins : 44


numberOfContinuous : 0
totalNumberOfBins : 0

{'var1': {'method': 'kmeans', 'param': 5, 'type': 'Continuous', 'minInData': 1, 'maxInData': 11},
 'var2': {'method': 'NoDiscretization', 'type': 'Discrete'},
 'var3': {'method': 'NoDiscretization', 'param': '[1,5]', 'type': 'Discrete'},
 'var4': {'method': 'MDLP', 'param': 5, 'type': 'Continuous', 'minInData': 1.11, 'maxInData': 14.5},
 'var5': {'method': 'uniform', 'nbBins': 5, 'type': 'Continuous', 'minInData': 1, 'maxInData': 13}}
```

```python
type_processor.clear(True)
type_processor.audit(X)
```

```
{'var1': {'method': 'uniform', 'nbBins': 5, 'type': 'Continuous', 'minInData': 1, 'maxInData': 11},
 'var2': {'method': 'NoDiscretization', 'values': array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n'], dtype=object), 'type': 'Discrete'},
 'var3': {'method': 'NoDiscretization', 'values': array([1, 2, 5], dtype=object), 'type': 'Discrete'},
 'var4': {'method': 'uniform', 'nbBins': 5, 'type': 'Continuous', 'minInData': 1.11, 'maxInData': 14.5},
 'var5': {'method': 'uniform', 'nbBins': 5, 'type': 'Continuous', 'minInData': 1, 'maxInData': 13}}
```

Using DiscreteTypeProcessor with BNClassifier
```python
import pyagrum as gum
import pyagrum.lib.notebook as gnb
import pandas as pd
```

```python
X = pandas.DataFrame.from_dict(
    {
        "var1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3],
        "var2": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n"],
        "var3": [1, 2, 5, 1, 2, 5, 1, 2, 5, 1, 2, 5, 1, 2],
        "var4": [1.11, 2.213, 3.33, 4.23, 5.42, 6.6, 7.5, 8.9, 9.19, 10.11, 11.12, 12.21, 13.3, 14.5],
        "var5": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1],
    }
)
Y = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
```
```python
classif = skbn.BNClassifier(learningMethod="TAN")

## by default, the number of bins is 5
classif.type_processor.setDiscretizationParameters("var1", "kmeans")
# ... but 10 for var4
classif.type_processor.setDiscretizationParameters("var4", "quantile", 10)
## in the database, var3 only takes values 1, 2 or 5 but 3 and 4 are also possible
classif.type_processor.setDiscretizationParameters("var3", "NoDiscretization", "[1, 5]")

classif.fit(X, Y)

gnb.showInference(classif.bn)
```

Using DiscreteTypeProcessor with BNLearner
```python
import pyagrum.lib.notebook as gnb
import pyagrum.skbn as skbn

file_name = "res/discretizable.csv"
data = pd.read_csv(file_name)

type_processor = DiscreteTypeProcessor(
    defaultDiscretizationMethod="quantile",
    defaultNumberOfBins=10,
    discretizationThreshold=25,
)
## creating a template describing the variables proposed by the type_processor.
## These variables will be used by the learner.
template = type_processor.discretizedTemplate(data)

learner = gum.BNLearner(file_name, template)
learner.useMIIC()
learner.useNMLCorrection()

bn = learner.learnBN()
gnb.showInference(bn, size="10!")
```

Comparing discretization methods
Different discretizations of the same mixture of two Gaussians.
```python
import numpy as np
import pandas

N = 20000
N1 = 2 * N // 3
N2 = N - N1

classY = np.array([1] * N1 + [0] * N2)
data = pandas.DataFrame(
    data={
        "y": classY,
        # discretization using quantile (15 bins)
        "q15": np.concatenate((np.random.normal(0, 2, N1), np.random.normal(10, 2, N2))),
        # discretization using uniform (15 bins)
        "u15": np.concatenate((np.random.normal(0, 2, N1), np.random.normal(10, 2, N2))),
        # discretization using kmeans (15 bins)
        "k15": np.concatenate((np.random.normal(0, 2, N1), np.random.normal(10, 2, N2))),
        # discretization using quantile (5 bins)
        "q5": np.concatenate((np.random.normal(0, 2, N1), np.random.normal(10, 2, N2))),
        # discretization using kmeans (5 bins)
        "k5": np.concatenate((np.random.normal(0, 2, N1), np.random.normal(10, 2, N2))),
        # other discretization methods
        "caim": np.concatenate((np.random.normal(0, 2, N1), np.random.normal(10, 2, N2))),
        "mdlp": np.concatenate((np.random.normal(0, 2, N1), np.random.normal(10, 2, N2))),
        "expert": np.concatenate((np.random.normal(0, 2, N1), np.random.normal(10, 2, N2))),
    }
)
```
```python
type_processor = DiscreteTypeProcessor(
    defaultDiscretizationMethod="quantile",
    defaultNumberOfBins=15,
    discretizationThreshold=10,
)

type_processor.setDiscretizationParameters("u15", method="uniform")
type_processor.setDiscretizationParameters("k15", method="kmeans")
type_processor.setDiscretizationParameters("q5", method="quantile", parameters=5)
type_processor.setDiscretizationParameters("k5", method="kmeans", parameters=5)
type_processor.setDiscretizationParameters("caim", method="CAIM", parameters=5)
type_processor.setDiscretizationParameters("mdlp", method="MDLP", parameters=5)
type_processor.setDiscretizationParameters("expert", method="expert", parameters=[-30.0, -2, 0.2, 1, 30.0])
```

By default, the distributions of discretized variables are represented as a "histogram" (the areas of the bars are proportional to the probabilities).
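Concretely, with unequal bin widths the drawn height of a bar in the "histogram" convention is p / width, so each bar's area equals its probability p. A tiny numeric sketch with made-up bins:

```python
# Made-up bins: the widths and probabilities below are illustrative only.
widths = [2.0, 0.5, 4.0]    # unequal bin widths
probs  = [0.5, 0.25, 0.25]  # probability mass of each bin

# "histogram" convention: height = p / width, so area = height * width = p
heights = [p / w for p, w in zip(probs, widths)]
# heights -> [0.25, 0.5, 0.0625]
```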
```python
template = type_processor.discretizedTemplate(data, y=classY, possibleValuesY=[0, 1])
for i, n in template:
    print(f"{n:7} : {template.variable(i)}")
```

```
y       : y:Range([0,1])
q15     : q15:Discretized(<(-7.459116957558915;-2.552211366462513[,[-2.552211366462513;-1.7064150884358111[,[-1.7064150884358111;-1.037663601850948[,[-1.037663601850948;-0.5104683994630105[,[-0.5104683994630105;-0.012045383153075254[,[-0.012045383153075254;0.4922450952076254[,[0.4922450952076254;1.049992973840601[,[1.049992973840601;1.6945037503429456[,[1.6945037503429456;2.590779862431847[,[2.590779862431847;5.222231420871295[,[5.222231420871295;8.303414037595223[,[8.303414037595223;9.48881447151842[,[9.48881447151842;10.463492433112185[,[10.463492433112185;11.66799886398239[,[11.66799886398239;16.93470550963358)>)
u15     : u15:Discretized(<(-6.8353416333167;-5.249892758443412[,[-5.249892758443412;-3.664443883570125[,[-3.664443883570125;-2.078995008696838[,[-2.078995008696838;-0.49354613382354984[,[-0.49354613382354984;1.0919027410497382[,[1.0919027410497382;2.6773516159230244[,[2.6773516159230244;4.262800490796312[,[4.262800490796312;5.8482493656696[,[5.8482493656696;7.433698240542888[,[7.433698240542888;9.019147115416176[,[9.019147115416176;10.604595990289463[,[10.604595990289463;12.190044865162749[,[12.190044865162749;13.775493740036039[,[13.775493740036039;15.360942614909325[,[15.360942614909325;16.946391489782613)>)
k15     : k15:Discretized(<(-7.580996310005012;-3.7989593733853315[,[-3.7989593733853315;-2.3824235431808445[,[-2.3824235431808445;-1.2582049048977924[,[-1.2582049048977924;-0.21612222539489712[,[-0.21612222539489712;0.8489950530758026[,[0.8489950530758026;2.0477997691335954[,[2.0477997691335954;3.5920374210045267[,[3.5920374210045267;5.781141383370514[,[5.781141383370514;7.818712547092726[,[7.818712547092726;9.177751026634914[,[9.177751026634914;10.346062954209607[,[10.346062954209607;11.439646648455515[,[11.439646648455515;12.562516525034841[,[12.562516525034841;13.928075813043414[,[13.928075813043414;18.139731637499594)>)
q5      : q5:Discretized(<(-7.6306323477373175;-1.050586905349271[,[-1.050586905349271;0.4987181860397135[,[0.4987181860397135;2.5681100842241187[,[2.5681100842241187;9.495416472006855[,[9.495416472006855;17.35947878327129)>)
k5      : k5:Discretized(<(-8.311509957869559;-1.2679115975797055[,[-1.2679115975797055;1.2583727711343045[,[1.2583727711343045;5.587599643755152[,[5.587599643755152;10.194879787939714[,[10.194879787939714;17.578324739638365)>)
caim    : caim:Discretized(<(-8.78935269011173;5.300621598808158[,[5.300621598808158;17.031799063220042)>)
mdlp    : mdlp:Discretized(<(-8.785319046690597;2.784837845072029[,[2.784837845072029;3.392129675193389[,[3.392129675193389;4.794432888838657[,[4.794432888838657;5.390304598383062[,[5.390304598383062;5.777373610413742[,[5.777373610413742;6.888765162688189[,[6.888765162688189;17.732549727312225)>)
expert  : expert:Discretized(<(-30;-2[,[-2;0.2[,[0.2;1[,[1;30)>)
```

```python
learner = gum.BNLearner(data, template)

bn = gum.BayesNet(template)
for i, n in bn:
    if n != "y":
        bn.addArc("y", n)
bn
## learner.useMIIC()
## bn=learner.learnBN()
learner.fitParameters(bn)
```
```python
## dot | neato | fdp | sfdp | twopi | circo | osage | patchwork
gum.config.push()
gum.config["notebook", "graph_layout"] = "fdp"
gnb.showInference(bn, size="8!")
gum.config.pop()
```

But you can always choose to show them as "bar" (the heights of the bars are proportional to the probabilities) instead of "histogram" (the areas of the bars are proportional to the probabilities).
```python
## changing how discretized variables are visualized
gum.config.push()
gum.config["notebook", "histogram_discretized_visualisation"] = "bar"
gnb.showInference(bn, size="13!")
gum.config.pop()  # default (above) is "histogram"
gnb.showInference(bn, size="13!")
```
