Classifier using Bayesian networks
class pyagrum.skbn.BNClassifier(learningMethod='MIIC', prior=None, scoringType='BIC', constraints=None, priorWeight=1, possibleSkeleton=None, DirichletCsv=None, discretizationStrategy='quantile', discretizationNbBins=5, discretizationThreshold=25, usePR=False, beta=1, significant_digit=10)
Represents a (scikit-learn compliant) classifier which uses a Bayesian network (BN) to classify. A BNClassifier is built using
- a Bayesian network,
- a database, a learning algorithm and its parameters,
- a DiscreteTypeProcessor to discretize some variables, with a choice of algorithms.
The classifier can be used to predict the class of new data.
Warning
This class can be pickled. However, the pickled state contains only the classifier itself, not the parameters used to train it.
- Parameters:
- learningMethod (str) – The learning algorithm to use. Possible values are: Chow-Liu, NaiveBayes, TAN, MIIC + (MDL or NML), GHC, Tabu. GHC designates Greedy Hill Climbing, MIIC designates Multivariate Information based Inductive Causation, TAN designates Tree-augmented NaiveBayes, and Tabu designates Tabu list searching.
- prior (str) – The type of a priori smoothing to use. Possible values are Smoothing, BDeu, Dirichlet and NoPrior. Note: when using the Dirichlet prior, DirichletCsv cannot be None. By default (when prior is None), a smoothing of 0.01 is applied.
- scoringType (str) – The scoring method to use. Since scoring is used while constructing the network and not when learning its parameters, it is ignored by learning algorithms with a fixed network structure such as Chow-Liu, TAN or NaiveBayes. Possible values are: AIC, BIC, BD, BDeu, K2, Log2. AIC means Akaike information criterion, BIC means Bayesian information criterion, BD means Bayesian-Dirichlet scoring, BDeu means Bayesian-Dirichlet equivalent uniform, and Log2 means log2 likelihood ratio test.
- constraints (dict) – A dictionary of constraints on the structure of the Bayesian network. Ignored by learning algorithms where the structure is fixed, such as TAN or NaiveBayes. The keys of the dictionary should be the strings “PossibleEdges”, “MandatoryArcs” and “ForbiddenArcs”. The format of the values should be a tuple of strings (tail, head), which designates the arc from tail to head. For example, if we put the value (“x0”, “y”) in MandatoryArcs, the network will surely have an arc going from x0 to y.
Note: a PossibleEdges entry between nodes x and y allows either (x, y) or (y, x) (or neither) to be added to the Bayesian network, while the other constraints are not symmetric.
- priorWeight (double) – The weight used for a prior.
- possibleSkeleton (pyagrum.undigraph) – An undirected graph that serves as a possible skeleton for the Bayesian network.
- DirichletCsv (str) – The file name of the CSV file to use for the Dirichlet prior. Ignored if prior is not set to Dirichlet.
- discretizationStrategy (str) – Sets the default discretization method for this discretizer. This method is used for any variable for which the user has not specified another method with setDiscretizationParameters. Possible values are: 'quantile', 'uniform', 'kmeans', 'NML', 'CAIM' and 'MDLP'.
- discretizationNbBins (str or int) – Sets the number of bins if the method used is quantile, kmeans or uniform. In this case, this parameter can also be set to the string 'elbowMethod' so that the best number of bins is found automatically. If the method used is NML, this parameter sets the maximum number of bins up to which the NML algorithm searches for the optimal number of bins; in this case, it must be an int. If any other discretization method is used, this parameter is ignored.
- discretizationThreshold (int or float) – When using default parameters, a variable is treated as continuous only if it has more unique values than this number (if the number is an int greater than 1). If the number is a float between 0 and 1, we test whether the proportion of unique values is bigger than this number. For instance, if you have entered 0.95, the variable will be treated as continuous only if more than 95% of its values are unique.
- usePR (bool) – Indicates whether the classification threshold is chosen from the Precision-Recall curve (True) or from the ROC curve (False, the default). ROC curves should be used when there are roughly equal numbers of observations for each class. Precision-Recall curves should be used when there is a moderate to large class imbalance, especially for the target’s class.
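The constraints format described above can be sketched as follows. This is an illustration only: it assumes each dictionary value is a list of (tail, head) tuples, and the helper `is_listed` is hypothetical, not part of pyagrum. It only shows the difference between the symmetric PossibleEdges and the directed arc constraints.

```python
# Hypothetical constraints dictionary in the documented format.
# Assumption: each value is a list of (tail, head) tuples of variable names.
constraints = {
    "MandatoryArcs": [("x0", "y")],
    "ForbiddenArcs": [("y", "x1")],
    "PossibleEdges": [("x1", "x2")],
}


def is_listed(kind, tail, head, constraints):
    """Illustrative helper (not a pyagrum function).

    Tells whether the pair (tail, head) is covered by the constraint `kind`.
    PossibleEdges is symmetric; MandatoryArcs and ForbiddenArcs are directed.
    """
    pairs = constraints.get(kind, [])
    if kind == "PossibleEdges":
        return (tail, head) in pairs or (head, tail) in pairs
    return (tail, head) in pairs


print(is_listed("PossibleEdges", "x2", "x1", constraints))  # symmetric lookup
print(is_listed("MandatoryArcs", "y", "x0", constraints))   # directed lookup
```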
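The rule stated for discretizationThreshold can be written out as a small function. This is a sketch of the documented rule only, not the code pyagrum runs; the name `is_treated_as_continuous` is hypothetical.

```python
def is_treated_as_continuous(values, threshold=25):
    """Sketch of the documented discretizationThreshold rule (hypothetical helper).

    - int threshold greater than 1: continuous if the number of unique
      values exceeds the threshold;
    - float threshold between 0 and 1: continuous if the proportion of
      unique values exceeds the threshold.
    """
    n_unique = len(set(values))
    if isinstance(threshold, int) and threshold > 1:
        return n_unique > threshold
    if 0 < threshold < 1:
        return n_unique / len(values) > threshold
    raise ValueError("threshold must be an int > 1 or a float in (0, 1)")


ages = list(range(100))   # 100 distinct values
flags = [0, 1] * 50       # only 2 distinct values
print(is_treated_as_continuous(ages, 25))
print(is_treated_as_continuous(flags, 25))
print(is_treated_as_continuous(ages, 0.95))
```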
XYfromCSV(filename, with_labels=True, target=None)
Reads the data from a csv file and separates it into an X matrix and a y column vector.
- Parameters:
- filename (str) – the name of the csv file
- with_labels (bool) – tells us whether the csv includes the labels themselves or their indexes.
- target (str or None) – The name of the column that will be put in the dataframe y. If target is None, we use the target that is already specified in the classifier
- Returns: Matrix X containing the data, and column vector y containing the class for each data vector in X
- Return type: Tuple(pandas.DataFrame, pandas.DataFrame)
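The separation into X and y can be pictured with the standard library alone. The sketch below is not XYfromCSV itself: it returns plain lists instead of DataFrames, and the function name `split_xy` and the sample columns are made up for the illustration.

```python
import csv
import io


def split_xy(csv_text, target):
    """Split CSV content into feature rows X and a target column y.

    Hypothetical stand-in for the X / y separation; values stay strings.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    y = [row[target] for row in rows]
    X = [{name: value for name, value in row.items() if name != target}
         for row in rows]
    return X, y


sample = "age,income,label\n31,low,no\n52,high,yes\n"
X, y = split_xy(sample, target="label")
print(X)
print(y)
```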
fit(X=None, y=None, data=None, targetName=None)
Fits the model to the training data provided. The two possible uses of this function are fit(X,y) and fit(data=…,targetName=…). Any other combination will raise a ValueError.
- Parameters:
- X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data. Warning: Raises ValueError if either data or targetName is not None. Raises ValueError if y is None.
- y (array-like of shape (n_samples,)) – Target values. Warning: Raises ValueError if either data or targetName is not None. Raises ValueError if X is None.
- data (Union[str, pandas.DataFrame]) – The source of training data: csv filename or pandas.DataFrame. targetName is mandatory to find the class in this source.
- targetName (str) – Specifies the name of the target variable in the csv file. Warning: Raises ValueError if either X or y is not None. Raises ValueError if data is None.
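The argument rules above amount to accepting exactly two combinations. A minimal sketch of that check, assuming nothing about how pyagrum implements it (`check_fit_arguments` is a hypothetical name):

```python
def check_fit_arguments(X=None, y=None, data=None, targetName=None):
    """Return which documented call form is used, or raise ValueError.

    Hypothetical helper mirroring the documented rules for fit().
    """
    has_xy = X is not None and y is not None
    has_data = data is not None and targetName is not None
    if has_xy and data is None and targetName is None:
        return "fit(X, y)"
    if has_data and X is None and y is None:
        return "fit(data=..., targetName=...)"
    raise ValueError("use either fit(X, y) or fit(data=..., targetName=...)")


print(check_fit_arguments(X=[[0, 1]], y=[1]))
print(check_fit_arguments(data="train.csv", targetName="y"))
```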
fromTrainedModel(bn, targetAttribute, targetModality='', copy=False, threshold=0.5, variableList=None)
Creates a BN classifier from an already trained pyAgrum Bayesian network.
- Parameters:
- bn (pyagrum.BayesNet) – The Bayesian network we want to use for this classifier.
- targetAttribute (str) – The attribute that will be the target in this classifier.
- targetModality (str) – If this is a binary classifier, specifies which modality we are looking at when the target attribute has more than 2 possible values. If != "", a binary classifier is created. If == "", a classifier is created that can be non-binary, depending on the number of modalities for targetAttribute. If binary, the second one is taken as targetModality.
- copy (bool) – Indicates whether we want to put a copy of bn in the classifier, or bn itself.
- threshold (double) – The classification threshold. If the probability of the target modality is larger than this threshold, we predict that modality.
- variableList (list(str)) – A list of strings. variableList[i] is the name of the variable that has the index i. We use this information when calling predict to know which column corresponds to which variable. If this list is None, we use the order in which the variables were added to the network.
- Returns: void
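Two of the parameters above, threshold and variableList, describe simple rules that can be sketched in plain Python. Both helpers are hypothetical and only restate the documented behaviour: a strict comparison against the threshold, and a positional mapping from columns to variable names.

```python
def predicts_target_modality(p_target, threshold=0.5):
    """True when the target modality's probability is larger than the threshold."""
    return p_target > threshold


def row_to_named_values(row, variableList):
    """Map column i of a data row to the variable named variableList[i]."""
    return dict(zip(variableList, row))


print(predicts_target_modality(0.7))
print(predicts_target_modality(0.5))  # not strictly larger than 0.5
print(row_to_named_values([3, "red"], ["x0", "x1"]))
```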
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns: routing – A MetadataRequest encapsulating routing information.
- Return type: MetadataRequest
get_params(deep=True)
Get parameters for this estimator.
- Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns: params – Parameter names mapped to their values.
- Return type: dict
predict(X, with_labels=True)
Predicts the most likely class for each row of input data, using the Markov blanket of the target in the BN.
- Parameters:
- X (str or {array-like, sparse matrix} of shape (n_samples, n_features)) – Test data, which can be a DataFrame, a matrix or the name of a csv file.
- with_labels (bool) – tells us whether the csv includes the labels themselves or their indexes.
- Returns: y – Predicted classes
- Return type: array-like of shape (n_samples,)
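For readers unfamiliar with the term, the Markov blanket of a node is made of its parents, its children and the other parents of its children. The sketch below computes it on a DAG given as a plain dictionary; it is a generic illustration with made-up node names, not the pyagrum implementation.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a DAG.

    `parents` maps each node name to the set of its parents.
    """
    blanket = set(parents.get(target, ()))  # parents of the target
    for node, node_parents in parents.items():
        if target in node_parents:
            blanket.add(node)             # child of the target
            blanket.update(node_parents)  # co-parents of that child
    blanket.discard(target)
    return blanket


# Hypothetical network: a -> y, y -> c, b -> c, c -> d
dag = {"a": set(), "b": set(), "y": {"a"}, "c": {"y", "b"}, "d": {"c"}}
print(sorted(markov_blanket(dag, "y")))
```

Here d is outside the blanket of y: once a, b and c are known, d carries no further information about y.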
predict_proba(X)
Predicts the probability of classes for each row of input data, using the Markov blanket of the target in the BN.
- Parameters: X (str or {array-like, sparse matrix} of shape (n_samples, n_features)) – Test data, which can be a DataFrame, a matrix or the name of a csv file.
- Returns: Predicted probability for each class
- Return type: array-like of shape (n_samples,)
preparedData(X=None, y=None, data=None)
Given an X and a y (or a data source: filename or pandas.DataFrame), returns a pandas.DataFrame with the prepared (especially discretized) values of the database.
- Parameters:
- X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data. Warning: Raises ValueError if data is not None. Raises ValueError if y is None.
- y (array-like of shape (n_samples,)) – Target values. Warning: Raises ValueError if data is not None. Raises ValueError if X is None.
- data (Union[str, pandas.DataFrame]) – Specifies the csv file or the DataFrame where the data values are located. Warning: Raises ValueError if either X or y is not None.
- Returns: pandas.DataFrame
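To give an idea of what "discretized" means here, the following sketch performs a generic quantile discretization with the standard library. It is an assumption-laden illustration, not the algorithm used by DiscreteTypeProcessor; `quantile_bins` is a hypothetical name.

```python
import bisect
import statistics


def quantile_bins(values, n_bins=5):
    """Replace each value by the index of its quantile bin (0 .. n_bins - 1)."""
    # statistics.quantiles returns the n_bins - 1 cut points between bins
    cuts = statistics.quantiles(values, n=n_bins)
    return [bisect.bisect_right(cuts, v) for v in values]


values = list(range(1, 101))
bins = quantile_bins(values, n_bins=5)
print(bins[:3], bins[-3:])
```

With evenly spread values, each of the five bins receives the same number of samples, which is the point of the quantile strategy.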
score(X, y, sample_weight=None)
Return accuracy on provided data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters:
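- X (array-like of shape (n_samples, n_features)) – Test samples.
- y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
- sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns: score – Mean accuracy of self.predict(X) w.r.t. y.
- Return type: float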
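Mean accuracy is the fraction of samples whose predicted label equals the true label. A minimal unweighted sketch (the sample_weight case is left out, and `accuracy` is a hypothetical helper):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))
```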
set_fit_request(*, data: bool | None | str = '$UNCHANGED$', targetName: bool | None | str = '$UNCHANGED$') → BNClassifier
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
- Parameters:
- data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in fit.
- targetName (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for targetName parameter in fit.
- Returns: self – The updated object.
- Return type: object
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
- Parameters: **params (dict) – Estimator parameters.
- Returns: self – Estimator instance.
- Return type: estimator instance
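The `<component>__<parameter>` naming can be illustrated by splitting keys on the first double underscore. This is a simplified sketch of the convention, not scikit-learn's code; the component name `clf` is made up.

```python
def split_nested_params(params):
    """Separate an estimator's own parameters from nested ones.

    Keys of the form '<component>__<parameter>' are grouped per component.
    """
    own, nested = {}, {}
    for key, value in params.items():
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params(
    {"usePR": True, "clf__prior": "BDeu", "clf__priorWeight": 2}
)
print(own)
print(nested)
```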
set_predict_request(*, with_labels: bool | None | str = '$UNCHANGED$') → BNClassifier
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
- Parameters:
- with_labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for with_labels parameter in predict.
- Returns: self – The updated object.
- Return type: object
set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → BNClassifier
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
- Parameters:
- sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns: self – The updated object.
- Return type: object
showROC_PR(data, *, beta=1, save_fig=False, show_progress=False, bgcolor=None)
Use the pyagrum.lib.bn2roc tools to create ROC and Precision-Recall curves.
- Parameters:
- data (str | dataframe) – a csv filename or a DataFrame
- beta (float) – the value of beta for the F-beta score
- save_fig (bool) – whether the graph should be saved
- show_progress (bool) – indicates if the resulting curve must be printed
- bgcolor (str) – HTML background color for the figure (default: None if transparent)
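The beta parameter refers to the F-beta score, the weighted harmonic mean of precision and recall in which recall counts beta times as much as precision. A minimal sketch of the standard formula (the helper name `f_beta` is hypothetical, and this is not the bn2roc code):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # convention used here to avoid a division by zero
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


print(f_beta(0.5, 0.5))          # beta=1 gives the usual F1 score
print(f_beta(1.0, 0.5, beta=2))  # beta=2 puts more weight on recall
```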