Learning a CLG

One of the main features of this library is the possibility to learn a CLG.

More precisely, what can be learned is:
- The dependency graph of a CLG
- The parameters of a CLG: the mu and sigma of each variable, the coefficients of the arcs
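To make these parameters concrete, here is a minimal sketch that samples from a two-node linear Gaussian model using only the Python standard library. It does not use the library's API. It assumes that each variable equals its mu, plus the coefficient-weighted values of its parents, plus zero-mean Gaussian noise of standard deviation sigma. The names `mu`, `sigma`, `arc_coef` and `sample_once` are illustrative only.

```python
import random

# Hypothetical two-node model X -> Y (values chosen for illustration only).
mu = {"X": 1.0, "Y": 2.0}        # constant term of each variable
sigma = {"X": 0.5, "Y": 0.1}     # standard deviation of each variable's noise
arc_coef = {("X", "Y"): 3.0}     # coefficient carried by the arc X -> Y


def sample_once(rng):
    """Draw one joint sample, visiting the nodes in topological order."""
    x = rng.gauss(mu["X"], sigma["X"])
    # Y is linear in its parent X, plus Gaussian noise.
    y = rng.gauss(mu["Y"] + arc_coef[("X", "Y")] * x, sigma["Y"])
    return x, y


rng = random.Random(0)
samples = [sample_once(rng) for _ in range(20000)]
mean_y = sum(y for _, y in samples) / len(samples)
# Under the assumption above, the theoretical mean of Y is 2.0 + 3.0 * 1.0 = 5.0.
```

Learning goes in the opposite direction: given only such samples, recover the arc and the three groups of numbers.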

Learning the graph

To learn the graph of a CLG (i.e. the dependencies between variables), we use a modified PC algorithm based on the work of Diego Colombo, Marloes H. Maathuis: Order-Independent Constraint-Based Causal Structure Learning (2014).

The independence test used is based on the work of Dario Simionato, Fabio Vandin: Bounding the Family-Wise Error Rate in Local Causal Discovery using Rademacher Averages (2022).

class pyagrum.clg.learning.CLGLearner(filename, *, n_sample=15, fwer_delta=0.05)

Uses Rademacher averages to guarantee the FWER (Family-Wise Error Rate) in independence tests (see “Bounding the Family-Wise Error Rate in Local Causal Discovery using Rademacher Averages”, Dario Simionato, Fabio Vandin, 2022).

Parameters:
- filename (str)
- n_sample (int)
- fwer_delta (float)

Adjacency_search(order, verbose=False)

This function performs the first step of the PC algorithm, the adjacency search, using indep_test() as the independence test.

Parameters:
- order (List[NodeId]) – A particular order of the nodes.
- verbose (bool) – Whether to print the process of Adjacency Search.
Returns:
- C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.
- sepset (Dict[Tuple[NodeId, NodeId], Set[NodeId]]) – The separation sets (used in Steps 2 and 3 of the PC algorithm).
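The following is a minimal sketch of an order-independent adjacency search in the spirit of Colombo and Maathuis, not the library's implementation. The function name `adjacency_search`, the callback `indep(x, y, z)` and the toy oracle are all assumptions made for illustration. Neighbourhoods are frozen at each level so that removing an edge does not influence the other tests of the same level.

```python
from itertools import combinations


def adjacency_search(nodes, indep):
    """Return (skeleton, sepset) for the given nodes.

    `indep(x, y, z)` is a hypothetical callback returning True when x and y
    are judged independent given the set z.
    """
    # Start from the complete undirected graph.
    adj = {x: set(nodes) - {x} for x in nodes}
    sepset = {}
    level = 0
    # Continue while some node has enough neighbours for a set of this size.
    while any(len(adj[x]) - 1 >= level for x in nodes):
        # Freeze neighbourhoods so the result does not depend on node order.
        frozen = {x: set(adj[x]) for x in nodes}
        for x in nodes:
            for y in sorted(frozen[x]):
                if y not in adj[x]:
                    continue  # edge already removed at this level
                for z in combinations(sorted(frozen[x] - {y}), level):
                    if indep(x, y, set(z)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[(x, y)] = sepset[(y, x)] = set(z)
                        break
        level += 1
    return adj, sepset


# Toy oracle for the chain A - B - C: A and C are independent given B only.
def chain_oracle(x, y, z):
    return {x, y} == {"A", "C"} and "B" in z


skeleton, sepset = adjacency_search(["A", "B", "C"], chain_oracle)
```

On this toy input the edge between A and C is removed at level 1 and B is recorded as its separating set.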

PC_algorithm(order, verbose=False)

This function is an advanced version of the PC algorithm. It uses Indep_test_Rademacher() in place of indep_test(), and it orients the undirected edges remaining in the skeleton C by comparing the variances of the two nodes.

Parameters:
- order (List[NodeId]) – A particular order of the nodes.
- verbose (bool) – Whether to print the process of the PC algorithm.
Returns: C – A directed acyclic graph (DAG) representing the causal structure.
Return type: Dict[NodeId, Set[NodeId]]

Pearson_coeff(X, Y, Z)

Estimate Pearson’s linear correlation (using linear regression when Z is not empty).

Parameters:
- X – The id of the first variable tested.
- Y – The id of the second variable tested.
- Z – The set of ids of the conditioning variables.
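One common way to obtain a correlation "given Z" through linear regression is to regress X and Y on Z and correlate the residuals. The sketch below illustrates that idea with a single conditioning variable and the standard library only. Whether the library proceeds exactly this way is an assumption, and the helper names `pearson` and `residuals` are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation between two equally long samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def residuals(target, z):
    """Residuals of the least-squares regression of `target` on a single z."""
    n = len(z)
    mz, mt = sum(z) / n, sum(target) / n
    slope = sum((u - mz) * (t - mt) for u, t in zip(z, target)) / sum(
        (u - mz) ** 2 for u in z
    )
    return [t - (mt + slope * (u - mz)) for u, t in zip(z, target)]


# Synthetic data: X and Y both depend on Z, but not on each other.
rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [2 * v + rng.gauss(0, 1) for v in z]
y = [-3 * v + rng.gauss(0, 1) for v in z]

r_marginal = pearson(x, y)                              # strongly negative
r_given_z = pearson(residuals(x, z), residuals(y, z))   # close to zero
```

The marginal correlation is large because of the common cause Z, while the residual correlation is close to zero, which is the pattern an independence test then has to assess.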

RAveL_MB(T)

Find the Markov Boundary of variable T with FWER lower than Delta.

Parameters: T (NodeId) – The id of the target variable T.
Returns: MB – The Markov Boundary of variable T with FWER lower than Delta.
Return type: Set[NodeId]

RAveL_PC(T)

Find the Parent-Children of variable T with FWER lower than Delta.

Parameters: T (NodeId) – The id of the target variable T.
Returns: The Parent-Children of variable T with FWER lower than Delta.
Return type: Set[NodeId]

Repeat_II(order, C, l, verbose=False)

This function is the second part of Step 1 of the PC algorithm.

Parameters:
- order (List[NodeId]) – The order of the variables.
- C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.
- l (int) – The size of the sepset.
- verbose (bool) – Whether to print.
Returns: found_edge – True if a new edge is found, False if not.
Return type: bool

Step4(C, verbose=False)

This function is the fourth step of the PC algorithm. It orients the remaining undirected edges by comparing the variances of the two nodes.

Parameters:
- C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.
- verbose (bool) – Whether to print the process of Step4.
Returns:
- C (Dict[NodeId, Set[NodeId]]) – The final skeleton (of Step4).
- new_oriented (bool) – Whether there is a new edge oriented in the fourth step.
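As a rough illustration of orienting an edge by comparing variances, the sketch below directs each undirected edge from the lower-variance node to the higher-variance one. This direction convention is an assumption of the sketch, not something stated in this documentation, and `orient_by_variance` is a hypothetical helper.

```python
def orient_by_variance(undirected_edges, variance):
    """Orient each undirected edge from the lower-variance node to the higher one.

    The direction convention is an assumption made for this sketch.
    """
    directed = set()
    for x, y in undirected_edges:
        if variance[x] <= variance[y]:
            directed.add((x, y))   # x becomes the parent of y
        else:
            directed.add((y, x))   # y becomes the parent of x
    return directed


# Illustrative empirical variances for three nodes.
variance = {"A": 1.0, "B": 2.5, "C": 4.0}
arcs = orient_by_variance([("B", "A"), ("B", "C")], variance)
```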

estimate_parameters(C)

This function is used to estimate the parameters of the CLG model.

Parameters: C (Dict[NodeId, Set[NodeId]]) – A directed acyclic graph (DAG) representing the causal structure.
Returns:
- id2mu (Dict[NodeId, float]) – The estimated mean of each node.
- id2sigma (Dict[NodeId, float]) – The estimated variance of each node.
- arc2coef (Dict[Tuple[NodeId, NodeId], float]) – The estimated coefficients of each arc.

fitParameters(clg)

In this function, we fit the parameters of the CLG model.

Parameters: clg (CLG) – The CLG model whose parameters are to be updated.

static generate_XYZ(l)

Find all the possible combinations of X, Y and Z.

Returns: All the possible combinations of X, Y and Z.
Return type: List[Tuple[Set[NodeId], Set[NodeId]]]

static generate_subsets(S)

Generator that iterates over all the subsets of S (from the smallest to the biggest).

Parameters: S (Set[NodeId]) – The set of variables.
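A generator with this behaviour can be sketched with `itertools.combinations`. The function name `subsets_by_size` is hypothetical and the code is not the library's implementation. Elements are sorted first only to make the enumeration order reproducible.

```python
from itertools import combinations


def subsets_by_size(s):
    """Yield every subset of `s`, from the empty set up to `s` itself."""
    items = sorted(s)  # fixed order so the output is reproducible
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


result = list(subsets_by_size({1, 2, 3}))
```

For a set of three elements this yields 8 subsets, starting with the empty set and ending with the full set.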

id2samples : Dict[NodeId, List]

learnCLG()

First use PC algorithm to learn the skeleton of the CLG model. Then estimate the parameters of the CLG model. Finally create a CLG model and return it.

Returns: learned_clg – The learned CLG model.
Return type: CLG

r_XYZ : Dict[Tuple[FrozenSet[NodeId], FrozenSet[NodeId]], List[float]]

sepset : Dict[Tuple[NodeId, NodeId], Set[NodeId]]

supremum_deviation(n_sample, fwer_delta)

Use n-MCERA to get supremum deviation.

Parameters:
- n_sample (int) – The MC number n in n-MCERA.
- fwer_delta (float ∈ (0, 1]) – Threshold.
Returns: SD – The supremum deviation.
Return type: float
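The n-MCERA (n-trial Monte Carlo empirical Rademacher average) averages, over n independent draws of random signs, the supremum over a function family of the sign-weighted sample mean. The sketch below computes only that quantity for a small finite family; it does not reproduce the bound that turns it into a supremum deviation. The function name `n_mcera` and the toy family are assumptions.

```python
import random


def n_mcera(function_values, n_sample, rng):
    """Monte Carlo empirical Rademacher average of a finite function family.

    `function_values[k]` holds the values of the k-th function on the
    m sample points.
    """
    m = len(function_values[0])
    total = 0.0
    for _ in range(n_sample):
        # One vector of Rademacher signs, shared by all functions in this trial.
        signs = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(
            sum(s * v for s, v in zip(signs, values)) / m
            for values in function_values
        )
    return total / n_sample


rng = random.Random(0)
# Hypothetical family: one function evaluated on m = 200 points, and its negation.
f = [rng.uniform(-1, 1) for _ in range(200)]
family = [f, [-v for v in f]]
estimate = n_mcera(family, n_sample=15, rng=rng)
```

Because the family contains a function and its negation, each trial contributes a non-negative value, so the estimate is non-negative.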

test_indep(X, Y, Z)

Perform a standard statistical test and use Bonferroni correction to correct for multiple hypothesis testing.

Parameters:
- X (NodeId) – The id of the first variable tested.
- Y (NodeId) – The id of the second variable tested.
- Z (Set[NodeId]) – The set of ids of the conditioning variables.
Returns: True if X and Y are independent given Z, False otherwise.
Return type: bool
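A Bonferroni correction divides the significance level by the number of hypotheses tested. The sketch below pairs it with a Fisher z-test on a (partial) correlation. The choice of Fisher's z as the "standard statistical test" is an assumption, and the names `fisher_z_pvalue` and `is_independent` are hypothetical.

```python
import math


def fisher_z_pvalue(r, n, k):
    """Two-sided p-value for a (partial) correlation r.

    n is the number of samples, k the number of conditioning variables.
    """
    stat = math.sqrt(n - k - 3) * abs(math.atanh(r))
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(stat / math.sqrt(2))


def is_independent(r, n, k, alpha, n_tests):
    """Keep the independence hypothesis unless the p-value falls below alpha / n_tests."""
    return fisher_z_pvalue(r, n, k) >= alpha / n_tests


# The same weak correlation is rejected as a single test,
# but not once the threshold is corrected for 100 tests.
single = is_independent(0.08, 1000, 1, alpha=0.05, n_tests=1)
corrected = is_independent(0.08, 1000, 1, alpha=0.05, n_tests=100)
```

The correction makes each individual test more conservative, which is how the overall probability of at least one false rejection is kept under control.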

three_rules(C, verbose=False)

This function is the third step of the PC algorithm. It orients as many of the remaining undirected edges as possible by repeated application of the three rules.

Parameters:
- C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.
- verbose (bool) – Whether to print the process of this function.
Returns: C – The final skeleton (of Step3).
Return type: Dict[NodeId, Set[NodeId]]
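The orientation rules commonly used at this stage of PC are Meek's rules. As a hedged illustration, the sketch below applies only the first two of them until nothing changes. It uses its own graph representation (a set of arcs and a set of undirected edges) rather than the library's `C` dictionary, and the name `apply_rules` is hypothetical.

```python
def apply_rules(directed, undirected):
    """Repeatedly apply two classic orientation rules until nothing changes.

    directed   : set of (parent, child) arcs
    undirected : set of frozenset({x, y}) edges
    """
    directed, undirected = set(directed), set(undirected)

    def adjacent(x, y):
        return (
            (x, y) in directed
            or (y, x) in directed
            or frozenset((x, y)) in undirected
        )

    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            x, y = tuple(edge)
            for b, c in ((x, y), (y, x)):
                parents_b = {a for a, child in directed if child == b}
                # Rule 1: a -> b, b - c, a and c not adjacent  =>  b -> c
                rule1 = any(a != c and not adjacent(a, c) for a in parents_b)
                # Rule 2: b -> k -> c and b - c  =>  b -> c
                rule2 = any(
                    (b, k) in directed and (k, c) in directed
                    for k in {child for _, child in directed}
                )
                if rule1 or rule2:
                    undirected.discard(edge)
                    directed.add((b, c))
                    changed = True
                    break
    return directed, undirected


arcs, edges = apply_rules({("A", "B")}, {frozenset(("B", "C"))})
```

Here A and C are not adjacent, so the first rule orients the edge between B and C as B -> C; orienting it the other way would create a new v-structure at B.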