Paper deep dive
On the Number of Conditional Independence Tests in Constraint-based Causal Discovery
Marc Franquesa Monés, Jiaqi Zhang, Caroline Uhler
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/26/2026, 2:34:43 AM
Summary
The paper introduces the Greedy Ancestral Search (GAS) algorithm for constraint-based causal discovery. It establishes that the essential graph can be recovered using p^O(s) conditional independence (CI) tests, where s is the maximum undirected clique size of the essential graph. The authors prove this is exponent-optimal by establishing a lower bound of 2^Ω(s) CI tests for any constraint-based algorithm, and demonstrate that GAS performs fewer CI tests than existing methods such as the PC algorithm, whose worst-case test count scales exponentially with the maximum degree (d).
Entities (5)
Relation Signals (3)
Greedy Ancestral Search → achievescomplexity → p^O(s)
confidence 100% · GAS (i.e., Algorithm 1) outputs ℰ(𝒢) using p^O(s) CI tests.
Constraint-based algorithms → requiresminimum → 2^Ω(s)
confidence 100% · any algorithm requires at least 2^Ω(s) CI tests to verify 𝒢∈[𝒢].
Greedy Ancestral Search → outperforms → PC Algorithm
confidence 90% · demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.
Cypher Suggestions (2)
Find algorithms and their associated complexity bounds · confidence 85% · unvalidated
MATCH (a:Algorithm)-[:HAS_COMPLEXITY]->(c:ComplexityBound) RETURN a.name, c.value
Map relationships between algorithms and the graph types they operate on · confidence 80% · unvalidated
MATCH (a:Algorithm)-[:OPERATES_ON]->(g:GraphType) RETURN a.name, g.name
Abstract
Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests. However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of $p^{\mathcal{O}(s)}$ tests, where $p$ is the number of nodes in the graph and $s$ denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least $2^{\Omega(s)}$ conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.
Tags
Links
- Source: https://arxiv.org/abs/2603.21844v1
- Canonical: https://arxiv.org/abs/2603.21844v1
Full Text
111,040 characters extracted from source content.
On the Number of Conditional Independence Tests in Constraint-based Causal Discovery. Marc Franquesa Monés (Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard; Centre de Formació Interdisciplinària Superior, Universitat Politècnica de Catalunya; equal contribution), Jiaqi Zhang (Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard; Laboratory for Information and Decision Systems, Massachusetts Institute of Technology; equal contribution), Caroline Uhler (Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard; Laboratory for Information and Decision Systems, Massachusetts Institute of Technology). Abstract: Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests. However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of $p^{O(s)}$ tests, where $p$ is the number of nodes in the graph and $s$ denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least $2^{\Omega(s)}$ conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.
1 Introduction Inferring causal relationships between variables is a fundamental challenge across many scientific domains (Wright, 1921; Haavelmo, 1943; Duncan, 1975). Directed acyclic graphs (DAGs) are a popular framework for representing causal structures, where nodes correspond to random variables and directed edges denote direct causal effects. The objective of causal structure discovery is to identify these edges and their orientations from data on the nodes. With observational data and no additional assumptions, it is known that the underlying causal DAG is only identifiable up to its Markov equivalence class (Verma and Pearl, 1990), which can be represented by a partially directed graph known as the essential graph. Constraint-based causal structure discovery methods use conditional independence (CI) tests to infer the underlying essential graph, the most prominent example being the PC algorithm (Spirtes et al., 2001): (1) it begins with a fully connected undirected graph and iteratively removes edges between variables if they are found to be conditionally independent given some conditioning set; (2) edge orientations are inferred based on cases where variables that are conditionally independent become dependent when conditioned on a common child—corresponding to v-structures in the underlying DAG; (3) possibly additional orientations are inferred using a set of logical rules known as the Meek rules (Meek, 1995). Building on this approach, many constraint-based algorithms have been introduced to address more general and diverse settings, e.g., including in the presence of latent variables (Verma and Pearl, 1990; Spirtes et al., 1989, 2001; Colombo et al., 2012; Colombo and Maathuis, 2014; Sondhi and Shojaie, 2019).
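Step (1) of the PC algorithm described above can be sketched in a few lines. The sketch below uses a hypothetical CI oracle (a lookup table of the d-separations of the chain X → Y → Z; a real implementation would run a statistical test), and, for brevity, draws conditioning sets from all remaining nodes rather than from current neighborhoods as PC proper does:

```python
from itertools import combinations

# Hypothetical CI oracle: the chain X -> Y -> Z has exactly one
# conditional independence, X ⊥ Z | {Y}.
INDEPENDENCIES = {(frozenset({"X", "Z"}), frozenset({"Y"}))}

def ci(u, v, cond):
    """Return True iff u and v are conditionally independent given cond."""
    return (frozenset({u, v}), frozenset(cond)) in INDEPENDENCIES

def pc_skeleton(nodes):
    """Simplified PC adjacency search: start from the complete graph and
    delete an edge u - v as soon as some conditioning set of the current
    level renders u and v conditionally independent."""
    adj = {frozenset(e) for e in combinations(nodes, 2)}
    for level in range(len(nodes) - 1):        # size of conditioning set
        for e in list(adj):
            u, v = tuple(e)
            others = [w for w in nodes if w not in e]
            if any(ci(u, v, c) for c in combinations(others, level)):
                adj.discard(e)
    return adj
```

On the chain example, only the edge X − Z is removed (at conditioning-set size one), leaving the correct skeleton X − Y − Z.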
Despite significant progress in constraint-based algorithms, it has been observed that their efficiency, both in terms of computational speed and correctness, tends to degrade when scaling to larger or denser graphs (Spirtes et al., 2001; Kalisch and Bühlman, 2007). The main bottleneck of the PC algorithm (and other algorithms of similar nature) arises from the adjacency search (step (1) described above), which requires a large number of CI tests, exponential in the maximum degree of the underlying DAG in the worst case (Spirtes et al., 2001; Claassen et al., 2013). Performing a large number of CI tests not only exacerbates the computational burden but may also impact accuracy, since in the finite sample regime guaranteeing the correctness of each CI test requires strong assumptions (Uhler et al., 2013; Teh et al., 2024; Mazaheri et al., 2025). This motivates the development of causal structure discovery algorithms that minimize the number of CI tests that need to be performed. … Figure 1: A graph with $p$ nodes, where $d = p - 2$ and $s = 2$. In this work, we present an algorithm, called Greedy Ancestral Search (GAS), that is exponent-optimal up to a logarithmic factor in terms of the number of CI tests performed. GAS requires at most $p^{O(s)}$ CI tests,[1] where $p$ is the number of nodes in the DAG and $s$ denotes the size of the maximum undirected clique in the underlying essential graph. [Footnote 1: Here, we use the asymptotic notation $O$ (cf. Arora and Barak, 2009) in the following way: a function $f$ mapping DAGs to $\mathbb{R}$ is $p^{O(s)}$ if and only if there exists a fixed constant $c \in \mathbb{R}$ such that for any DAG $G$ it holds that $f(G) \le p^{c \cdot s}$, where $p$ and $s$ are the number of nodes and the maximum undirected clique size of $G$, respectively. We define $2^{\Omega(s)}$ and $p^{O(d)}$ in a similar way.]
The best upper bound for previous constraint-based algorithms (Richardson, 1996; Spirtes et al., 2001; Claassen et al., 2013; Colombo and Maathuis, 2014) was $p^{O(d)}$, where $d$ is the graph's maximum degree. Considering the degree of a node in the maximum undirected clique of the essential graph, note that $s \le d + 1$. In fact, $s$ is often much smaller than $d$; see, for example, the DAG in Figure 1; thus, $p^{O(s)}$ is often much smaller than $p^{O(d)}$. Complementing this upper bound, we prove that any constraint-based algorithm must perform $2^{\Omega(s)}$ CI tests, utilizing a technique introduced in Zhang et al. (2024), thereby establishing that GAS achieves exponent-optimality up to a logarithmic factor in terms of the number of CI tests needed. Since $d$ can be of size $\Omega(p \cdot s)$ (e.g., as in the graph in Figure 1), the previous constraint-based algorithms do not achieve exponent-optimality up to a logarithmic factor. Finally, we empirically benchmark our algorithm against established methods on both synthetic and real-world data, assessing both computational efficiency and accuracy. To provide some insight on how the reduction in the number of CI tests is achieved, consider the PC algorithm. The adjacency search in step (1) of the PC algorithm is where the CI tests are performed. To reduce the number of CI tests compared to the PC algorithm, GAS integrates steps (1) and (2) of the PC algorithm; namely, GAS focuses on using CI tests to learn ancestral relationships, which can then be used to perform CI tests to uncover adjacencies in a more targeted way. In particular, note that if all ancestral relationships were known, then a single CI test would be sufficient to determine the presence/absence of an edge.[2] [Footnote 2: All ancestral relationships being known is equivalent to being given a permutation $\pi = (\pi_1, \ldots, \pi_p)$ of the nodes that is consistent with the DAG (i.e., if there is a directed edge $\pi_i \to \pi_j \in E$, then it has to hold that $i < j$).]
In this case a single CI test is sufficient to determine the presence/absence of an edge: namely, $\pi_i \to \pi_j$ for $i < j$ if and only if $X_{\pi_i}$ is not independent of $X_{\pi_j}$ given $X_S$, where $S = \{\pi_1, \pi_2, \ldots, \pi_{j-1}\} \setminus \{\pi_i\}$. As a consequence, the main difficulty of causal structure discovery is to learn the ancestral relations, which motivates our approach of concentrating on CI tests that provide ancestral information. Recent work (Shiragur et al., 2024) has shown that specific ancestral relationships can be learned using a polynomial number of CI tests. We show that, with appropriate modifications, this approach can be extended to learn all the adjacencies with substantially fewer CI tests than other methods. 1.1 Related works Learning causal structures from observational data has been extensively studied, with current methods primarily falling into three categories (Kitson et al., 2023): constraint-based approaches (Verma and Pearl, 1990; Spirtes et al., 2001; Kalisch and Bühlman, 2007), which utilize CI tests as constraints to learn the DAG; score-based approaches (Chickering, 2002; Geiger and Heckerman, 2002; Brenner and Sontag, 2013), which search for DAGs that maximize a score function evaluating how well the DAG represents the data; and hybrid approaches (Schulte et al., 2010; Alonso-Barba et al., 2013; Nandy et al., 2018; Solus et al., 2021), which combine both. For constraint-based algorithms, the focus of this work, a major challenge is their CI test complexity. Algorithms such as PC have a CI test complexity that is exponential in the graph's maximum degree (Spirtes et al., 2001; Claassen et al., 2013). While improvements exist that depend on other parameters like the maximum in-degree (Mokhtarian et al., 2025) or the maximum separating set size (Sondhi and Shojaie, 2019), they face similar scalability issues or require additional assumptions. To address this complexity, several research directions have emerged.
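The single-test rule above (one CI query per ordered pair, given a consistent permutation) can be sketched as follows. The CI oracle is again a hypothetical lookup table, here encoding the collider A → C ← B, whose only independence is A ⊥ B given the empty set:

```python
# Hypothetical CI oracle for the collider A -> C <- B.
INDEPENDENCIES = {(frozenset({"A", "B"}), frozenset())}

def ci(u, v, cond):
    """Return True iff u and v are conditionally independent given cond."""
    return (frozenset({u, v}), frozenset(cond)) in INDEPENDENCIES

def edges_from_order(order):
    """Given a permutation consistent with the DAG, decide each edge
    pi_i -> pi_j (i < j) with the single test
    X_{pi_i} ⊥ X_{pi_j} | S,  S = {pi_1, ..., pi_{j-1}} \\ {pi_i}."""
    edges = set()
    for j, vj in enumerate(order):
        for i in range(j):
            vi = order[i]
            S = set(order[:j]) - {vi}
            if not ci(vi, vj, S):   # dependence given S <=> edge present
                edges.add((vi, vj))
    return edges
```

For the consistent order (A, B, C) this recovers exactly the collider's two edges, A → C and B → C, using one test per pair.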
Some methods leverage minimal conditional (in)dependencies to obtain ancestral structures (Claassen et al., 2013; Magliacane et al., 2016). Other recent work shifts the focus from learning the entire graph to characterizing which parts of it can be learned under a constrained number of CI tests (Shiragur et al., 2024) or with small conditioning sets (Kocaoglu, 2023). Alternative formulations, such as testing a hypothesized DAG, have also been explored (Zhang et al., 2024). While valuable, these approaches either change the problem's scope or do not aim to recover the complete essential graph. In this work, by contrast, we recover the full essential graph with fewer CI tests. While discrete-variable Bayesian network learning is known to be NP-hard (Chickering et al., 2004), causal discovery with CI oracles constitutes a distinct problem class that is not NP-hard under bounded degree assumptions (Claassen et al., 2013). Under the causal sufficiency assumption, our derived bounds guarantee that the essential graph can be recovered with a polynomial number of tests provided the size of the undirected cliques is bounded. This condition covers the case of bounded degree. Furthermore, for the general case of unbounded clique sizes, our lower bound suggests that constraint-based discovery with CI oracles is not in NP, since verifying a specific Markov equivalence class requires an exponential number of queries, but remains within EXP, as the number of all possible DAGs is bounded by $2^{\mathrm{poly}(p)}$. Beyond computational efficiency, the correctness of any causal discovery algorithm depends on statistical assumptions that connect the observed data to the underlying DAG. For classic algorithms like PC, correctness in the infinite-sample limit is guaranteed by the Markov and faithfulness assumptions (Lauritzen, 1996; Spirtes et al., 2001).
Recent work provides a framework for characterizing the precise correctness conditions for any given constraint-based algorithm, enabling a formal comparison of the required assumptions (Sadeghi, 2017; Teh et al., 2024). A related line of work explores how additional or redundant CI tests can be used to detect and correct errors, thereby increasing robustness (Faller and Janzing, 2025). These approaches highlight a fundamental goal: to precisely characterize the necessary and sufficient CI information required to recover the correct causal structure. In this work, we contribute to both fronts: we determine the number of CI tests required to recover the essential graph, and we show that our method requires strictly weaker correctness assumptions than the standard combination of Markov and faithfulness. Our empirical results further show that our algorithm achieves higher accuracy than existing methods in denser graphs as well as on synthetic gene expression data. 1.2 Organization We begin by reviewing essential background and notation in Section 2. Our main theoretical results are summarized in Section 3. The proposed algorithm, along with its correctness guarantees and complexity, is presented in Section 4. In Section 5, we establish a lower bound in terms of the number of CI tests, proving the optimality of our algorithm. Empirical validation is provided in Section 6. We conclude with a brief discussion in Section 7. 2 Preliminaries 2.1 Graph definitions Let $G$ be a graph consisting of a finite set of $p$ nodes $V$ and a set of edges $E$. For any two nodes $u, v \in V$, we write $u \sim v$ if they are adjacent and $u \nsim v$ otherwise. Edges can be undirected, $u - v$, or directed, $u \to v$ or $u \leftarrow v$. For any subset of nodes $W \subseteq V$, we denote its complement by $\bar{W} = V \setminus W$ and the subgraph it induces by $G[W]$. A path in $G$ is a sequence of nodes such that consecutive nodes are adjacent. A path that starts and ends at the same node is a cycle.
A cycle is directed if all its edges are directed and follow a consistent orientation. A clique is a subset of nodes where every two nodes are adjacent. In an undirected clique the nodes are connected by undirected edges. A graph containing directed and/or undirected edges is partially directed. A partially directed graph is a chain graph if it has no directed cycle. The connected components of a chain graph formed by removing all directed edges are known as chain components. An undirected graph is chordal if every cycle of length four or more has a chord, that is, an edge connecting two non-consecutive nodes in the cycle. A fully directed chain graph is called a directed acyclic graph (DAG). The skeleton of a graph, denoted by $\mathrm{sk}(G)$, is the undirected graph that results when all directed edges are replaced by undirected edges. In directed graphs, for any node $v \in V$ we denote its sets of parents, ancestors, children, and descendants in $G$ by $\mathrm{pa}(v)$, $\mathrm{anc}(v)$, $\mathrm{ch}(v)$, and $\mathrm{des}(v)$, respectively. We use square brackets for inclusive sets: e.g., $\mathrm{anc}[v] = \mathrm{anc}(v) \cup \{v\}$. Following Shiragur et al. (2024), a subset $S \subseteq V$ is called a prefix node set if for all $v \in S$, $\mathrm{anc}[v] \subseteq S$; in other words, a prefix node set is a subset of nodes that is closed under ancestral relations. For any subset $W \subseteq V$, we define the set of source nodes within $W$ as $\mathrm{src}(W) = \{v \in W \mid \mathrm{anc}(v) \cap W = \emptyset\}$. When necessary to distinguish the underlying graph, we employ subscript notation, for example, $\mathrm{pa}_G$, $\mathrm{src}_G$. A node $v$ is called a collider on a path if its adjacent nodes $u$ and $w$ on the path satisfy $u \to v \leftarrow w$. If these parent nodes $u$ and $w$ are not adjacent, this configuration is called a v-structure. 2.2 Markov equivalence classes DAGs are commonly used in causality (Lauritzen, 1982; Pearl, 1985, 2009), where nodes represent random variables and edges represent direct causal relationships.
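The definitions of $\mathrm{src}(W)$ and prefix node sets translate directly into code. A small sketch over a hand-built parent map (an illustration, not code from the paper):

```python
def ancestors(pa, v):
    """Strict ancestors of v; `pa` maps each node to the set of its parents."""
    out, stack = set(), list(pa.get(v, ()))
    while stack:
        u = stack.pop()
        if u not in out:
            out.add(u)
            stack.extend(pa.get(u, ()))
    return out

def src(pa, W):
    """Source nodes of W: members of W with no ancestor inside W."""
    return {v for v in W if not (ancestors(pa, v) & set(W))}

def is_prefix_node_set(pa, S):
    """S is a prefix node set iff it is closed under ancestors."""
    return all(ancestors(pa, v) <= set(S) for v in S)
```

On the chain A → B → C, src of the whole node set is {A}, and {A, B} is a prefix node set while {B} is not (its ancestor A is missing).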
A distribution $P$ is Markov relative to a DAG $G$ if it factorizes as $P(v_1, \ldots, v_p) = \prod_{i=1}^{p} P(v_i \mid \mathrm{pa}(v_i))$. This factorization implies a set of conditional independence relations, which can be determined from $G$ using d-separation (Pearl, 1988). For disjoint node sets $A, B, C \subset V$, $A$ and $B$ are d-separated by $C$ if all paths connecting $A$ and $B$ are inactive given $C$. A path is inactive (or blocked) given $C$ if it contains at least one node $v$ satisfying one of the following conditions: (i) $v$ is a non-collider on the path and $v \in C$, or (ii) $v$ is a collider on the path and $\mathrm{des}[v] \cap C = \emptyset$. Otherwise, the path is active given $C$. We denote conditional independence in the distribution $P$ by $A \perp B \mid C$, and d-separation in the DAG $G$ by $A \perp_G B \mid C$. For singleton sets, for example $A = \{a\}$, we omit the braces for simplicity: $a \perp B \mid C$. Additionally, for potentially overlapping sets we write $A \perp B \mid C$ to denote $A \perp B \mid C \setminus (A \cup B)$. Multiple DAGs can represent the same conditional independence information (Verma and Pearl, 1990). The set of all DAGs encoding the same independencies forms a Markov equivalence class (MEC), denoted $[G]$. Markov equivalent DAGs are characterized by having the same skeleton and the same v-structures, which enables a unique representation of an MEC by its essential graph, $\mathcal{E}(G)$ (Andersson et al., 1997). The essential graph $\mathcal{E}(G)$ is a chain graph that shares the skeleton of the DAGs in $[G]$ and contains a directed edge if and only if the orientation of the edge is the same in every DAG belonging to the MEC. These invariant edge orientations stem from v-structures and orientations given by the Meek rules (Meek, 1995; Andersson et al., 1997). If a distribution $P$ is Markov with respect to a DAG $G$, then d-separation implies conditional independence; i.e., $A \perp_G B \mid C \implies A \perp B \mid C$. Under the faithfulness assumption, the reverse implication holds.
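D-separation can be checked without enumerating paths via the standard ancestral-moral-graph reduction (a well-known equivalent formulation, not the paper's algorithm): restrict to the ancestral subgraph of $A \cup B \cup C$, marry co-parents and drop edge directions, delete $C$, and test whether $A$ and $B$ are disconnected. A compact sketch:

```python
from itertools import combinations

def d_separated(pa, A, B, C):
    """Check d-separation of node sets A and B given C in a DAG,
    where `pa` maps each node to the set of its parents."""
    A, B, C = set(A), set(B), set(C)
    # 1. Ancestral subgraph of A ∪ B ∪ C.
    relevant, stack = set(), list(A | B | C)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(pa.get(v, ()))
    # 2. Moralize: link each node to its parents and marry co-parents.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = [x for x in pa.get(v, ()) if x in relevant]
        for x in ps:
            adj[v].add(x)
            adj[x].add(v)
        for x, y in combinations(ps, 2):
            adj[x].add(y)
            adj[y].add(x)
    # 3. Delete C; A ⊥_G B | C iff A and B are now disconnected.
    seen, stack = set(), [a for a in A if a not in C]
    while stack:
        v = stack.pop()
        if v in B:
            return False            # an active path exists
        if v in seen or v in C:
            continue
        seen.add(v)
        stack.extend(adj[v] - C)
    return True
```

For the collider A → C ← B this correctly reports A ⊥ B unconditionally but not given {C}; for the chain A → B → C it reports A ⊥ C given {B} only.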
When $P$ is both Markov and faithful relative to $G$, we say that $P$ respects $G$, signifying: $A \perp B \mid C \Leftrightarrow A \perp_G B \mid C$. 3 Main results Under the Markov and faithfulness assumptions,[3] we provide an algorithm, called Greedy Ancestral Search (GAS), that achieves optimality in terms of the number of CI tests. Our algorithm is given in Section 4.2, with the following guarantees. [Footnote 3: We remark here that we do not need to assume causal sufficiency for Theorem 1 to hold. Causal sufficiency needs to be assumed when one requires that $G$ captures the complete causal explanation (Meek, 1997).] Theorem 1. Given infinite samples from a distribution respecting a DAG $G$, GAS (i.e., Algorithm 1) outputs $\mathcal{E}(G)$ using $p^{O(s)}$ CI tests. Here, $s$ is the size of the maximum undirected clique in $\mathcal{E}(G)$. This result shows that the number of CI tests required by GAS is upper bounded exponentially in the maximum undirected clique size of the essential graph. This is in contrast to current algorithms, which, without additional assumptions, require in the worst case a number of CI tests that is at least an exponential function of the maximum degree (Spirtes et al., 2001; Claassen et al., 2013). Complementing this result, we show that any algorithm requires at least $2^{\Omega(s)}$ CI tests to distinguish the ground-truth $[G]$ from other MECs. Theorem 2. Given infinite samples from a distribution respecting a DAG $G$, any algorithm requires at least $2^{\Omega(s)}$ CI tests to verify $G \in [G]$. Specifically, for any collection of fewer than $2^s - s - 1$ CI tests, there exists an MEC $[H] \neq [G]$ such that both classes are indistinguishable based on these CI relations. Since any constraint-based causal discovery algorithm must distinguish the true MEC from all others, it follows that, without additional assumptions, it must perform at least $2^{\Omega(s)}$ CI tests.
Together, Theorems 1 and 2 show that the complexity of constraint-based causal discovery, in terms of the number of CI tests, is exponential in $s$, and that our algorithm achieves this bound. Additionally, we show that our algorithm can output the correct MEC when faithfulness is violated. As a consequence, its correctness condition is weaker than Markov and faithfulness, and hence it can output the correct essential graph in more settings. While there exist other algorithms that also provably require weaker assumptions than Markov and faithfulness (Ramsey et al., 2006; Raskutti and Uhler, 2018; Solus et al., 2021), their complexity in terms of CI testing could be prohibitive. The proof of the following result can be found in Appendix D. Proposition 3. There exists a joint distribution that is Markov but not faithful to a DAG $G$, for which Algorithm 1, when given observational data sampled from it, correctly outputs the ground-truth $\mathcal{E}(G)$. In the remainder of this section, we provide a concise overview of the methods used to establish our main theorems. 3.1 Outline of methods Upper bound for GAS. Compared to other constraint-based algorithms (e.g., PC), the reason our algorithm performs fewer CI tests is its use of ancestral information during the adjacency search, which is learned by keeping track of both conditional independence and dependence statements. This ancestral information is inferred through two types of CI tests (given in Definitions 4 and 5 below), which primarily serve to identify edge orientations arising from v-structures and Meek rule 1. As we show in Section 4.2, these edge orientations are sufficient to recover all ancestral information in the underlying essential graph. GAS tracks this ancestral information through an iterative procedure that expands a prefix node set $S \subset V$.
This process, which builds on Shiragur et al. (2024), serves two key roles: it enables learning the remaining ancestral relations in the essential graph, and it simplifies CI testing for adjacencies by restricting the search space of the conditioning sets. Next, to show that the number of CI tests is bounded in terms of $s$, we prove that the iterative expansion procedure terminates with a partition of the nodes $V$ into disjoint chordal components $S_1, S_2, \ldots, S_m$, where each partial union $S_1 \cup S_2 \cup \cdots \cup S_i$ for $i = 1, \ldots, m$ forms a prefix node set. It follows from Dirac (1961) that any minimal separator within a chordal component $S_i$ forms a clique. This property implies that the conditioning sets used for CI tests within $S_i$ need not exceed the size of the largest clique in the skeleton, which can serve as a stopping criterion when searching for adjacencies within $S_i$. Further details are provided in Section 4. Lower bound on any constraint-based algorithm. The argument is based on the largest undirected clique of size $s$ in the essential graph. This clique contains $2^s - s - 1$ distinct subsets of size at most $s - 2$, each representing a potential conditioning set. If an algorithm performs fewer CI tests, there must exist at least one subset that has not been used as a conditioning set. For any such unused set, we show how to construct a distinct MEC that is consistent with all performed CI tests. A formal proof detailing this construction is provided in Section 5. 4 Upper bound In this section, we present our main algorithm and show that it outputs the correct essential graph using at most $p^{O(s)}$ CI tests. Proofs omitted from this section are provided in Appendix B. 4.1 Learning a prefix node set From the orientation rules of PC (steps (2) and (3) described in Section 1), we see that the directional information in the essential graph is derived from v-structures and the Meek rules.
While four Meek rules exist for orienting edges, only rule 1 introduces ancestral information not already captured by transitivity (see Figure 2). Therefore, the ancestral relationships in an essential graph can be completely determined by its v-structures and the application of Meek rule 1. Leveraging this observation, we developed two sets of CI tests to specifically identify the ancestral relations established by v-structures (Definition 4) and Meek rule 1 (Definition 5). Figure 2: Meek rules (Meek, 1995); panels (a)-(d) depict rules 1-4. The dashed lines partition nodes into sets based on ancestral relationships. Only Meek rule 1 establishes an ancestor for a node that previously had none, forcing a component to split. Definition 4 (Set $D_S^m$). For all $w \in \bar{S}$, let $w \in D_S^m$ if and only if $u \perp v \mid S \cup W$ and $u \not\perp v \mid S \cup W \cup \{w\}$ for some $u \in V$, $v \in \bar{S}$, and $W \subset \bar{S}$ with $|W| = m$. These CI tests identify a downstream node $w$ by its properties as a collider, or a descendant of a collider. Specifically, conditioning on $w$ creates a dependency between two previously independent nodes, $u$ and $v$. Definition 4 generalizes the concept of minimal conditional dependence (Claassen and Heskes, 2011; Magliacane et al., 2016) by incorporating a prefix node set. Next, we define tests to identify nodes whose incident edges are oriented based on Meek rules. Definition 5 (Set $F_S^m$). For all $w \in \bar{S}$, let $w \in F_S^m$ if and only if $u \not\perp w \mid S$ and $u \perp w \mid S \cup W$ for some $u \in S$ and $W \subset \bar{S}$ with $|W| = m$. This definition checks for a node $w$ that is initially dependent on a node $u$ in the prefix node set $S$, but becomes independent once we condition on $W \subseteq \bar{S}$. Graphically, this means that all active paths between $u$ and $w$ pass through $W$. This implies $w$ has an ancestor in $\bar{S}$, indicating that it can be excluded when expanding the prefix node set $S$.
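A literal, brute-force reading of Definition 4 as code, using a hypothetical lookup oracle in place of real CI tests. For the collider A → C ← B with $S = \emptyset$ and $m = 0$, the collider node C is flagged as downstream, while the two sources are not:

```python
from itertools import combinations

# Hypothetical oracle: the collider A -> C <- B is independent only as A ⊥ B | ∅.
INDEPENDENCIES = {(frozenset({"A", "B"}), frozenset())}

def ci(u, v, cond):
    return (frozenset({u, v}), frozenset(cond)) in INDEPENDENCIES

def D(nodes, S, m):
    """Definition 4: w ∈ D_S^m iff conditioning on w turns some pair
    u ⊥ v | S ∪ W (with W ⊂ complement of S, |W| = m) into a dependence."""
    rest = [x for x in nodes if x not in S]
    out = set()
    for w in rest:
        for u in nodes:
            for v in rest:
                if len({u, v, w}) < 3:
                    continue
                pool = [x for x in rest if x not in {u, v, w}]
                for W in combinations(pool, m):
                    cond = set(S) | set(W)
                    if ci(u, v, cond) and not ci(u, v, cond | {w}):
                        out.add(w)
    return out
```

The membership test for $F_S^m$ (Definition 5) is analogous, with the roles of dependence and independence swapped and $u$ drawn from $S$.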
With these definitions, we can expand a prefix node set $S \subset V$ (which may be empty) into a new, larger set $S' \subseteq V$. Theorem 6. Let $S \subset V$ be a prefix node set. For any integer $\ell$, the set $S' = V \setminus (D_S^0 \cup D_S^1 \cup F_S^1 \cup \cdots \cup D_S^{\ell} \cup F_S^{\ell})$ can be obtained with $O(p^{\ell+3})$ CI tests. In addition, $S'$ is a prefix node set satisfying $(S \cup \mathrm{src}(\bar{S})) \subseteq S'$, that is, it contains all the remaining source nodes in $\bar{S}$. The proof of Theorem 6 relies on the following lemmas. These lemmas serve two purposes: first, they establish the properties of the sets previously defined, and second, they guarantee that $S'$ strictly increases the size of the prefix node set $S$. Lemma 7. Let $S$ be a prefix node set. If $w \in D_S^m$, then $\mathrm{des}[w] \subseteq D_S^m$. Furthermore, we have $D_S^m \cap \mathrm{src}(\bar{S}) = \emptyset$. Figure 3: Example graph illustrating the classification of nodes into the sets $D_S^1$ and $F_S^1$ given the prefix node set $S = \{0, 1\}$. The preceding lemma shows that $D_S^m$ contains all descendants of its members. While descendants of nodes in $F_S^m$ are not necessarily in $F_S^m$, they must be in $D_S^m \cup F_S^m$, as shown in the following lemma; see also Figure 3. Lemma 8. Let $S$ be a prefix node set. If $w \in F_S^m \setminus (F_S^1 \cup \cdots \cup F_S^{m-1})$, then $\mathrm{des}[w] \subseteq F_S^m \cup D_S^m$. Furthermore, $F_S^m \cap \mathrm{src}(\bar{S}) = \emptyset$. Proof of Theorem 6. We first show that $S'$ satisfies $\mathrm{src}(\bar{S}) \subset S'$. By Lemmas 7 and 8 we have $(D_S^0 \cup D_S^1 \cup F_S^1 \cup \cdots \cup D_S^{\ell} \cup F_S^{\ell}) \cap \mathrm{src}(\bar{S}) = \emptyset$. Since $\bar{S'} = D_S^0 \cup D_S^1 \cup F_S^1 \cup \cdots \cup D_S^{\ell} \cup F_S^{\ell}$, it must hold that $\mathrm{src}(\bar{S}) \subset S'$. Next we show that $S'$ is a prefix node set. For this we only need to show that for all $x \in S'$ and $y \in \mathrm{anc}(x)$, it holds that $y \in S'$. For contradiction, assume $x \in S'$ but $y \notin S'$.
We must have $y \in D_S^i$ or $y \in F_S^j$ for some $i, j \le m$. If $y \in D_S^i$, then by Lemma 7 we have $x \in D_S^i$, a contradiction. Otherwise, let $j$ be the lowest number such that $y \in F_S^j$; then by Lemma 8 we have $x \in F_S^j \cup D_S^j$, a contradiction. Therefore $S'$ must be a prefix node set. We now bound the number of CI tests needed to obtain $S'$. Towards this, we derive the complexity of computing $D_S^m$ and $F_S^m$. Constructing $D_S^m$ requires checking, for all $w \in \bar{S}$, whether there exist $u \in V$, $v \in \bar{S}$, and $W \subset \bar{S}$ with $|W| = m$ such that $u \perp v \mid S \cup W$ and $u \not\perp v \mid S \cup W \cup \{w\}$. The required number of conditional independence tests is upper bounded by $2 \times \binom{|\bar{S}|}{1} \binom{|V|-1}{1} \binom{|\bar{S}|-1}{1} \binom{|\bar{S}|-2}{m}$, which is in $O(p^{m+3})$. Therefore, computing $D_S^m$ takes $O(p^{m+3})$ CI tests; similarly, $F_S^m$ takes $O(p^{m+2})$ CI tests. Since we compute up to $m = \ell$, the total number of CI tests required is $O(p^{\ell+3})$. ∎ 4.2 Learning the essential graph via GAS Algorithm 1: Greedy Ancestral Search (GAS). Input: CI queries from a distribution $P$ that respects a DAG $G$. Output: The essential graph of $G$. 1: $\mathcal{S} \leftarrow$ empty ordered list; $S \leftarrow \emptyset$; $\mathcal{E} \leftarrow$ complete undirected graph on $V$. 2: while $S \neq V$ do ▷ prefix node set expansion: 3: $V' \leftarrow V \setminus S$; $\ell \leftarrow 0$. 4: while there exists a clique in $\mathcal{E}[V']$ with size $\ge \ell$ do: 5: for $v \in V$, $w \in V'$ s.t. $v - w$ in $\mathcal{E}$ do ▷ edge removal: 6: remove $v - w$ in $\mathcal{E}$ if $\exists W \subset V'$ with $|W| = \ell$ s.t. $w \perp v \mid S \cup W$. 7: compute $D_S^{\ell}$; $V' \leftarrow V' \setminus D_S^{\ell}$ ▷ ordering learned from v-structures. 8: if $\ell > 0$ then compute $F_S^{\ell}$; $V' \leftarrow V' \setminus F_S^{\ell}$ ▷ ordering learned from Meek rules. 9: $\ell \leftarrow \ell + 1$. 10: $S \leftarrow S \cup V'$; add $V'$ to the end of $\mathcal{S}$. 11: for $v \in S_i$, $w \in S_j$ with $i < j$: if $v - w$ in $\mathcal{E}$, replace it with $v \to w$. 12: return $\mathcal{E}$. Our algorithm for learning essential graphs, Greedy Ancestral Search (GAS), is presented in Algorithm 1. It extends the prefix node set method of CCPG (Shiragur et al., 2024). Unlike CCPG, which only identifies a coarse, partition-level graph, GAS recovers the complete essential graph. In particular, we generalize their Definitions 4.1-4.3 by introducing the parameter $m$; the original definitions are recovered as special cases from our Definitions 4 and 5 by choosing $m = 0$ or $m = 1$. In the following, we prove that this generalization is sufficient to identify all edges in the essential graph, and we establish a stopping criterion for Algorithm 1, allowing us to bound the number of CI tests needed. GAS initializes with a complete undirected graph $\mathcal{E}$ on the node set $V$. It proceeds iteratively, constructing a sequence of expanding prefix node sets $\emptyset \subset \cdots \subset S \subset \cdots \subset V$ by greedily adding elements according to Theorem 6, while simultaneously removing edges from the working graph. The number of CI tests used in this construction is bounded by an exponential function of $\ell$, which we show, using our stopping criterion, to be no greater than the size of the maximum undirected clique in the essential graph. A detailed, step-by-step example in Appendix C illustrates the execution of Algorithm 1 on a 5-node graph, showing how the prefix node set is expanded, how edges are removed, and how the final edge orientations are determined. The sets $D_S^m$ and $F_S^m$ are calculated using the CI test results from the edge removal step (line 6 of Algorithm 1).
The computation of $D_S^m$, for example, only requires some targeted additional tests to identify descendants of v-structures (see Algorithm 3). The specific implementation is detailed in Appendix E.

We now sketch the proof of Theorem 1, which establishes the correctness and guarantees for Algorithm 1. The full formal proof can be found in Appendix B.3.

Correctness. To show that the algorithm correctly outputs the essential graph $\mathcal{E}(\mathcal{G})$, we formally characterize the nodes contained within the computed sets $D_S^m$ and $F_S^m$ (Lemmas 17 and 18). This characterization demonstrates that these sets accurately capture the information necessary for the algorithm to learn the partial ordering defined by the directed edges of $\mathcal{E}(\mathcal{G})$. This ordering is reflected in the partition $S_1, \dots, S_m$ identified by Algorithm 1, where directed edges in $\mathcal{E}(\mathcal{G})$ exist only between nodes belonging to different components $S_i$ and $S_j$.

Number of CI tests. The key is to bound the maximum order $\ell$ up to which the sets $D_S^m$ and $F_S^m$ must be computed. In Appendix B.3, we prove the following result: if $w \in D_S^m$ and $w \notin D_S^0 \cup \dots \cup D_S^{m-1}$, then there must exist a clique of size $m$ in $\mathrm{anc}(w) \setminus S$. This implies that the maximum required order is bounded by the size of the maximum clique in any of the components $S_i$. This leads to the overall complexity bound of $p^{\mathcal{O}(s)}$, where $s$ is the size of the maximum undirected clique in $\mathcal{E}(\mathcal{G})$.

5 Lower bound

The following lemma from Zhang et al. (2024) specifies the exact CI tests on which two DAGs will disagree if they differ only by a single edge, provided that edge is undirected in the essential graph of the larger DAG.

Lemma 9 (Lemma 7 in Zhang et al. (2024)). Let $\mathcal{G}$ and $\mathcal{H}$ be two DAGs such that $\mathcal{H}$ differs from $\mathcal{G}$ by missing one edge $u \to v$, which is undirected in $\mathcal{E}(\mathcal{G})$. Denote the maximum undirected clique in $\mathcal{G}$ containing $u, v$ by $S$.
For disjoint sets $A, B, C \subset [p]$, the statement of whether $A$ and $B$ are d-separated by $C$ differs between $\mathcal{G}$ and $\mathcal{H}$ only if
$(\mathrm{pa}_{\mathcal{G}}(v) \cap \mathrm{ch}_{\mathcal{G}}(u)) \cap S \subseteq C \cap S \subseteq (\mathrm{pa}_{\mathcal{G}}(v) \setminus \{u\}) \cap S.$ (1)

Proof of Theorem 2. First, note that by the binomial theorem we have $2^s = \sum_{i=0}^{s} \binom{s}{i}$. This is exactly the number of ways to choose an unordered subset of any size from a fixed set of $s$ elements. Therefore $2^s - \binom{s}{s-1} - \binom{s}{s} = 2^s - s - 1$ is exactly the number of ways to choose a subset of size at most $s - 2$ from $s$ nodes. Let $S$ be the largest undirected clique in $\mathcal{E}(\mathcal{G})$. If fewer than $2^s - s - 1$ tests have been performed, there must be a subset $W \subseteq S$ of at most $s - 2$ nodes such that no test $A \perp B \mid C$ with $C \cap S = W$ has been performed. Since $S$ is an undirected clique in $\mathcal{E}(\mathcal{G})$, its nodes can be ordered arbitrarily to obtain a valid DAG in $[\mathcal{G}]$. Let $\mathcal{F} \in [\mathcal{G}]$ be a DAG such that in the topological ordering of $S$ all nodes in $W$ are placed consecutively after the first node. Let $u$ denote this first node in the topological ordering, and let $v$ denote the node after all nodes in $W$. Let $\mathcal{H}$ be the DAG obtained by removing the edge $u \to v$ from $\mathcal{F}$. For disjoint sets $A, B, C$, testing whether $A$ and $B$ are independent given $C$ will disagree between $\mathcal{F}$ and $\mathcal{H}$ only if $C \cap S = W$. This is because, using Lemma 9, we have
$W = (\mathrm{pa}_{\mathcal{F}}(v) \cap \mathrm{ch}_{\mathcal{F}}(u)) \cap S \subseteq C \cap S \subseteq (\mathrm{pa}_{\mathcal{F}}(v) \setminus \{u\}) \cap S = W.$ (2)
Since no CI test satisfying $C \cap S = W$ has been performed, all the performed tests agree with $[\mathcal{G}]$ and $[\mathcal{H}]$. Therefore, more tests must be performed in order to distinguish the MECs $[\mathcal{G}]$ and $[\mathcal{H}]$. ∎

We remark that our lower bound in Theorem 2 is strictly stronger than the one provided in Zhang et al. (2024). Our result provides an any-case guarantee: for any collection of fewer than $2^{\Omega(s)}$ CI tests, we show that an indistinguishable MEC is guaranteed to exist.
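As a small numerical sanity check (not part of the proof), the subset count used in the argument above, the number of subsets of an $s$-clique with size at most $s - 2$, can be verified against the closed form $2^s - s - 1$:

```python
from itertools import combinations
from math import comb

def small_subsets(s):
    """Number of subsets of size at most s - 2 of an s-element set."""
    return sum(comb(s, i) for i in range(s - 1))

for s in range(2, 12):
    # brute-force enumeration to confirm the closed form 2^s - s - 1
    explicit = sum(1 for k in range(s - 1)
                   for _ in combinations(range(s), k))
    assert small_subsets(s) == explicit == 2**s - s - 1

print(small_subsets(8))  # 2^8 - 8 - 1 = 247
```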
In contrast, Zhang et al. (2024) establish a worst-case result by constructing a specific MEC and proving that no algorithm can distinguish it with fewer than $2^{\Omega(s)}$ tests.

6 Experiments

In this section, we empirically evaluate our proposed algorithm, Greedy Ancestral Search (GAS), and its variant, GAS+, which incorporates $\mathcal{O}(p^2)$ additional CI tests. We observe that these additional CI tests improve performance in practice (details can be found in Appendix F). We benchmark performance against established causal discovery algorithms on synthetic data from linear Gaussian structural equation models (Wright, 1921) on random Erdős-Rényi graphs (Erdős and Rényi, 1959), on semi-synthetic gene expression data obtained using the SERGIO simulator (Dibaeinia and Sinha, 2020), and on a real-world dataset. Implementation details and data descriptions are provided in Appendices E and G, respectively. Appendix H presents additional experiments, including results on scale-free graphs, networks with thousands of nodes, and comparisons of the number of CI tests performed by different algorithms. All code for these experiments is available at https://github.com/uhlerlab/greedy-ancestral-search.

6.1 Linear Gaussian synthetic data

Figure 4: Comparison across Erdős-Rényi graphs on 50 nodes and increasing expected neighborhood size. Results are averaged over 3 runs. (a) Execution time (seconds) presented on a logarithmic scale. (b) Accuracy comparison measured by Structural Hamming Distance (SHD) between predicted and ground-truth graphs, normalized by the number of possible edges. Shaded regions show the standard deviation.

We compare both implementations of Algorithm 1, GAS and GAS+, against other constraint-based and hybrid algorithms that rely on CI tests. We include the popular constraint-based method PC (Spirtes et al., 2001), using the order-independent variant from Colombo and Maathuis (2014).
We additionally include GSP with both depth 4 and depth ∞ (Solus et al., 2021). A comparison with the algorithms FCI and GRaSP2 (Spirtes et al., 2001; Lam et al., 2022), along with additional metrics and a similar experiment scaled to 2500 nodes, can be found in Appendix H. To ensure fair comparisons, evaluations of algorithms originating from different software packages are presented separately. Within each evaluation set, GAS and GAS+ used the same CI tester as the respective baseline algorithms.

The experimental results are presented in Figure 4. Regarding computational speed, a consistent finding is that both variants of our algorithm, GAS and GAS+, are significantly faster than the other methods. Analyzing the prediction accuracy shown in Figure 4(b), GAS+ achieves accuracy comparable to the GSP algorithms. The PC algorithm demonstrates high accuracy on sparser graphs, but its performance degrades by a large margin on denser graphs. In contrast, GAS performs fewer tests, resulting in the highest accuracy on denser graphs but lower accuracy in sparse settings.

6.2 SERGIO

For our semi-synthetic evaluation, we simulated gene expression data using the SERGIO model (Dibaeinia and Sinha, 2020). This enables precise comparison of inferred structures against the ground-truth DAG. As in Dibaeinia and Sinha (2020), we utilized their DS1 graph, sampling 2700 cells, each with the expression of 100 genes. Further details on this experimental setup are provided in Appendix G.

Figure 5: Comparison on semi-synthetic data generated using SERGIO. Results are averaged over 5 runs. (a) Execution time (seconds) presented on a logarithmic scale. (b) Accuracy comparison measured by Structural Hamming Distance (SHD) between predicted and ground-truth graphs, normalized by the number of possible edges.

Figure 5 presents the comparative results on this semi-synthetic data.
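The normalized SHD reported in Figures 4 and 5 can be sketched as follows. This is a generic rendering of the metric (skeleton differences plus orientation mismatches, divided by the $\binom{p}{2}$ possible edges); the paper's exact implementation may score partially directed edges differently.

```python
def normalized_shd(g1, g2, p):
    """Structural Hamming Distance between two graphs on p nodes, divided
    by the number of possible edges p*(p-1)/2. Each graph is a set of
    (u, v) pairs; an undirected edge is encoded as both (u, v) and (v, u).
    A missing/extra edge or a differing orientation each cost 1."""
    skel1 = {frozenset(e) for e in g1}
    skel2 = {frozenset(e) for e in g2}
    errors = len(skel1 ^ skel2)              # skeleton disagreements
    for e in skel1 & skel2:                  # shared edges: compare marks
        u, v = tuple(e)
        if ((u, v) in g1, (v, u) in g1) != ((u, v) in g2, (v, u) in g2):
            errors += 1
    return errors / (p * (p - 1) / 2)

# Chain a -> b -> c versus collider a -> b <- c: one orientation mismatch
# out of three possible edges.
print(normalized_shd({("a", "b"), ("b", "c")},
                     {("a", "b"), ("c", "b")}, 3))
```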
The high-degree structure of this dataset makes standard algorithms like PC and FCI computationally infeasible due to prohibitive runtimes (further discussed in Appendix G.2). We therefore benchmark our proposed methods, GAS and GAS+, against GRaSP2 (Lam et al., 2022); GSP's performance was found to be less competitive compared to the methods presented here. The findings indicate that while GAS is the fastest algorithm overall, GAS+ achieves the highest accuracy in recovering the underlying causal structure while being much faster than GRaSP2.

6.3 Real-world example

We consider the 6-variable Airfoil dataset (Asuncion and Newman, 2007); see Lam et al. (2022). Lacking a complete ground-truth DAG, we assess consistency with the known causal relations for this system: (1) velocity, chord, and attack are experimental variables and thus exogenous, while (2) pressure is the measured outcome, expected to be an effect of the other variables. Figure 6 shows the partially directed graphs learned by different algorithms on the Airfoil dataset. While all algorithms correctly identify pressure as a downstream variable, GAS is the most efficient, requiring the fewest CI tests and running twice as fast as the score-based GRaSP2 algorithm. However, we note that only PC correctly identifies attack as an exogenous variable. For GAS and GAS+, this discrepancy stems from their reliance on conditional dependencies to infer ancestry. Specifically, the data indicate that frequency and chord are marginally independent but become dependent when conditioned on attack. If we assume the distribution respects an underlying DAG, this pattern implies that attack is a collider (or a descendant of one) and thus non-exogenous. While the CI test results may be correct, this implication can be broken by latent confounders.
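The collider argument above can be reproduced on simulated data. In the sketch below the variable names mirror the Airfoil discussion, but the data is synthetic (generated so that attack is a noisy sum of frequency and chord), and a simple partial correlation serves as a stand-in for a proper CI test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
frequency = rng.normal(size=n)
chord = rng.normal(size=n)                             # independent of frequency
attack = frequency + chord + 0.5 * rng.normal(size=n)  # collider of the two

def partial_corr(x, y, z):
    """Correlation of x and y after regressing out z (Gaussian CI proxy)."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

marginal = float(np.corrcoef(frequency, chord)[0, 1])
conditional = partial_corr(frequency, chord, attack)
# marginally near-independent, strongly dependent given the collider
print(f"{marginal:.3f} {conditional:.3f}")
```

Marginally, frequency and chord are (near-)uncorrelated, but conditioning on the collider induces a strong negative partial correlation, which is exactly the pattern that leads a purely observational, DAG-respecting analysis to declare attack non-exogenous.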
Figure 6: Partially directed graphs over the variables velocity, attack, chord, frequency, displacement, and pressure learned on the Airfoil dataset: (a) GAS, (b) GAS+, (c) PC, (d) GRaSP2. GAS, GAS+, and PC performed 46, 50, and 104 distinct CI tests, respectively.

7 Discussion

In this work, we established tight bounds on the number of conditional independence tests required for causal structure discovery using constraint-based algorithms. We proposed an exponent-optimal algorithm achieving this bound up to a logarithmic factor, requiring fewer CI tests than existing algorithms. Our empirical evaluations demonstrate that the proposed algorithm is significantly faster than established baseline methods and achieves highly competitive accuracy, particularly on denser graphs.

Limitations and future work. This work is motivated by the goal of improving the efficiency of constraint-based algorithms, both in terms of computational speed and correctness. We characterize efficiency by the number of CI tests performed, which directly impacts computational speed. However, as discussed in Section 1.1, the number of CI tests is related to, but not equivalent to, the exact correctness conditions required by an algorithm. While we have shown that our algorithm can succeed under assumptions weaker than the standard Markov and faithfulness conditions, its precise correctness guarantees, and how they compare to those of existing algorithms, remain unknown. Moreover, understanding how finite-sample effects influence these correctness conditions is an important direction for future work, critical to understanding the potential and limitations of constraint-based causal discovery. Finally, our algorithm's reliance on both conditional independencies and dependencies raises practical considerations.
Future work should explore whether the algorithm’s performance on finite, real-world data can be optimized by using different p-value thresholds or distinct statistical tests for each constraint. Acknowledgements We thank Kiran Shiragur for initial discussions that inspired this work. We also thank the anonymous reviewers for their helpful feedback. M.F.M. was partially supported by the Eric and Wendy Schmidt Center at the Broad Institute, Fundació Privada Mir-Puig, and a MOBINT-MIF grant. J.Z. was partially supported by an Apple AI/ML PhD Fellowship. C.U. was partially supported by NCCIH/NIH (1DP2AT012345), ONR (N00014-24-1-2687), the United States Department of Energy (DE-SC0023187), and the Eric and Wendy Schmidt Center at the Broad Institute. References Alonso-Barba et al., (2013) Alonso-Barba, J. I., Gámez, J. A., Puerta, J. M., et al. (2013). Scaling up the greedy equivalence search algorithm by constraining the search space of equivalence classes. International Journal of Approximate Reasoning, 54(4):429–451. Andersson et al., (1997) Andersson, S. A., Madigan, D., and Perlman, M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541. Arora and Barak, (2009) Arora, S. and Barak, B. (2009). Computational complexity: a modern approach. Cambridge University Press. Asuncion and Newman, (2007) Asuncion, A. and Newman, D. (2007). UCI Machine Learning Repository. Brenner and Sontag, (2013) Brenner, E. and Sontag, D. (2013). Sparsityboost: A new scoring function for learning Bayesian network structure. arXiv preprint arXiv:1309.6820. Bron and Kerbosch, (1973) Bron, C. and Kerbosch, J. (1973). Algorithm 457: finding all cliques of an undirected graph. Commun. ACM, 16(9):575–577. Brooks et al., (1989) Brooks, T. F., Pope, D. S., and Marcolini, M. A. (1989). Airfoil Self-Noise. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5VW2C. Chickering, (2002) Chickering, D. M. (2002). 
Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554. Chickering et al., (2004) Chickering, D. M., Heckerman, D., and Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5(Oct):1287–1330. Claassen and Heskes, (2011) Claassen, T. and Heskes, T. (2011). A logical characterization of constraint-based causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI'11, page 135–144, Arlington, Virginia, USA. AUAI Press. Claassen et al., (2013) Claassen, T., Mooij, J. M., and Heskes, T. (2013). Learning sparse causal models is not NP-hard. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI'13, page 172–181, Arlington, Virginia, USA. AUAI Press. Colombo and Maathuis, (2014) Colombo, D. and Maathuis, M. H. (2014). Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15(1):3741–3782. Colombo et al., (2012) Colombo, D., Maathuis, M. H., Kalisch, M., and Richardson, T. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321. Dibaeinia and Sinha, (2020) Dibaeinia, P. and Sinha, S. (2020). SERGIO: a single-cell expression simulator guided by gene regulatory networks. Cell Systems, 11(3):252–271. Dirac, (1961) Dirac, G. A. (1961). On rigid circuit graphs. Abhandlungen aus dem Mathematischen Seminar der Universität Hamburg, 25(1-2):71–76. Duncan, (1975) Duncan, O. D. (1975). Introduction to structural equation models. Academic Press. Erdős and Rényi, (1959) Erdős, P. and Rényi, A. (1959). On random graphs. I. Publicationes Mathematicae Debrecen, 6:290–297. Faller and Janzing, (2025) Faller, P. M. and Janzing, D. (2025). On different notions of redundancy in conditional-independence-based discovery of graphical models. arXiv preprint arXiv:2502.08531.
Geiger and Heckerman, (2002) Geiger, D. and Heckerman, D. (2002). Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. The Annals of Statistics, 30(5):1412–1440. Haavelmo, (1943) Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica, Journal of the Econometric Society, pages 1–12. Hagberg et al., (2008) Hagberg, A. A., Schult, D. A., and Swart, P. J. (2008). Exploring network structure, dynamics, and function using NetworkX. In Varoquaux, G., Vaught, T., and Millman, J., editors, Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA, USA. Kalisch and Bühlmann, (2007) Kalisch, M. and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(3). Kitson et al., (2023) Kitson, N. K., Constantinou, A. C., Guo, Z., Liu, Y., and Chobtham, K. (2023). A survey of Bayesian network structure learning. Artificial Intelligence Review, 56(8):8721–8814. Kocaoglu, (2023) Kocaoglu, M. (2023). Characterization and learning of causal graphs with small conditioning sets. Advances in Neural Information Processing Systems, 36:74140–74179. Lam et al., (2022) Lam, W.-Y., Andrews, B., and Ramsey, J. (2022). Greedy relaxations of the sparsest permutation algorithm. In Uncertainty in Artificial Intelligence, pages 1052–1062. PMLR. Lauritzen, (1982) Lauritzen, S. L. (1982). Lectures on Contingency Tables. University of Aalborg Press, Aalborg, Denmark, 2nd edition. Lauritzen, (1996) Lauritzen, S. L. (1996). Graphical Models. Oxford University Press. Magliacane et al., (2016) Magliacane, S., Claassen, T., and Mooij, J. M. (2016). Ancestral causal inference. Advances in Neural Information Processing Systems, 29. Mazaheri et al., (2025) Mazaheri, B., Zhang, J., and Uhler, C. (2025). Faithfulness and intervention-only causal discovery.
In ICML 2025 Workshop on Scaling Up Intervention Models. Meek, (1995) Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI’95, page 403–410, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. Meek, (1997) Meek, C. (1997). Graphical Models: Selecting causal and statistical models. PhD thesis, Carnegie Mellon University. Mokhtarian et al., (2025) Mokhtarian, E., Elahi, S., Akbari, S., and Kiyavash, N. (2025). Recursive causal discovery. Journal of Machine Learning Research, 26(61):1–65. Nandy et al., (2018) Nandy, P., Hauser, A., and Maathuis, M. H. (2018). High-dimensional consistency in score-based and hybrid structure learning. The Annals of Statistics, 46(6A):3151–3183. Pearl, (1985) Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pages 329–334. Pearl, (1988) Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc. Pearl, (2009) Pearl, J. (2009). Causality. Cambridge University Press. Ramsey et al., (2006) Ramsey, J., Spirtes, P., and Zhang, J. (2006). Adjacency-faithfulness and conservative causal inference. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, UAI’06, page 401–408, Arlington, Virginia, USA. AUAI Press. Raskutti and Uhler, (2018) Raskutti, G. and Uhler, C. (2018). Learning directed acyclic graph models based on sparsest permutations. Stat, 7(1):e183. Richardson, (1996) Richardson, T. (1996). A discovery algorithm for directed cyclic graphs. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, UAI’96, page 454–461, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. Sadeghi, (2017) Sadeghi, K. (2017). 
Faithfulness of probability distributions and graphs. Journal of Machine Learning Research, 18(148):1–29. Schulte et al., (2010) Schulte, O., Frigo, G., Greiner, R., and Khosravi, H. (2010). The IMAP hybrid method for learning Gaussian Bayes nets. In Advances in Artificial Intelligence: 23rd Canadian Conference on Artificial Intelligence, Canadian AI 2010, Ottawa, Canada, May 31–June 2, 2010. Proceedings 23, pages 123–134. Springer. Shiragur et al., (2024) Shiragur, K., Zhang, J., and Uhler, C. (2024). Causal discovery with fewer conditional independence tests. In Forty-first International Conference on Machine Learning. Solus et al., (2021) Solus, L., Wang, Y., and Uhler, C. (2021). Consistency guarantees for greedy permutation-based causal inference algorithms. Biometrika, 108(4):795–814. Sondhi and Shojaie, (2019) Sondhi, A. and Shojaie, A. (2019). The reduced PC-algorithm: Improved causal structure learning in large random networks. Journal of Machine Learning Research, 20(164):1–31. Spirtes et al., (1989) Spirtes, P., Glymour, C., and Scheines, R. (1989). Causality from probability. Spirtes et al., (2001) Spirtes, P., Glymour, C., and Scheines, R. (2001). Causation, Prediction, and Search. The MIT Press. Squires, (2018) Squires, C. (2018). causaldag: creation, manipulation, and learning of causal models. Teh et al., (2024) Teh, K. Z., Sadeghi, K., and Soo, T. (2024). A general framework for constraint-based causal learning. arXiv preprint arXiv:2408.07575. Uhler et al., (2013) Uhler, C., Raskutti, G., Bühlmann, P., and Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, pages 436–463. Verma and Pearl, (1990) Verma, T. and Pearl, J. (1990). Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI ’90, page 255–270, USA. Elsevier Science Inc. Wright, (1921) Wright, S. (1921). Correlation and causation. 
Journal of Agricultural Research, 20(7):557. Zhang et al., (2024) Zhang, J., Shiragur, K., and Uhler, C. (2024). Membership testing in Markov equivalence classes via independence queries. In International Conference on Artificial Intelligence and Statistics, pages 3925–3933. PMLR. Zheng et al., (2024) Zheng, Y., Huang, B., Chen, W., Ramsey, J., Gong, M., Cai, R., Shimizu, S., Spirtes, P., and Zhang, K. (2024). Causal-learn: Causal discovery in Python. Journal of Machine Learning Research, 25(60):1–8.

Appendix A Meek rules

Proposition 10 (Meek rules, Meek (1995)). We can infer all directed edges in $\mathcal{E}(\mathcal{G})$ using the following four rules:
R1. If $i \to j \sim k$ and $i \not\sim k$, then $j \to k$.
R2. If $i \to j \to k$ and $i \sim k$, then $i \to k$.
R3. If $i \sim j$, $i \sim k$, $i \sim l$, $j \to k$, $l \to k$ and $j \not\sim l$, then $i \to k$.
R4. If $i \sim j$, $i \sim k$, $i \sim l$, $j \to k$, $i \to j$ and $j \not\sim l$, then $l \to k$.

Appendix B Omitted proofs of upper bound

We will start by showing the following lemma.

Lemma 11. Let $S \subset V$ be a prefix node set. Let $A$, $B$ and $C$ be disjoint subsets such that $A \subset V$ and $B, C \subset \bar{S}$. Let $\mathcal{H} = \mathcal{G}[\bar{S} \cup A]$ be the induced subgraph of $\mathcal{G}$ on the node set $\bar{S} \cup A$. Then $A \perp_{\mathcal{G}} B \mid (S \setminus A) \cup C$ if and only if $A \perp_{\mathcal{H}} B \mid C$.

Proof. We first prove the if direction. Assume $A \not\perp_{\mathcal{H}} B \mid C$. This means there exists an active path $P$ in $\mathcal{H}$ between a node in $A$ and a node in $B$, given $C$. By definition of $\mathcal{H}$ as $\mathcal{G}[\bar{S} \cup A]$, all nodes on the path $P$ necessarily belong to $\bar{S} \cup A$, which means no node on $P$ is part of $S \setminus A$. For $P$ to be active in $\mathcal{H}$ given $C$, every non-collider on $P$ must not be in $C$, and every collider on $P$ must either be in $C$ or have a descendant in $C$. Now, consider this same path $P$ within the graph $\mathcal{G}$, conditioned on $(S \setminus A) \cup C$: any non-collider on $P$, not being in $C$ (since $P$ is active in $\mathcal{H}$) and not being in $S \setminus A$ (since $P$ lies entirely in $\mathcal{H}$), is therefore not in $(S \setminus A) \cup C$.
Similarly, any collider on $P$ that is in $\mathrm{anc}_{\mathcal{H}}[C]$ is consequently in $\mathrm{anc}_{\mathcal{G}}[(S \setminus A) \cup C]$. Since the path $P$ satisfies the criteria for an active path in $\mathcal{G}$ when conditioned on $(S \setminus A) \cup C$, it follows that $A \not\perp_{\mathcal{G}} B \mid (S \setminus A) \cup C$.

For the only if direction, we assume $A \not\perp_{\mathcal{G}} B \mid (S \setminus A) \cup C$. This assumption implies that there exists a path $P$ between a node in $A$ and a node in $B$ that is active in $\mathcal{G}$ when conditioned on $(S \setminus A) \cup C$; we can select $P$ such that only its endpoints are in $A \cup B$. For $P$ to be active, any non-collider on this path must not be in $(S \setminus A) \cup C$, which directly means non-colliders on $P$ must reside in $A \cup \bar{S} \setminus C$. Colliders on $P$, on the other hand, must be in $\mathrm{anc}_{\mathcal{G}}[(S \setminus A) \cup C]$. Consider a collider $c$ on path $P$. If $c$ were in $\mathrm{anc}_{\mathcal{G}}[S \setminus A]$, then because $S$ is a prefix set (which implies $\mathrm{anc}_{\mathcal{G}}[S \setminus A] \subseteq S$), $c$ itself must be in $S$. As $c$ is an intermediate node on $P$ (not in $A \cup B$), $c$ being in $S$ means $c \in S \setminus A$. If such a collider $c \in S \setminus A$ is on $P$, the prefix nature of $S$ necessitates that at least one of $c$'s neighbors on path $P$ is also in $S \setminus A$ while being an intermediate node on $P$, as $B \subset \bar{S}$. This neighbor, however, must function as a non-collider on $P$ while also being in $S \setminus A$, thereby blocking $P$ in $\mathcal{G}$, a contradiction. Therefore, any collider on path $P$ must be in $\mathrm{anc}_{\mathcal{G}}[C] \setminus S \subseteq \mathrm{anc}_{\mathcal{H}}[C]$. Given that non-colliders are in $A \cup \bar{S} \setminus C$ and colliders are in $\mathrm{anc}_{\mathcal{H}}[C]$, it follows that $P$ is active in $\mathcal{H}$ given $C$. This implies $A \not\perp_{\mathcal{H}} B \mid C$. ∎

We will make use of the following lemma.

Lemma 12. Let $S \subset V$ be a prefix set. Let $a, b \in V$ and $W \subset \bar{S}$ be such that $a \perp b \mid S \cup W$ and $a \not\perp b \mid S \cup W'$ for all proper subsets $W' \subset W$. Then $W \subseteq \mathrm{anc}(a) \cup \mathrm{anc}(b)$.

Proof. We aim to show that every element $w \in W$ must be a member of $\mathrm{anc}(a) \cup \mathrm{anc}(b)$. The argument can be understood by iteratively considering how elements of $W$ contribute to blocking paths.
If $W = \emptyset$, the inclusion in $\mathrm{anc}(a) \cup \mathrm{anc}(b)$ is trivial.

1. Initial element. Since $W' = \emptyset$ is a proper subset of $W$, we have $a \not\perp b \mid S$. Therefore, there must exist an active path $P$ between $a$ and $b$ given $S$. Since $S$ is prefix, by Lemma 11 there are no intermediate nodes of $P$ in $S$, and therefore the path does not contain colliders. This implies that all nodes on $P$ are necessarily within $\mathrm{anc}(a) \cup \mathrm{anc}(b)$. Since $a \perp b \mid S \cup W$, this path $P$ must be blocked by $S \cup W$. Thus, at least one element from $W$, termed $x_1$, must lie on $P$ as a non-collider. Being a node on $P$, $x_1$ is therefore in $\mathrm{anc}(a) \cup \mathrm{anc}(b)$.

2. Subsequent elements. Let $X_k = \{x_1, \dots, x_k\}$ be a subset of $W$ for which it has been established that $X_k \subseteq \mathrm{anc}(a) \cup \mathrm{anc}(b)$. If $X_k \neq W$, we must have $a \not\perp b \mid S \cup X_k$, so there exists an active path $P_k$ between $a$ and $b$ given $S \cup X_k$. Since $S \cup X_k$ does not block $P_k$ but $S \cup W$ does, a node from $W \setminus X_k$, termed $x_{k+1}$, must block $P_k$. Since $P_k$ is active, any collider on $P_k$ must also be in $\mathrm{anc}[X_k]$. Since $X_k \subseteq \mathrm{anc}(a) \cup \mathrm{anc}(b)$, we have $\mathrm{anc}[X_k] \subseteq \mathrm{anc}(a) \cup \mathrm{anc}(b)$. This implies that all nodes on the entire path $P_k$ are in $\mathrm{anc}(a) \cup \mathrm{anc}(b)$. Since $x_{k+1}$ is on $P_k$, $x_{k+1}$ must also be in $\mathrm{anc}(a) \cup \mathrm{anc}(b)$.

Therefore every element in $W$ must be in $\mathrm{anc}(a) \cup \mathrm{anc}(b)$. ∎

B.1 Proof of Lemma 7

To prove Lemma 7, we will make use of the following definition and lemma.

Definition 13 (Set $A_S$). Let $S \subset V$ be a set of nodes, not necessarily prefix. For all $w \in \bar{S}$, let $w \in A_S$ if and only if $u \perp v \mid S$ and $u \not\perp v \mid S \cup \{w\}$ for some $u \in V$ and $v \in \bar{S}$.

Lemma 14. If $w \in A_S$, then $\mathrm{des}[w] \subset A_S$. Furthermore, $A_S \cap \mathrm{src}(\bar{S}) = \emptyset$.

Proof. We first show that if $w \in A_S$, then it must hold that $\mathrm{des}[w] \subset A_S$: since $w \in A_S$, there exist a node $u \in V$ and $v \in \bar{S}$ such that $u \perp v \mid S$ and $u \not\perp v \mid S \cup \{w\}$.
We now show that for any $x \in \mathrm{des}[w]$, we have $u \not\perp v \mid S \cup \{x\}$. Since $u \not\perp v \mid S \cup \{w\}$, there is a path $P$ from $u$ to $v$ that is active given $S \cup \{w\}$. Therefore, all non-colliders on $P$ are not in $S \cup \{w\}$ and all colliders on $P$ are in $\mathrm{anc}[S \cup \{w\}]$. Since $x \in \mathrm{des}[w]$, all colliders on $P$ are in $\mathrm{anc}[S \cup \{x\}]$. If no non-collider is in $S \cup \{x\}$, then $P$ is an active path from $u$ to $v$ given $S \cup \{x\}$, and thus $u \not\perp v \mid S \cup \{x\}$. Otherwise, some non-collider on $P$ is $x$. Since $u \perp v \mid S$, the path $P$ is inactive given $S$. From above we know that all non-colliders on $P$ are not in $S$. Therefore there exists a collider on $P$ that is not in $\mathrm{anc}[S]$. Suppose the leftmost and rightmost such colliders are $k, k'$ (it is possible that $k = k'$); then $k, k'$ must be in $\mathrm{anc}[S \cup \{w\}] \setminus \mathrm{anc}[S] \subseteq \mathrm{anc}[w] \subseteq \mathrm{anc}[x]$. Therefore, the path $P$ takes the form $u - \dots \to k \leftarrow \dots \to k' \leftarrow \dots - v$.

If $x$ is between $u$ and $k$ (or between $k'$ and $v$), consider the path $Q$ obtained by cutting out the part between $x$ and $k'$ (or between $k$ and $x$) on $P$ and replacing it with a directed path from $k'$ to $x$ (or from $k$ to $x$). Compared to $P$, the additional non-colliders on $Q$ all lie on the directed path from $k'$ to $x$ (or from $k$ to $x$). They are not in $S$ since $k, k' \notin \mathrm{anc}[S]$, and thus $Q$ has no non-colliders in $S$. Compared to $P$, no collider of $P$ that is not in $\mathrm{anc}[S]$ remains on $Q$, by the fact that $k, k'$ are the leftmost and rightmost colliders on $P$ that are not in $\mathrm{anc}[S]$. Therefore, $x$ must be a collider on $Q$, or else $Q$ would be active given $S$, implying $u \not\perp v \mid S$, a contradiction. Therefore all non-colliders on $Q$ are not in $S \cup \{x\}$. Every collider on $Q$ is either $x$ or a collider of $P$, which is in $\mathrm{anc}[S \cup \{x\}]$. Thus $Q$ is active given $S \cup \{x\}$, and therefore $u \not\perp v \mid S \cup \{x\}$.

If $x$ is between $k$ and $k'$, we can replace the part between $k$ and $x$ by the directed path from $k$ to $x$ and the part between $x$ and $k'$ by the directed path from $k'$ to $x$.
There is no additional collider on the new path, and all non-colliders are not in $S$ since $k, k' \notin \mathrm{anc}[S]$. Therefore the path is active given $S \cup \{x\}$ and $u \not\perp v \mid S \cup \{x\}$.

Next we show that $A_S \cap \mathrm{src}(\bar{S}) = \emptyset$: for contradiction, assume that there exists a node $a \in \mathrm{src}(\bar{S})$ such that $a \in A_S$, that is, $a \in \mathrm{src}(\bar{S})$ and, for some nodes $u \in V$ and $v \in \bar{S}$, $u \perp v \mid S$ and $u \not\perp v \mid S \cup \{a\}$. Since $u \perp v \mid S$ and $u \not\perp v \mid S \cup \{a\}$, there exists a path $P$ between $u$ and $v$ which is inactive when conditioned on $S$ but becomes active upon conditioning on $S \cup \{a\}$. Moreover, this path contains a node $b$ that is a collider on $P$ and satisfies $a \in \mathrm{des}[b]$ and $\mathrm{des}[b] \cap S = \emptyset$. Since $\mathrm{des}[b] \cap S = \emptyset$, we have that $b \in \bar{S}$. Furthermore, since $b \in \bar{S}$, $a \in \mathrm{src}(\bar{S})$ and $a \in \mathrm{des}[b]$, this implies that $b = a$. Therefore, the path $P$ takes the form $P = v - \dots \to a \leftarrow \dots - u$. All the colliders on the path $P$ either belong to, or have a descendant in, the set $S \cup \{a\}$. Now consider the subpath $v - \dots \to a$ and note that it is active given $S$. Let $k$ be the number of nodes between $v$ and $a$ on this subpath $v - v_1 - \dots - v_k \to a$. It is immediate that $v_k \in S$, since $a \in \mathrm{src}(\bar{S})$. However, since $v_k \in S$ and we condition on the set $S$, the node $v_k$ would have to be a collider for this subpath to be active, which is not possible since the edge $v_k \to a$ points out of $v_k$. Thus we get a contradiction, which completes the proof. ∎

Proof of Lemma 7. Since $w \in D_S^m$, there exist two nodes $u \in V$, $v \in \bar{S}$ and a set $W \subset \bar{S}$ with $|W| = m$ such that $u \perp v \mid S \cup W$ and $u \not\perp v \mid S \cup W \cup \{w\}$. Let $S' = S \cup W$. We have that $w \in A_{S'}$ and, by Lemma 14, $\mathrm{des}[w] \subset A_{S'}$ and $A_{S'} \cap \mathrm{src}(\bar{S'}) = \emptyset$. Since $A_{S'} \subset D_S^m$, we have that $\mathrm{des}[w] \subset D_S^m$. Additionally, $\mathrm{src}(\bar{S}) \subset W \cup \mathrm{src}(\bar{S'})$ and $w \notin W$, therefore $w \notin \mathrm{src}(\bar{S})$. ∎

B.2 Proof of Lemma 8

To prove Lemma 8, we will make use of the following result.

Lemma 15. Let $S$ be a prefix set.
Let $w \in \bar{S}$ be such that $w \notin F_S^1 \cup \dots \cup F_S^{m-1}$ and $w \in F_S^m$. Let $u \in S$ and $W \subset \bar{S}$ be such that $u \not\perp w \mid S$ and $u \perp w \mid S \cup W$. Then $W \subset \mathrm{anc}(w)$.

Proof. Since $u \not\perp w \mid S \cup W'$ for all proper subsets $W' \subset W$, by Lemma 12 we have $W \subset \mathrm{anc}(u) \cup \mathrm{anc}(w)$. However, given that $u \in S$, $S$ is a prefix set and $W \subset \bar{S}$, it necessarily follows that $W \subset \mathrm{anc}(w)$. ∎

Proof of Lemma 8. We first show that if $w \in F_S^m \setminus (F_S^1 \cup \dots \cup F_S^{m-1})$, then for any $y \in \mathrm{des}(w)$ we have $y \in F_S^m \cup D_S^m$. Since $w \in F_S^m$, we have $u \not\perp w \mid S$ and $u \perp w \mid S \cup W$ for some $u \in S$. Take the active path between $u$ and $w$ given $S$ and extend it by the directed path from $w$ to $y$. Note that none of the nodes on the directed path from $w$ to $y$ are in $S$, since $S$ is prefix and $w \notin S$. Therefore, this extended path is also active given $S$, which means $u \not\perp y \mid S$. Thus, if $y \notin F_S^m$, then it must hold that $u \not\perp y \mid S \cup W$. This means there is an active path, denoted by $P$, between $u$ and $y$ given $S \cup W$. Consider extending this path by the directed path from $w$ to $y$, and denote the result by $Q$. Compared to $P$, the additional non-colliders on $Q$ are not in $S \cup W$: for $S$, this is because $S$ is prefix, $w \notin S$, and all additional non-colliders are descendants of $w$; for $W$, this is because by Lemma 15 we have $W \cap \mathrm{des}(w) = \emptyset$. Thus, $Q$ is active given $S \cup W$, unless $y$ is a collider on $Q$. Since $u \perp w \mid S \cup W$, the path $Q$ must be inactive given $S \cup W$, which means $y$ is a collider on $Q$. This means $Q$ is active given $S \cup W \cup \{y\}$. Therefore, $u \not\perp w \mid S \cup W \cup \{y\}$. Together with $u \perp w \mid S \cup W$, we have $y \in D_S^m$.

Next we show that $F_S^m \cap \mathrm{src}(\bar{S}) = \emptyset$. Since $w \in F_S^m$, we have $u \not\perp w \mid S$ and $u \perp w \mid S \cup W$ for some $u \in S$ and $W \subset \bar{S}$ with $|W| = m$. Let $W' \subset \bar{S}$ be a smallest set such that $u' \perp w \mid S \cup W'$ for some $u' \in S$. We must have $1 \le |W'| \le |W|$. Additionally, Lemma 15 shows that $W' \subset \mathrm{anc}(w)$.
Since $W'$ is a non-empty subset of $\bar{S}$ containing only ancestors of $w$, we have $w \notin \mathrm{src}(\bar{S})$. ∎

B.3 Proof of Theorem 1

We develop the proof of Theorem 1 through a series of intermediate results. Our first key lemma establishes that any node serving as the collider of a v-structure is necessarily contained within a set $D_S^m$.

Lemma 16. Let $S \subset V$ be a prefix node set. If there are three nodes $u \in V$ and $k, v \in \bar{S}$ that form a v-structure $u \rightarrow k \leftarrow v$, then there exists a set $W \subset \bar{S}$ such that $u \perp v \mid S \cup W$ and $u \not\perp v \mid S \cup W \cup \{k\}$.

Proof. The path $u \rightarrow k \leftarrow v$ is active given $k$, and therefore $u \not\perp v \mid S \cup W \cup \{k\}$ for any $W \subset \bar{S}$. We now show that for $W = (\mathrm{pa}(u) \cup \mathrm{pa}(v)) \setminus S$ we have $u \perp v \mid S \cup W$. If $u \not\perp v \mid S \cup W$, then there exists an active path between $u$ and $v$ given $S \cup W$. Let $u - u' - \dots - v' - v$ be this path. If we had $u \leftarrow u'$ or $v' \rightarrow v$, then the path would be inactive, since $u', v' \in S \cup W$ would be conditioned non-colliders. We therefore have $u \rightarrow u' - \dots - v' \leftarrow v$, and the path must contain at least one collider. Let $c$ be a collider on this path; thus $c \in \mathrm{anc}[S \cup W]$. If $c \in \mathrm{anc}[S]$, then since $S$ is prefix, $c \in S$. In this case, the nodes on the path adjacent to $c$ are also in $S$ and the path must be inactive. By acyclicity, if there is a collider $c \in \mathrm{anc}[W]$ on the path, then there must be at least two colliders. Let $c$ and $c'$ be the leftmost and rightmost colliders on the path. We must have $c \in \mathrm{anc}[\mathrm{pa}(v)] \setminus S$ and $c' \in \mathrm{anc}[\mathrm{pa}(u)] \setminus S$, so the graph contains a cycle $c' \rightarrow \dots \rightarrow u \rightarrow \dots \rightarrow c \rightarrow \dots \rightarrow v \rightarrow \dots \rightarrow c'$, a contradiction to $\mathcal{G}$ being a DAG. ∎

We now provide a result characterizing specific nodes in $D_S^m$.

Lemma 17. Let $S$ be a prefix set. If $w \in D_S^m$ and $\mathrm{anc}(w) \cap D_S^m = \emptyset$, then $w$ must serve as the collider of a v-structure $a \rightarrow w \leftarrow b$.

Proof. Since $w \in D_S^m$, there exist $u \in V$, $v \in \bar{S}$ and $W \subset \bar{S}$ with $|W| = m$ such that $u \perp v \mid S \cup W$ and $u \not\perp v \mid S \cup W \cup \{w\}$.
Additionally, the active path between $u$ and $v$ given $S \cup W \cup \{w\}$ must contain a v-structure $a \rightarrow c \leftarrow b$ with $c \in \mathrm{anc}[w] \cap D_S^m$. But $\mathrm{anc}(w) \cap D_S^m = \emptyset$, and therefore $w = c$. Thus, $a \rightarrow w \leftarrow b$ is a v-structure. ∎

We then establish a relationship between the set $F_S^m$ and undirected cliques, which follows from the observation that $F_S^m$ includes downstream nodes of edges oriented by Meek rules.

Lemma 18. Let $S$ be a prefix set. Let $m > 0$ and $V' \subset \bar{S}$ be such that:
1. there does not exist a v-structure $a \rightarrow c \leftarrow b$ with $c, b \in V'$;
2. there is an edge between nodes in $V'$ oriented by a Meek rule;
3. $S \cup V'$ is prefix;
4. $V' \cap (F_S^1 \cup \dots \cup F_S^{m-1}) = \emptyset$.
Then there exists an undirected clique of size at least $m$ in the subgraph induced by $V'$ in $\mathcal{E}(\mathcal{G})$.

Proof. Let $V' \subset \bar{S}$ satisfy conditions (1)–(4). By condition (2), there exists an edge $u \rightarrow v$ in $\mathcal{E}(\mathcal{G})$ with $u, v \in V'$ that is oriented by a Meek rule. The application of a Meek rule to orient an edge $u \rightarrow v$ requires at least one previously oriented edge between nodes in $\mathrm{anc}(v)$, as observed in Proposition 10. Thus, the orientation of any edge using a Meek rule depends on prior orientations within $\mathrm{anc}(v)$, and we can trace this chain of dependencies backward through ancestors. Since $\mathcal{G}$ is acyclic and finite, this dependency chain must terminate. Its termination guarantees the existence of an edge $x \rightarrow y$ between two nodes $x, y \in \bar{S}$ such that there are no previously oriented edges in $\mathcal{E}(\mathcal{G})$ between nodes in $\mathrm{anc}(y) \cap \bar{S}$. Condition (3) implies $x, y \in V'$, and condition (1) requires this edge to be oriented by a Meek rule. Thus, $x \rightarrow y$ is oriented by directed edges between nodes in $S$ and nodes in $V'$, or by directed edges between nodes in $S$. We now determine which Meek rule must have oriented this edge, proceeding by eliminating rules 2, 3 and 4.

R2: requires a node $z$ such that $x \rightarrow z$ and $z \rightarrow y$.
Since $x, z \in \mathrm{anc}(y)$ and $x \rightarrow z$, we must have $z \in S$; but then, since $S$ is prefix, we must also have $x \in S$, a contradiction.

R3: requires $x, y$ to form a v-structure with another node $z \in V$, a contradiction.

R4: requires nodes $z, z'$ such that $z \rightarrow z'$, $z' \rightarrow y$, $z \sim x$, $z' \sim x$ and $z \not\sim y$. Since $z, z' \in \mathrm{anc}(y)$, we must have $z \in S$. Together with $z \sim x$ and $z \not\sim y$, this means we have $z \rightarrow x$, and therefore $x \rightarrow y$ is also oriented by Meek rule 1.

Therefore, $x \rightarrow y$ is oriented by Meek rule R1. This implies there exists $z \in S$ such that $z \rightarrow x$ and $z \not\sim y$ in $\mathcal{E}(\mathcal{G})$. By condition (4) we have $y \notin F_S^1 \cup \dots \cup F_S^{m-1}$. Therefore, $z \not\perp y \mid S \cup W$ for any $W \subset \bar{S}$ with $|W| < m$. Let $K = \mathrm{pa}(y) \setminus S$. Since $z \in S$ we have $\mathrm{pa}(z) \subseteq S$, and therefore $z \perp y \mid S \cup K$. Thus, $|K| \ge m$. By condition (1), all nodes in $K$ must be pairwise adjacent, since two non-adjacent parents of $y$ would form a v-structure. Finally, since we selected $x \rightarrow y$ such that there is no directed edge in $\mathcal{E}(\mathcal{G})$ between nodes in $\mathrm{anc}(y) \cap \bar{S}$, the set $K \subset \mathrm{anc}(y) \cap \bar{S} \subset V'$ must form an undirected clique of size at least $m$ in the essential graph. ∎

Corollary 19. Let $S$ be a prefix set. Let $m > 0$ and $V' \subset \bar{S}$ be such that:
1. there does not exist a v-structure $a \rightarrow c \leftarrow b$ with $c, b \in V'$;
2. there exists a clique $W \subset V'$ of size at least $m$;
3. $S \cup V'$ is prefix;
4. $V' \cap (F_S^1 \cup \dots \cup F_S^{m-1}) = \emptyset$.
Then there exists an undirected clique of size at least $m$ in the subgraph induced by $V'$ in $\mathcal{E}(\mathcal{G})$.

Proof. If $W$ is undirected in the essential graph, we are done. Otherwise, some edge of $W$ is oriented in the essential graph and, by condition (1), it must have been oriented by a Meek rule. By Lemma 18, there exists an undirected clique of size at least $m$ in the subgraph induced by $V'$ in $\mathcal{E}(\mathcal{G})$. ∎

We now extend the characterization of minimal separators in chordal undirected graphs to DAGs. The proof of this extension uses the moral graph construction and the notion of separation in undirected graphs.
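The moral graph construction just mentioned can be sketched in a few lines of plain Python (an illustrative helper of ours, independent of the NetworkX-based implementation used in the paper):

```python
def moral_graph(parents):
    """Build the moral graph of a DAG given as {node: set of parents}.

    Parents sharing a child are joined ("married"), then all edge
    directions are dropped, yielding an undirected adjacency dict.
    """
    adj = {v: set() for v in parents}
    for child, pas in parents.items():
        for p in pas:
            # drop the arrowhead: link child and parent undirectedly
            adj[child].add(p)
            adj[p].add(child)
        for a in pas:
            for b in pas:
                if a != b:
                    # marry parents of a common child
                    adj[a].add(b)
    return adj
```

For instance, moralizing the v-structure $0 \rightarrow 2 \leftarrow 1$ adds the edge $0 - 1$, which is exactly why colliders stop blocking paths once their parents are married.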
For a DAG $\mathcal{G}$, its moral graph, denoted $\mathcal{G}^m$, is the undirected graph created by adding edges between parents sharing a common child in $\mathcal{G}$ and then removing all arrowheads. In an undirected graph, a node set $C$ separates two disjoint node sets $A$ and $B$ if all paths between $A$ and $B$ intersect $C$. The following result is a key component of our argument.

Proposition 20 (Proposition 3.25 in Lauritzen (1996)). Let $A$, $B$ and $C$ be disjoint subsets of $V$. Then $C$ d-separates $A$ from $B$ in $\mathcal{G}$ if and only if $C$ separates $A$ from $B$ in $(\mathcal{G}[\mathrm{anc}[A \cup B \cup C]])^m$.

Lemma 21. Let $S \subset V$ be a prefix node set. Suppose that for some $u \in V$, $v \in \bar{S}$ and $W \subset \bar{S}$ we have $u \perp v \mid S \cup W$ and $u \not\perp v \mid S \cup W'$ for all $W' \subset \bar{S}$ with $|W'| < |W|$, and that $(\mathrm{anc}[u] \cup \mathrm{anc}[v]) \setminus S$ contains no v-structures. Then $W$ is a clique in $\mathcal{G}$.

Proof. Let $\mathcal{H} = \mathcal{G}[\bar{S} \cup \{u\}]$. By Lemma 11, $u \perp_{\mathcal{G}} v \mid S \cup W'$ for some $W' \subset \bar{S}$ if and only if $u \perp_{\mathcal{H}} v \mid W'$. By Lemma 12, $W \subseteq \mathrm{anc}(u) \cup \mathrm{anc}(v)$ and therefore $\mathrm{anc}[\{u,v\} \cup W] = \mathrm{anc}[\{u,v\}]$. Let $\mathcal{F} = \mathcal{H}[\mathrm{anc}[\{u,v\}]]$. Applying Proposition 20, for any $W' \subset \mathrm{anc}(\{u,v\})$, $W'$ d-separates $u$ and $v$ in $\mathcal{H}$ if and only if $W'$ separates $u$ and $v$ in $\mathcal{F}^m$. As $(\mathrm{anc}[u] \cup \mathrm{anc}[v]) \setminus S$ does not contain any v-structures, $\mathcal{F}^m$ has no added adjacencies, and any cycle of four or more nodes has a chord. Therefore, $\mathcal{F}^m$ is a chordal graph. Since $W$ is a minimal separating set in a chordal graph, by Theorem 1 in Dirac (1961) it is a clique in $\mathcal{F}^m$. Since $\mathcal{F}^m$ has the same adjacencies as $\mathcal{F}$, $W$ forms a clique in $\mathcal{F}$, and also in $\mathcal{G}$, because $\mathcal{F}$ is a subgraph of $\mathcal{G}$. ∎

The following result establishes that if a node belongs to $D_S^m$, then a clique must exist among its ancestors.

Lemma 22. Let $S$ be a prefix set.
If $w \notin D_S^0 \cup \dots \cup D_S^{m-1}$ and $w \in D_S^m$, then there exists a set of nodes $W \subset (\mathrm{anc}(w) \cap \bar{S}) \setminus D_S^m$ of size at least $m$ that forms a clique in $\mathcal{G}$.

Proof. Let $c$ be a node in $\mathrm{anc}[w] \cap \bar{S}$ that acts as the collider of a v-structure $a \rightarrow c \leftarrow b$, with $a \in V$ and $b \in \bar{S}$, chosen such that no node in $\mathrm{anc}(c)$ also serves as such a collider. This node must exist because $\mathcal{G}$ is acyclic and has finitely many nodes. From Lemma 16, we know there exists a set $W \subset \bar{S}$ that separates $a$ and $b$ given $S$, that is, $a \perp b \mid S \cup W$. By the selection of $c$, the set $\mathrm{anc}(c) \cap \bar{S}$ contains no v-structures. As $a, b \in \mathrm{anc}(c)$, it follows that $(\mathrm{anc}[a] \cup \mathrm{anc}[b]) \setminus S$ contains no v-structures. Then, by Lemma 21, the smallest set $W$ satisfying $a \perp b \mid S \cup W$ must form a clique. Now, because $a \rightarrow c \leftarrow b$ forms a v-structure, conditioning on the collider $c$ induces a dependence; specifically, $a \not\perp b \mid S \cup W \cup \{c\}$ holds, where $W$ is the smallest separating set identified previously. Furthermore, we are given $w \notin D_S^0 \cup \dots \cup D_S^{m-1}$, and by Lemma 7 it follows that $c \notin D_S^0 \cup \dots \cup D_S^{m-1}$. This condition on $c$ implies $|W| \ge m$. ∎

Finally, we establish the guarantees of Algorithm 1, as stated in Theorem 1, through the sequence of lemmas presented below.

Lemma 23. In Algorithm 1, if at any iteration the working graph $\mathcal{E}$ contains a clique of size at least $\ell$ in $V'$, then $\mathcal{E}(\mathcal{G})$ restricted to the nodes in $V'$ also contains an undirected clique of size at least $\ell$.

Proof. We will prove that $\mathcal{E}(\mathcal{G})$ restricted to $V'$ contains an undirected clique of size at least $\ell$. By Theorem 6, $S \cup V'$ is a prefix set. If $V'$ contains a v-structure, let $a \rightarrow c \leftarrow b$ be a v-structure such that $c, b \in V'$ and such that there is no v-structure $a' \rightarrow c' \leftarrow b'$ with $c', b' \in \mathrm{anc}(c) \cap \bar{S}$. Note that $|V'|$ is finite, and therefore such a v-structure must exist.
By Lemma 16 we have $c \in D_S^k$ for some $k \ge \ell$. By Lemma 22, there exists a clique $W$ of size at least $k$ in $\mathrm{anc}(c) \cap \bar{S}$. We also have that $S \cup (\mathrm{anc}(c) \cap \bar{S})$ is prefix and, since $c \notin D_S^0 \cup \dots \cup D_S^{\ell-1} \cup F_S^1 \cup \dots \cup F_S^{\ell-1}$, by Lemma 8 we have $(\mathrm{anc}(c) \cap \bar{S}) \cap (F_S^1 \cup \dots \cup F_S^{\ell-1}) = \emptyset$. By Corollary 19, there exists an undirected clique of size at least $\ell$ in $\mathrm{anc}(c) \cap \bar{S}$.

If $V'$ does not contain a v-structure, we proceed by contradiction. Assume that there does not exist a clique of size at least $\ell$ in $\mathcal{G}[V']$. Then we must have $u \sim v$ in $\mathcal{E}$ but not in $\mathcal{G}$ for some $u, v \in V'$. This means that there exists a minimal subset $W \subset V'$ with $|W| \ge \ell$ such that $u \perp v \mid S \cup W$. We have that $(\mathrm{anc}[u] \cup \mathrm{anc}[v]) \setminus S$ does not contain v-structures and, by Lemma 21, $W$ must be a clique in $\mathcal{G}$. Since $|W| \ge \ell$, by Corollary 19 there is an undirected clique of size at least $\ell$ in the essential graph restricted to $V'$. ∎

Lemma 24. Let $S_1, \dots, S_m$ be the partition of $V$ created by Algorithm 1 and let $u \rightarrow c \leftarrow v$ be a v-structure in $\mathcal{G}$. Then $u \in S_i$, $v \in S_j$ and $c \in S_k$ with $k > i, j$.

Proof. Without loss of generality, assume $j \ge i$. Since $S_1 \cup \dots \cup S_j$ is prefix by Theorem 6, we must have $k \ge j$. Assume for contradiction that $k = j$. Let $S = S_1 \cup \dots \cup S_{j-1}$. By Lemma 16, we have $u \perp v \mid S \cup W$ and $u \not\perp v \mid S \cup W \cup \{c\}$ for some $W \subset \bar{S}$. Let $W' \subset \bar{S}$ be a smallest set satisfying this and let $\ell = |W'|$. We then have $c \in D_S^{\ell}$ and $c \notin D_S^1 \cup \dots \cup D_S^{\ell-1}$. By Lemma 22, there exists a clique of size at least $\ell$ in $\mathrm{anc}(c) \cap \bar{S} \subseteq S_j$, meaning $D_S^{\ell}$ must have been computed, a contradiction to $c \in S_j$. ∎

Lemma 25. The graph created by Algorithm 1 has the same skeleton as $\mathcal{G}$.

Proof. We proceed by contradiction. Let $\mathcal{E}$ denote the graph created by Algorithm 1 and assume that $\mathcal{E}$ and $\mathcal{G}$ do not have the same skeleton. Let $u, v \in V$.
If $u \sim v$ in $\mathcal{G}$, we have $u \not\perp v \mid W$ for all $W \subset V$. Therefore, if $\mathrm{sk}(\mathcal{G}) \ne \mathrm{sk}(\mathcal{E})$, there must exist $u \in S_i$ and $v \in S_j$ such that $u \sim v$ in $\mathcal{E}$ but not in $\mathcal{G}$. Denote $S = S_1 \cup \dots \cup S_{j-1}$. There must exist a minimal set $W \subset S_j$ such that $u \perp v \mid S \cup W$. By Lemma 24, $(\mathrm{anc}[u] \cup \mathrm{anc}[v]) \setminus S$ does not contain any v-structures and, by Lemma 21, $W$ forms a clique. This contradicts Algorithm 1 not removing this edge. ∎

Lemma 26. Let $S_1, \dots, S_m$ be the partition of $V$ created by Algorithm 1. For all $u, v \in V$, if $u \rightarrow v$ in $\mathcal{E}(\mathcal{G})$, then $u \in S_i$ and $v \in S_j$ with $i < j$.

Proof. We prove this by contradiction. Assume there exists a directed edge $u \rightarrow v$ in $\mathcal{E}(\mathcal{G})$ such that both $u$ and $v$ belong to the same component $S_k$. By Lemma 24, for any v-structure $a \rightarrow c \leftarrow b$ in $\mathcal{G}$ we have $a \in S_i$, $b \in S_j$, $c \in S_k$ with $k > i, j$. This implies that if $u \rightarrow v$ in $\mathcal{E}(\mathcal{G})$ with $u, v$ in the same component, the edge must have been oriented by a Meek rule. Let $S = S_1 \cup \dots \cup S_{k-1}$ and let $s'$ denote the size of the largest clique in $\mathcal{E}(\mathcal{G})[S_k]$. By Theorem 6, $S$ is a prefix set. By Lemma 24, there does not exist a v-structure $a \rightarrow c \leftarrow b$ with $c, b \in S_k$. By definition, we have $S_k \cap (D_S^0 \cup \dots \cup D_S^{s'} \cup F_S^1 \cup \dots \cup F_S^{s'}) = \emptyset$ and, by Lemma 18, there must exist a clique in $\mathcal{E}(\mathcal{G})[S_k]$ of size strictly larger than $s'$, a contradiction. ∎

Proof of Theorem 1. We show that Algorithm 1 outputs the essential graph of $\mathcal{G}$ and that it performs at most $p^{\mathcal{O}(s)}$ CI tests. Let $\mathcal{E}$ denote the output of Algorithm 1. By Lemma 25, $v$ and $w$ are adjacent in $\mathcal{E}$ if and only if $v$ and $w$ are adjacent in $\mathcal{G}$. We now prove that if $v \rightarrow w$ in $\mathcal{E}$, then also $v \rightarrow w$ in $\mathcal{G}$. If $v \rightarrow w$ in $\mathcal{E}$, we must have $v \in S_i$ and $w \in S_j$ for some $i < j$. By Theorem 6, $S = S_1 \cup \dots \cup S_i$ is prefix; therefore, we have $v \rightarrow w$ in $\mathcal{G}$. Finally, using Lemma 26, we have $v \rightarrow w$ in $\mathcal{E}$ if and only if $v \rightarrow w$ in $\mathcal{E}(\mathcal{G})$.
Therefore, $\mathcal{E} = \mathcal{E}(\mathcal{G})$. By Lemma 23, the algorithm computes the sets $D_S^m$ and $F_S^m$ only up to $m \le s$, where $s$ is the size of the largest undirected clique in the essential graph of $\mathcal{G}$. By Theorem 6, each iteration requires at most $\mathcal{O}(p^{s+3})$ CI tests. Thus, the algorithm uses $\mathcal{O}(p^{s+4})$ CI tests, proving Theorem 1. ∎

Appendix C Step-by-step example of Algorithm 1

We now provide a step-by-step example of Algorithm 1.

Example 27. Suppose we have a probability distribution $\mathbb{P}$ that is Markov and faithful to the graph below on nodes $\{0, 1, 2, 3, 4\}$. In this case, the graph and the essential graph are equal. [Figure: the ground-truth graph.]

Step 0 (Initialization, Lines 3–5). We initialize an empty prefix node set $S = \emptyset$, an empty list $\mathcal{S}$, and a complete undirected graph $\mathcal{E}$. [Figure: the complete working graph.]

Step 1 (Prefix node set expansion, Lines 7–23). We now learn the partial ordering of the essential graph through $D_S^m$ and $F_S^m$, while establishing adjacencies. The expansion step is repeated as long as the prefix node set does not contain all nodes ($S \ne V$). Each expansion operates on a set of working nodes, denoted $V'$, which excludes the prefix set and any identified downstream nodes. For each expansion, we indicate the prefix node set $S$; the set of working nodes $V'$ is specified only when it differs from $\bar{S}$ (the set of nodes not in $S$). For each $\ell$, we remove the edge between two nodes if there exists a set $W \subseteq V'$ with $|W| = \ell$ that renders them independent. In the accompanying figures, nodes identified as being in $D_S^{\ell}$ are colored blue, while nodes in $F_S^{\ell}$ are colored green.

• First prefix node set expansion ($S = \emptyset$, $\mathcal{S} = [\,]$). Nodes $0$ and $1$ are found to be conditionally independent given the empty set, so the edge between them is removed. Furthermore, we find that conditioning on node $2$ makes $0$ and $1$ dependent (unblocking the v-structure $0 \rightarrow 2 \leftarrow 1$). Similarly, conditioning on $3$ or $4$ (descendants of the collider) also makes $0$ and $1$ dependent. Therefore, we identify $D_S^0 = \{2, 3, 4\}$.
The set of working nodes is then updated for the next iteration: $V' = V' \setminus D_S^0 = \{0,1,2,3,4\} \setminus \{2,3,4\} = \{0,1\}$. For $\ell = 1$, no additional adjacencies are removed. The expansion stops here since no clique of size $2$ exists in the current working set $V' = \{0,1\}$. [Figures: (a) $\ell = 0$; (b) $\ell = 1$.]

• Second prefix node set expansion ($S = \{0,1\}$, $\mathcal{S} = [\{0,1\}]$). For $\ell = 0$, no new independencies are found. For $\ell = 1$, we remove several edges based on the conditional independencies found. For instance, the edge $(0,3)$ is removed because $0 \perp 3 \mid (S \setminus \{0\}) \cup \{2\}$. All other non-adjacencies in the true graph are also discovered and their corresponding edges removed during this phase. Additionally, we find $0 \not\perp 3 \mid S \setminus \{0\}$ and $0 \perp 3 \mid (S \setminus \{0\}) \cup \{2\}$, which identifies $3 \in F_S^1$. Similarly, $1 \perp 3 \mid (S \setminus \{1\}) \cup \{2\}$ and $1 \not\perp 3 \mid (S \setminus \{1\}) \cup \{2,4\}$ identifies $4 \in D_S^1$. The working set is updated to $V' = V' \setminus (F_S^1 \cup D_S^1) = \{2,3,4\} \setminus (\{3\} \cup \{4\}) = \{2\}$. Finally, since $V'$ contains no clique of size $2$, this expansion stops here. [Figures: (a) $\ell = 0$; (b) $\ell = 1$.]

• Third prefix node set expansion ($S = \{0,1,2\}$, $\mathcal{S} = [\{0,1\}, \{2\}]$). We find $2 \not\perp 4 \mid S \setminus \{2\}$ and $2 \perp 4 \mid (S \setminus \{2\}) \cup \{3\}$, so $4 \in F_S^1$. [Figures: (a) $\ell = 0$; (b) $\ell = 1$.]

• Fourth prefix node set expansion ($S = \{0,1,2,3\}$, $\mathcal{S} = [\{0,1\}, \{2\}, \{3\}]$). Only one node remains outside of $S$. No adjacencies are removed. [Figures: (a) $\ell = 0$; (b) $\ell = 1$.]

• End of prefix node set expansion ($S = \{0,1,2,3,4\}$, $\mathcal{S} = [\{0,1\}, \{2\}, \{3\}, \{4\}]$).

Step 2 (Edge replacement, Lines 24–25). Since all remaining edges connect nodes in different components of the partition $\mathcal{S} = [S_1, S_2, S_3, S_4]$, we orient them from components with smaller indices to those with larger indices.

Step 3 (End, Line 26). The essential graph is now fully recovered.
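The CI oracle driving this walkthrough can be mimicked with a small d-separation routine built on Proposition 20: moralize the ancestral subgraph, then check undirected separation. The sketch below is ours; since the example's figures are omitted here, it assumes for illustration the DAG $0 \rightarrow 2 \leftarrow 1$, $2 \rightarrow 3$, $3 \rightarrow 4$, which is consistent with the first expansion's findings.

```python
from collections import deque

def ancestors(parents, nodes):
    """Return anc[nodes] (the nodes together with all their ancestors)."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, u, v, S):
    """d-separation test via Proposition 20: moralize the subgraph
    induced by anc[{u, v} | S], then check that S blocks every
    undirected path from u to v."""
    A = ancestors(parents, {u, v} | set(S))
    # moralize the induced subgraph: marry parents, drop directions
    adj = {x: set() for x in A}
    for c in A:
        pas = [p for p in parents[c] if p in A]
        for p in pas:
            adj[c].add(p)
            adj[p].add(c)
        for a in pas:
            for b in pas:
                if a != b:
                    adj[a].add(b)
    # undirected reachability from u to v, avoiding the separator S
    frontier, seen = deque([u]), {u} | set(S)
    while frontier:
        for y in adj[frontier.popleft()]:
            if y == v:
                return False
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return True
```

On this assumed graph, the routine reproduces the first expansion: $0$ and $1$ are separated by the empty set, but conditioning on the collider $2$, or on its descendant $4$, connects them.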
Appendix D Example of correctness without faithfulness

The following example serves to prove Proposition 3.

Example 28. Suppose we have a probability distribution $\mathbb{P}$ modeled by the graph $\mathcal{G}$ in Figure 12, such that the random variables $X_i$ are related by the structural equations
$$X_j = \sum_{i<j} a_{ij} X_i + \varepsilon_j \qquad (3)$$
where $\varepsilon_j \sim \mathcal{N}(0,1)$ and $a_{ij} \in [-1,1]$ are parameters with $a_{ij} \ne 0$ if and only if $X_i$ causes $X_j$.

Figure 12: A DAG on $X_0, X_1, X_2, X_3$ with edge weights $a_{01}, a_{03}, a_{13}, a_{23}$, demonstrating how Algorithm 1 can recover the correct causal graph when the generating probability distribution is Markov but not faithful to that graph.

According to Proposition 4.1 in Uhler et al. (2013), the condition
$$a_{03}a_{13} - a_{01} = 0 \qquad (4)$$
holds if and only if the conditional independence $X_0 \perp X_1 \mid \{X_2, X_3\}$ is induced by $\mathbb{P}$. Therefore, if Equation (4) is satisfied, this constitutes a faithfulness violation. By selecting the $a_{ij}$ such that this is the only faithfulness violation present, Algorithm 1 still recovers the true essential graph $\mathcal{E}(\mathcal{G})$. This is because the algorithm identifies $X_3$ as a downstream node very early. Specifically, for $S = \emptyset$, the algorithm finds that $X_1$ and $X_2$ are independent but become dependent when conditioning on $X_3$. This reveals the v-structure $X_1 \rightarrow X_3 \leftarrow X_2$ and places $X_3$ in $D_S^0$. Because $X_3$ is identified as downstream, it is excluded from certain future conditioning sets, meaning the test $X_0 \perp X_1 \mid \{X_2, X_3\}$, which represents the only faithfulness violation, is never performed by Algorithm 1.

Appendix E Implementation details

All graph manipulations required by Algorithm 1 are implemented using the NetworkX Python library (Hagberg et al., 2008), available under the 3-clause BSD license. We now provide the specific implementations of the steps in Algorithm 1.
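Before turning to those implementations, the algebra behind Equation (4) in Example 28 can be checked directly. For the model $X = A^{\top}X + \varepsilon$ with upper-triangular weight matrix $A$ and unit noise variances, the precision matrix is $\Theta = (I - A)(I - A)^{\top}$, and $X_0 \perp X_1 \mid \{X_2, X_3\}$ holds exactly when $\Theta_{01} = 0$; expanding the product gives $\Theta_{01} = a_{03}a_{13} - a_{01}$. A minimal sketch of this computation (the helper name is ours):

```python
def theta_01(a01, a03, a13, a23):
    """Entry Theta[0][1] of the precision matrix (I - A)(I - A)^T for
    the SEM of Example 28 (edges X0->X1, X0->X3, X1->X3, X2->X3, unit
    noise variances). It vanishes iff X0 _||_ X1 | {X2, X3}."""
    M = [  # M = I - A, with A[i][j] = a_ij on the DAG's edges
        [1.0, -a01, 0.0, -a03],
        [0.0, 1.0, 0.0, -a13],
        [0.0, 0.0, 1.0, -a23],
        [0.0, 0.0, 0.0, 1.0],
    ]
    # Theta = M M^T; the (0, 1) entry decides the CI statement
    return sum(M[0][k] * M[1][k] for k in range(4))
```

With $a_{01} = a_{03}a_{13}$ the entry vanishes, so the CI holds despite the edge $X_0 \rightarrow X_1$: precisely the faithfulness violation the example exploits.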
Algorithm 2 Algorithm used to remove edges from $\mathcal{E}$ in Algorithm 1
1: Input: CI queries from $\mathbb{P}$, graph $\mathcal{E}$, prefix set $S$, nodes $V'$, condition set size $m$.
2: Output: $\mathcal{E}$, sepset.
3: for $u \in V$, $v \in V'$ such that $u - v$ in $\mathcal{E}$ do
4:   repeat
5:     Choose a new set $W \subseteq V' \setminus \{u,v\}$ with $|W| = m$;
6:     if $u \perp v \mid S \cup W$ then
7:       Remove $u - v$ from $\mathcal{E}$;
8:       sepset$(u,v) \leftarrow W$;
9:   until $u$ and $v$ are no longer adjacent or all sets $W$ have been considered
10: return $\mathcal{E}$, sepset

Algorithm 3 Algorithm used to calculate $D_S^m$
1: Input: CI queries from $\mathbb{P}$, graph $\mathcal{E}$, prefix set $S$, nodes $V'$, sepset from Algorithm 2.
2: Output: $D_S^m$.
3: $D_S^m \leftarrow \emptyset$;
4: $P \leftarrow \emptyset$;
5: for $u \in V$, $v \in V'$ such that $|\mathrm{sepset}(u,v) \setminus S| = m$ do  ▷ Identify v-structures
6:   for $w$ in $\mathrm{adj}(\mathcal{E}, u) \cap \mathrm{adj}(\mathcal{E}, v) \cap V'$ do
7:     if $w$ not in sepset$(u,v)$ then
8:       $D_S^m \leftarrow D_S^m \cup \{w\}$;
9:       $P \leftarrow P \cup \{\{u,v\}\}$;
10: for $\{u,v\}$ in $P$ do  ▷ Identify descendants of the v-structures
11:   for $w$ in $V' \setminus D_S^m$ do
12:     if $u \not\perp v \mid S \cup \mathrm{sepset}(u,v) \cup \{w\}$ then
13:       $D_S^m \leftarrow D_S^m \cup \{w\}$;
14: return $D_S^m$

Algorithm 4 Algorithm used to calculate $F_S^m$
1: Input: CI queries from $\mathbb{P}$, prefix set $S$, nodes $V'$, sepset from Algorithm 2.
2: Output: $F_S^m$.
3: $F_S^m \leftarrow \emptyset$;
4: for $u \in S$, $v \in V'$ such that $|\mathrm{sepset}(u,v)| > 0$ do
5:   repeat
6:     Choose a new set $W \subseteq V' \setminus \{u,v\}$ with $|W| = m$;
7:     if $u \perp v \mid S \cup W$ then
8:       $F_S^m \leftarrow F_S^m \cup \{v\}$;
9:   until $v \in F_S^m$ or all sets $W$ have been considered
10: return $F_S^m$

To solve the $k$-clique problem, we adapt the Bron–Kerbosch algorithm (Bron and Kerbosch, 1973), originally designed for enumerating maximal cliques, by terminating the search as soon as a clique of size $k$ is found. To efficiently compute the sets $D_S^m$ and $F_S^m$, we remove adjacencies as soon as a separating set is found (see Algorithm 2) and store these separating sets. The calculation of $D_S^m$ (detailed in Algorithm 3) is a two-stage process.
First, it identifies initial v-structures by iterating through non-adjacent node pairs $(u,v)$ and checking whether a common neighbor $w$ is absent from their separating set, reusing the results from Algorithm 2. Second, it finds all descendants of these v-structures by performing targeted additional CI tests to see which nodes $w$, when added to a conditioning set, break a known independence. Lemma 17 guarantees that the set $D_S^m$ consists exclusively of colliders of v-structures and their descendants.

The calculation of $F_S^m$ (Algorithm 4) identifies the desired nodes by leveraging the previously identified separating sets. The definition of $w \in F_S^m$ requires two conditions: (1) $u \not\perp w \mid S$ for some $u \in S$, and (2) $u \perp w \mid S \cup W$ for some $W \subseteq \bar{S}$ of size $m$. The first condition is implicitly satisfied by Line 4. Therefore, simply finding a separating set $W$ of size $m$ for a pair $(u,w)$ with $u \in S$ is sufficient to place $w$ in $F_S^m$.

Appendix F Algorithm variant with additional tests

We introduce a variant of Algorithm 1, which we call GAS+, that replaces the final orientation loop (Lines 24–25) with additional CI tests (see Algorithm 5). This variant returns a different graph only if faithfulness is violated. In our experiments (Figure 4), this variant acts as a normalized baseline for comparison.

Algorithm 5 Modification of the final loop in Algorithm 1 (Lines 24–25)
1: Create an empty graph $D$ on the node set $V$.
2: for $S_i$ in $\mathcal{S} = [S_1, \dots, S_m]$ do
3:   Add $v - w$ to $D$, for $v, w \in S_i$, iff $v \not\perp w \mid (S_1 \cup \dots \cup S_i) \setminus \{v,w\}$.
4: for $v \in S_i$, $w \in S_j$ with $i < j$ do
5:   Add $v \rightarrow w$ to $D$ iff $v \not\perp w \mid (S_1 \cup \dots \cup S_j) \setminus \{v,w\}$.
6: return $D$

Next, we show that under faithfulness, $D$ is precisely $\mathcal{E}(\mathcal{G})$. We start by proving that if $v - w$ in $D$, then $v$ and $w$ are adjacent in $\mathcal{G}$. If $v - w$ in $D$, we must have $v, w \in S_i$ for some $S_i \in \mathcal{S}$. Denote $S = (S_1 \cup \dots \cup S_i) \setminus \{v,w\}$.
Since $v \not\perp w \mid S$, there exists an active path between $v$ and $w$ given $S$. Assume this path contains more than two nodes. We have that $S \cup \{v,w\}$ is prefix; therefore, there must exist a collider on the path. Any collider on the active path must also be in $S$; however, in that case the nodes adjacent to it on the path are also in $S$ and the path would be inactive. The only remaining case is that the path is a v-structure $v \rightarrow c \leftarrow w$ for some $c \in S$; but since $v, w \in S_i$, this would place the collider $c$ in $S_1 \cup \dots \cup S_i$, and by Lemma 24 we have a contradiction. Therefore, $v \sim w$.

If $v \rightarrow w$ in $D$, we must have $v \in S_i$ and $w \in S_j$ for some $S_i, S_j \in \mathcal{S}$ with $i < j$. Denote $S = (S_1 \cup \dots \cup S_j) \setminus \{v,w\}$. Since $v \not\perp w \mid S$, there exists an active path between $v$ and $w$ given $S$. We have that $S \cup \{v,w\}$ is prefix; therefore, if the path contains more than two nodes, there must exist a collider on it. Any collider on the active path must be in $S$; however, in that case the nodes adjacent to it on the path are also in $S$ and the path would be inactive. The only remaining case is the path $v \rightarrow c \leftarrow w$, but that means $c \in S_j$. If $v \not\sim w$, then $v \rightarrow c \leftarrow w$ is a v-structure and, by Lemma 24, $c \in S_k$ with $k > j$, a contradiction. Since $S_1 \cup \dots \cup S_i$ is prefix, we must have $v \rightarrow w$. By Lemma 26, there are no directed edges in the essential graph of $\mathcal{G}$ other than those in $D$.

Finally, we prove that if $v \sim w$ in $\mathcal{G}$, then $v \sim w$ in $D$, so that $D$ contains all the adjacencies. If $v \sim w$ in $\mathcal{G}$, we have $v \not\perp w \mid W$ for any $W \subset V$. With the intra- and inter-component tests, we then also have $v \sim w$ in $D$.

Appendix G Experimental setup

Hardware. All experiments were conducted on a single device with 1TB of RAM. The conditional independence test employed across all compared algorithms was the partial correlation test (using Fisher's z-transform) at a significance level of $\alpha = 0.05$, unless otherwise specified.
However, to ensure fair comparisons, we used different implementations of this test: the results in Figure 4 use the implementation from the causaldag package (Squires, 2018), while the results in Figures 5, 6 and 13 use the implementation provided by the causal-learn package (Zheng et al., 2024). In both cases, the default implementation of each tester and algorithm is used.

G.1 Linear Gaussian synthetic data

In these experiments, we simulate an observational setting. For each run, 10,000 data samples were generated from a linear structural causal model (Pearl, 2009) with additive Gaussian noise, defined by a randomly generated underlying causal DAG $\mathcal{G}$. Edge weights for the SCM are drawn from $[-1, -0.25] \cup [0.25, 1]$, ensuring they are bounded away from $0$.

G.2 SERGIO

The SERGIO simulation framework, developed by Dibaeinia and Sinha (2020), enables the generation of single-cell transcriptomics data that realistically reflects user-defined gene regulatory networks. We use the DS1 gene regulatory network, also introduced in Dibaeinia and Sinha (2020), which was designed based on known regulatory pathways in E. coli. This network comprises 100 genes and 258 edges. Key structural characteristics include 10 designated regulator genes, which are the only nodes with outgoing edges, and a maximum node out-degree of 76. Such a high maximum degree significantly impacts the runtime of constraint-based algorithms like PC and FCI, which become computationally prohibitive due to their search procedures. Using this DS1 GRN as the ground truth, we simulated gene expression datasets consisting of 2,700 cells. Finally, for the conditional independence tests used by GAS and GAS+, we set the significance level to $\alpha = 10^{-4}$.

G.3 Real-world example

For our real-world application, we use the NASA Airfoil Self-Noise dataset, introduced by Brooks et al. (1989) and available under the CC BY 4.0 license.
For this experiment, we set the significance level to $\alpha = 10^{-4}$.

Appendix H Additional experiments

This section extends our empirical evaluation of GAS and GAS+. We analyze their performance across diverse graph structures, network sizes, and sample sizes to assess their robustness and scalability.

H.1 Increasing neighborhood size in Erdős–Rényi graphs

Figure 13 shows the execution time and Structural Hamming Distance. As shown in Figure 14, GAS and GAS+ significantly reduce the number of conditional independence tests compared to PC, often by more than an order of magnitude. Regarding accuracy, Figure 15 breaks the errors down into extra and missing edges, while Table 1 provides complementary skeleton-level metrics, including precision, recall, and F1-score.

Figure 13: Comparison across Erdős–Rényi graphs on 50 nodes and increasing expected neighborhood size. Results are averaged over 3 runs. (a) Execution time (seconds), presented on a logarithmic scale. (b) Accuracy measured by the Structural Hamming Distance (SHD) between predicted and ground-truth graphs, normalized by the number of possible edges. Shaded regions show the standard deviation.

Figure 14: Number of distinct CI tests performed across Erdős–Rényi graphs of 50 nodes and increasing expected neighborhood size. Results are averaged over 10 runs.

Figure 15: Comparison across Erdős–Rényi graphs on 50 nodes and increasing expected neighborhood size. Results are averaged over 10 runs. (a) Number of missing edges between predicted and ground-truth skeletons. (b) Number of extra edges between predicted and ground-truth skeletons. Shaded regions show the standard deviation.

Table 1: Additional skeleton metrics for GAS+ and PC across Erdős–Rényi graphs of 50 nodes and increasing expected neighborhood size.

Exp. Neigh. Size | Precision (GAS+ / PC) | Recall (GAS+ / PC) | F1-Score (GAS+ / PC)
15               | 0.42 / 0.83           | 0.92 / 0.29        | 0.58 / 0.43
25               | 0.57 / 0.81           | 0.81 / 0.16        | 0.67 / 0.27
35               | 0.73 / 0.87           | 0.77 / 0.12        | 0.75 / 0.21
45               | 0.93 / 0.96           | 0.75 / 0.09        | 0.83 / 0.17

H.2 Increasing number of nodes in Erdős–Rényi graphs

To evaluate scalability with respect to graph size, we conducted experiments on Erdős–Rényi graphs with an increasing number of nodes. Figure 16 reports the execution time and Structural Hamming Distance (SHD), while Figure 17 details the number of distinct conditional independence (CI) tests. It is important to note that for the experiments with a fixed expected neighborhood size (Figure 17(a)), the graphs become effectively sparser as the number of nodes increases. GAS and GAS+ demonstrate superior scaling compared to the standard PC algorithm and MARVEL (Mokhtarian et al., 2025) when the edge probability is held constant (Figure 17(b)). While the hybrid algorithm GSP performs the fewest distinct CI tests, this lower test count does not strictly correspond to lower computational cost, as evidenced by the superior runtime scaling of GAS and GAS+ in Figure 16.

Figure 16: Comparison across Erdős–Rényi graphs of increasing number of nodes and edge probability $p = 0.5$. Results are averaged over 3 runs. (a), (c) Execution time (seconds), presented on a logarithmic scale. (b), (d) Accuracy measured by the Structural Hamming Distance between predicted and ground-truth graphs, normalized by the number of possible edges. Shaded regions show the standard deviation.

Figure 17: Comparison of the number of distinct conditional independence tests across Erdős–Rényi graphs of increasing size. (a) Expected neighborhood size 2. (b) Edge probability $p = 0.5$. Results are averaged over 10 runs; shaded regions indicate the standard deviation.

H.3 Scale-free graphs

Given that many real-world networks exhibit scale-free properties, we evaluate our algorithms on graphs generated using the Barabási–Albert (BA) model.
Table 2 reports the Structural Hamming Distance and execution time for graphs with increasing connectivity, controlled by the BA model's parameter $m$.

Table 2: Comparison across Barabási–Albert graphs of 40 nodes and increasing parameter $m$. Results are averaged over 3 runs.

m  | Exp. Neigh. Size | SHD (PC / GAS / GAS+)    | Execution Time in s (PC / GAS / GAS+)
5  | 8.57             | 105.00 / 380.67 / 283.33 | 1.7576 / 0.0377 / 0.0432
10 | 14.29            | 225.00 / 426.00 / 289.67 | 1.4815 / 0.0085 / 0.0138
15 | 17.14            | 297.00 / 477.33 / 296.33 | 1.7844 / 0.0074 / 0.0192
20 | 17.14            | 305.00 / 450.00 / 294.67 | 3.3219 / 0.0071 / 0.0150

H.4 Varying sample size

Finally, we investigate the impact of sample size on the accuracy of the learned graph structures. We generated datasets of varying sizes from 50-node Erdős–Rényi graphs with an edge probability of $p = 0.5$. Table 3 shows the skeleton Structural Hamming Distance for each method.

Table 3: Skeleton Structural Hamming Distance between the predicted and ground-truth graphs for increasing sample sizes. Data was generated from 50-node Erdős–Rényi graphs with an edge probability of $p = 0.5$. Results are averaged over 3 runs.

Number of Samples | PC     | GAS    | GAS+
50                | 568.33 | 585.00 | 598.00
100               | 567.00 | 579.67 | 573.33
500               | 540.00 | 597.67 | 514.33
1000              | 554.33 | 596.33 | 500.33
5000              | 523.00 | 607.00 | 492.33
10000             | 538.00 | 604.00 | 476.33
100000            | 534.67 | 605.67 | 508.33
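For reference, the partial correlation test with Fisher's z-transform used throughout these experiments can be sketched as follows. This is our own minimal version for a single conditioning variable, operating on pairwise correlations; it is not the causaldag or causal-learn implementation.

```python
import math

def fisher_z_independent(r_xy, r_xz, r_yz, n, alpha=0.05):
    """Partial-correlation CI test with Fisher's z-transform, sketched
    for one conditioning variable z. Inputs are the three pairwise
    sample correlations and the sample size n. Returns True when
    X _||_ Y | Z is NOT rejected at significance level alpha."""
    # first-order partial correlation of X and Y given Z
    r = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
    z = math.atanh(r)                     # Fisher's z-transform
    stat = math.sqrt(n - 1 - 3) * abs(z)  # |Z| = 1 conditioning variable
    # two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))
    return p > alpha
```

With $n = 100$ samples, a pair whose correlation is fully explained by the conditioning variable is retained as independent, while a strong residual correlation is rejected.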