Peter Spirtes, Jiji Zhang

Faithfulness comes in several flavors and is a kind of principle that selects simpler (in a certain sense) over more complicated models. We show how to weaken the assumption of standard faithfulness so that it needs to be applied in fewer circumstances. We show how to weaken the assumption of strong (ε-)faithfulness so that it does not prohibit the existence of weak edges. We show how to modify the causal search algorithms so that they make fewer mind changes as the sample size grows.

[Figure: five graphs over X, Y, Z, W, one labeled True Graph. Structural equations: X = ε_X, Y = ε_Y, Z = bX + cY + ε_Z, W = aZ + ε_W. Independence relations: I_P(W, X | Z) = 0, I_P(W, Y | Z) = 0, I_P(X, Y | ∅) = 0.]

S1. Form the complete undirected graph H on the given set of variables V.
S2. For each pair of variables X and Y in V, search for a subset S of V\{X, Y} such that X and Y are independent conditional on S. Remove the edge between X and Y in H iff such a set is found.
S3. Let K be the graph resulting from S2. For each unshielded triple <X, Y, Z> (i.e., X and Y are adjacent, Y and Z are adjacent, but X and Z are not adjacent), if X and Z are independent conditional on some subset of V\{X, Z} that does not contain Y, then orient the triple as a collider: X → Y ← Z.
S4. Execute the entailed orientation rules.

Causal Markov Assumption: For a set of variables for which there are no unmeasured common causes, each variable is independent of its non-effects conditional on its direct causes.
Non-obvious equivalent formulation: If I_G(X, Y | Z) in a causal DAG G with no unmeasured common causes, then I_P(X, Y | Z) = 0.

Causal Faithfulness Assumption: If I_P(X, Y | Z) = 0, then I_G(X, Y | Z) in the causal DAG G. This is the converse of the Causal Markov Assumption. If I_P(X, Y | Z) is a rational function of the parameters, then violations have Lebesgue measure 0.

Reduction of Underdetermination: If I(A, B | ∅), then prefer A → C ← B to A → C → B.
Computational Efficiency: If A – C – B and I(A, B | ∅), then there is no need to check I(A, B | C).
Statistical Efficiency: The Markov equivalence class can be found without testing independence conditional on any set with more members than the maximum degree of any variable in the true causal graph.

If causal sufficiency, the Causal Markov Assumption, and the Causal Faithfulness Assumption hold, then there exist pointwise consistent estimators of the Markov equivalence class: SGS, PC, GES (Gaussian, multinomial).
If only the Causal Markov Assumption and causal sufficiency are assumed, there are no pointwise consistent estimators of the Markov equivalence class (Gaussian, multinomial, unrestricted).

If causal sufficiency, the Causal Markov Assumption, and the Causal Faithfulness Assumption hold, then there is no uniformly consistent estimator of the Markov equivalence class (Gaussian, multinomial, unrestricted).

(A4: ε-faithfulness) The partial correlations between X(i) and X(j) given {X(r); r ∈ k} for some set k ⊆ {1,…,p_n}\{i, j} are denoted by ρ_{n;i,j|k}. Their absolute values are bounded from below and above:

$$\inf\{\, |\rho_{i,j|k}| \;;\; i, j, k \text{ with } \rho_{i,j|k} \neq 0 \,\} \geq c_n, \qquad c_n^{-1} = O(n^{d}) \text{ for some } 0 < d < b/2,$$
$$\sup_{n;\, i,j,k} |\rho_{i,j|k}| \leq M < 1,$$

where 0 < b ≤ 1 is as in (A3).

Assume (A1)-(A4). Denote by $\hat{G}_{skel,n}(\alpha_n)$ the estimate from the (first part of the) SGS algorithm and by $G_{skel,n}$ the true skeleton from the DAG $G_n$. Then there exists $\alpha_n \to 0$ $(n \to \infty)$ such that

$$P[\hat{G}_{skel,n}(\alpha_n) = G_{skel,n}] = 1 - O(\exp(-C n^{1-2d})) \to 1 \quad (n \to \infty)$$

for some 0 < C < ∞, where d > 0 is as in (A4).

Uhler et al.: (A4) tends to be violated fairly often if the parameter values are assigned randomly and ε is not very small. There are two ways to get very small partial correlations: almost cancellations and very weak edges. (A4) forbids both, so it entails that there are no very weak edges.
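The two routes to a very small partial correlation mentioned above can be illustrated numerically. The following is a minimal sketch in plain Python, assuming a linear model with independent unit-variance errors; the helper names `implied_cov` and `partial_corr` and all coefficient values are illustrative choices, not taken from the slides.

```python
import math

def implied_cov(coef, noise_var):
    """Covariance matrix of a linear model whose variables are in causal order.

    coef[i] maps a parent index p (p < i) to the coefficient of X_p in the
    equation for X_i; noise_var[i] is the variance of the independent error.
    """
    n = len(coef)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            cov[i][j] = cov[j][i] = sum(b * cov[p][j] for p, b in coef[i].items())
        cov[i][i] = noise_var[i] + sum(b * cov[i][p] for p, b in coef[i].items())
    return cov

def partial_corr(cov, i, j, cond=()):
    """Partial correlation of X_i and X_j given cond, via the standard recursion."""
    if not cond:
        return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])
    k, rest = cond[0], tuple(cond[1:])
    r_ij = partial_corr(cov, i, j, rest)
    r_ik = partial_corr(cov, i, k, rest)
    r_jk = partial_corr(cov, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Very weak edge: X -> Y with coefficient 0.01 (hypothetical value).
weak = implied_cov([{}, {0: 0.01}], [1.0, 1.0])
rho_weak = partial_corr(weak, 0, 1)

# Almost cancellation: X -> Y -> Z with coefficients 1.0 and a direct edge
# X -> Z with coefficient -0.99, so the two paths nearly cancel (hypothetical values).
cancel = implied_cov([{}, {0: 1.0}, {0: -0.99, 1: 1.0}], [1.0, 1.0, 1.0])
rho_cancel = partial_corr(cancel, 0, 2)
rho_cancel_given_y = partial_corr(cancel, 0, 2, (1,))

print(rho_weak, rho_cancel, rho_cancel_given_y)
```

In the second model no edge is weak, yet the marginal correlation of X and Z is about as small as in the weak-edge model, while the correlation given Y is large. A lower bound on all nonzero partial correlations rules out both situations at once.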
[Figure: search output at four sample sizes (Small Sample, Medium- Sample, Medium+ Sample, Large Sample) for graphs over X, Y, Z, W. Independence judgments listed for each panel: I_P(W, {X, Y} | Z), I_P(W, {X, Y} | ∅), I_P(X, Y | ∅), I_P(W, Z | ∅).]

[Figure: chain example. Graphs shown: X → Y → Z → W, X – Y – Z – W, and X – Y – Z → W. Panel labels: True Graph, Small Sample, Large Sample. Independence relations listed: I_P(X, Z | Y), I_P(Y, W | {X, Z}), I_P(X, W | ∅).]

S3*. Let K be the undirected graph resulting from S2. For each unshielded triple <X, Y, Z>:
If X and Z are not independent conditional on any subset of V\{X, Z} that contains Y, then orient the triple as a collider: X → Y ← Z.
If X and Z are not independent conditional on any subset of V\{X, Z} that does not contain Y, then mark the triple as a non-collider.
Otherwise, mark the triple as ambiguous (or unfaithful).

Adjacency-Faithfulness: If X – Y in the causal DAG, then I_P(X, Y | Z) ≠ 0 for any Z.
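The conservative orientation rule S3* can be sketched as follows. This is a minimal illustration, assuming a hypothetical independence oracle `indep(a, b, s)` and taking the conditioning sets to be the subsets of V\{X, Z}, the usual choice for an unshielded triple <X, Y, Z>. The function names and the toy oracles are assumptions of this sketch, and the exhaustive subset search is only practical for small V.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def classify_triple(x, y, z, variables, indep):
    """Classify an unshielded triple <x, y, z> as in S3*.

    indep(a, b, s) is a hypothetical oracle returning True when a and b are
    judged independent conditional on the set s.
    """
    others = [v for v in variables if v not in (x, z)]
    sep_with_y = any(indep(x, z, s) for s in subsets(others) if y in s)
    sep_without_y = any(indep(x, z, s) for s in subsets(others) if y not in s)
    if not sep_with_y:
        return "collider"        # no separating set contains y
    if not sep_without_y:
        return "non-collider"    # no separating set omits y
    return "ambiguous"           # separating sets of both kinds exist

V = ["X", "Y", "Z"]
# Toy oracles: each lists the conditioning sets that separate X and Z.
collider_oracle = lambda a, b, s: s == frozenset()
chain_oracle = lambda a, b, s: s == frozenset({"Y"})
unfaithful_oracle = lambda a, b, s: s in (frozenset(), frozenset({"Y"}))

print(classify_triple("X", "Y", "Z", V, collider_oracle))
print(classify_triple("X", "Y", "Z", V, chain_oracle))
print(classify_triple("X", "Y", "Z", V, unfaithful_oracle))
```

The third oracle mimics a distribution in which X and Z are independent both marginally and given Y. The standard rule would orient on the first separating set it finds, whereas the conservative rule withholds judgment.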
Triangle-Faithfulness: For any three variables X, Y, Z that form a triangle in the causal DAG G:
If Z is a non-collider on the path <X, Z, Y>, then X and Y are not independent conditional on any subset of V\{X, Y} that does not contain Z;
If Z is a collider on the path <X, Z, Y>, then X and Y are not independent conditional on any subset of V\{X, Y} that contains Z.
Suppose X → Y ← Z and I_P(X, Z | Y) = 0. This distribution is faithful to X → Y → Z. Such a violation cannot be detected, so Triangle-Faithfulness must be assumed.

[Figure: two graphs, one over X, Z, Y and one over X, Z, Y, W, with two lists of dependencies. First list: ¬I(X, Z | ∅), ¬I(X, Y | Z), ¬I(Y, Z | ∅). Second list: ¬I(X, Z | ∅), ¬I(Y, Z | ∅), ¬I(X, Y | Z), ¬I(X, W | ∅), ¬I(Y, W | ∅), ¬I(Z, W | ∅), ¬I(X, Z | W), ¬I(X, Z | Y, W), ¬I(Y, Z | W), ¬I(Y, Z | X, W), ¬I(X, Y | W), ¬I(X, Y | Z, W), ¬I(X, W | Z), ¬I(X, W | Y), ¬I(Y, W | X), ¬I(Y, W | Z), ¬I(Z, W | X), ¬I(Z, W | Y).]

Causal Minimality: The population distribution is not Markov to any proper subDAG of the true causal DAG. Causal Minimality is entailed by the manipulation definition of causation if the distribution is positive. There is a weaker kind of causal minimality, P-minimality: the population distribution is not Markov to any DAG that entails a proper superset of the conditional independence relations. Is this sufficient for the correctness of VCSGS?

[Figure: chain example revisited. Graphs shown: X → Y → Z → W, X – Y – Z – W, and X – Y – Z → W. Panel labels: True Graph, Small Sample, Large Sample. Independence relations listed: I_P(X, Z | Y), I_P(Y, W | {X, Z}), I_P(X, W | ∅).]

V1. Form the complete undirected graph H on the given set of variables V.
V2. For each pair of variables X and Y in V, search for a subset S of V\{X, Y} such that X and Y are independent conditional on S. Remove the edge between X and Y in H and mark the pair <X, Y> as ‘apparently non-adjacent’, if and only if such a set is found.
V3. Let K be the graph resulting from V2.
For each apparently unshielded triple <X, Y, Z> (i.e., X and Y are adjacent, Y and Z are adjacent, but X and Z are apparently non-adjacent):
If X and Z are not independent conditional on any subset of V\{X, Z} that contains Y, then orient the triple as a collider: X → Y ← Z.
If X and Z are not independent conditional on any subset of V\{X, Z} that does not contain Y, then mark the triple as a non-collider.
Otherwise, mark the triple as ambiguous (or unfaithful), and mark the pair <X, Z> as ‘definitely non-adjacent’.
V4. Execute the same orientation rules as in S4, until none of them applies.
V5. Let M be the graph resulting from V4. For each consistent disambiguation of the ambiguous triples in M (i.e., each disambiguation that leads to a pattern), test whether each vertex V in the resulting pattern satisfies the Markov condition. If V and W satisfy the Markov condition in every pattern, then mark the ‘apparently non-adjacent’ pair <V, W> as ‘definitely non-adjacent’.

[Figure: relationships among Faithfulness, Adjacency-Faithfulness, Triangle-Faithfulness, and P-Minimality.]

If the Triangle-Faithfulness Assumption, the Causal Minimality Assumption, and the Causal Markov Assumption hold, then VCSGS is a consistent estimator of the extended Markov equivalence class. Is it complete?

V5*. Let M be the graph resulting from V4. For each consistent disambiguation of the ambiguous triples in M (i.e., each disambiguation that leads to a pattern), test whether each vertex V in the resulting pattern satisfies the Markov condition. If V and W satisfy the Markov condition in some pattern, then mark the ‘apparently non-adjacent’ pair <V, W> as ‘definitely non-adjacent’.
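Steps V1 and V2 of the algorithm above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical independence oracle `indep(a, b, s)`; the function name `adjacency_phase` and the hand-listed oracle for the chain X → Y → Z → W are assumptions of this sketch. The search over all subsets is exhaustive and therefore only feasible for small variable sets.

```python
from itertools import combinations

def adjacency_phase(variables, indep):
    """V1-V2: start from the complete graph and drop separable pairs.

    indep(a, b, s) is a hypothetical oracle for "a and b are independent
    conditional on s". Returns the remaining edges and the pairs marked
    'apparently non-adjacent'.
    """
    variables = list(variables)
    edges = {frozenset(p) for p in combinations(variables, 2)}  # V1
    apparently_non_adjacent = set()
    for x, y in combinations(variables, 2):                     # V2
        others = [v for v in variables if v not in (x, y)]
        found = any(
            indep(x, y, frozenset(s))
            for r in range(len(others) + 1)
            for s in combinations(others, r)
        )
        if found:
            edges.discard(frozenset((x, y)))
            apparently_non_adjacent.add(frozenset((x, y)))
    return edges, apparently_non_adjacent

# Hand-listed separating sets for the chain X -> Y -> Z -> W (illustrative oracle).
CHAIN_FACTS = {
    frozenset("XZ"): (frozenset("Y"), frozenset("YW")),
    frozenset("XW"): (frozenset("Y"), frozenset("Z"), frozenset("YZ")),
    frozenset("YW"): (frozenset("Z"), frozenset("XZ")),
}

def chain_indep(a, b, s):
    return s in CHAIN_FACTS.get(frozenset((a, b)), ())

edges, non_adjacent = adjacency_phase("XYZW", chain_indep)
print(sorted("".join(sorted(e)) for e in edges))
```

The marking is kept separate from the edge set because the later steps may upgrade an ‘apparently non-adjacent’ pair to ‘definitely non-adjacent’ without touching the graph itself.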
Assumption NVV(J):
$$\inf_{X_i \in V} \mathrm{var}_M(X_i \mid V \setminus \{X_i\}) \geq J \quad \text{for some (small) } J > 0$$
Assumption UBC(C):
$$\sup_{X_i, X_j \in V,\; W \subseteq V \setminus \{X_i, X_j\}} |\rho_M(X_i, X_j \mid W)| \leq C \quad \text{for some } C < 1$$

k-Triangle-Faithfulness Assumption: Given a set of variables V, suppose the true causal model over V is M = <P, G>, where P is a Gaussian distribution over V, and G is a DAG with vertices V. For any three variables X, Y, Z that form a triangle in G (i.e., each pair of vertices is adjacent):
If Y is a non-collider on the path <X, Y, Z>, then |ρ(X, Z | W)| ≥ k |e_M(X – Z)| for all W ⊆ V that do not contain Y; and
If Y is a collider on the path <X, Y, Z>, then |ρ(X, Z | W)| ≥ k |e_M(X – Z)| for all W ⊆ V that do contain Y.

S3* (sample version). Let K be the undirected graph resulting from the adjacency phase. For each unshielded triple <X, Y, Z>:
If there is a set W not containing Y such that the test of ρ(X, Z | W) = 0 returns 0 (i.e., accepts the hypothesis), and for every set U that contains Y, the test of |ρ(X, Z | U)| = 0 returns 1 (i.e., rejects the hypothesis) and the test of |ρ(X, Z | U) – ρ(X, Z | W)| ≥ L returns 0 (i.e., accepts the hypothesis), then orient the triple as a collider: X → Y ← Z.
If there is a set W containing Y such that the test of ρ(X, Z | W) = 0 returns 0 (i.e., accepts the hypothesis), and for every set U that does not contain Y, the test of |ρ(X, Z | U)| = 0 returns 1 (i.e., rejects the hypothesis) and the test of |ρ(X, Z | U) – ρ(X, Z | W)| ≥ L returns 0 (i.e., accepts the hypothesis), then mark the triple as a non-collider.
Otherwise, mark the triple as ambiguous.

Say that CSGS(L, n, M) errs if it contains (i) an adjacency not in G_M; or (ii) a marked non-collider not in G_M; or (iii) an orientation not in G_M.
Theorem: Given causal sufficiency of the measured variables V, the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the CSGS algorithm is uniformly consistent in the sense that
$$\lim_{n \to \infty} \sup_{M \in \psi_{k,J,C}} P_M^n(\mathrm{CSGS}(L, n, M) \text{ errs}) = 0$$

Estimation Algorithm:
For each vertex Z:
If every vertex not adjacent to Z is not confirmed to be non-adjacent to Z, return ‘Unknown’ for every edge containing Z;
else:
For every non-adjacent pair <Y, Z> in EP(G), let the estimate be 0.
For each vertex Z such that all of the edges containing Z are oriented in EP(G), if Y is a parent of Z in EP(G), let the estimate be the sample regression coefficient of Y in the regression of Z on its parents in EP(G).

Let M1 be an output of the Estimation Algorithm, and M2 be a causal model. We define the structural coefficient distance, d[M1, M2], between M1 and M2 to be
$$d[M_1, M_2] = \max_{i,j} \left| \hat{e}_{M_1}(X_i \to X_j) - e_{M_2}(X_i \to X_j) \right|$$
where by convention $\left| \hat{e}_{M_1}(X_i \to X_j) - e_{M_2}(X_i \to X_j) \right| = 0$ if $\hat{e}_{M_1}(X_i \to X_j)$ = “Unknown”.

E1. Run the CSGS algorithm on an i.i.d. sample of size n from P_M.
E2. Let the output from E1 be CSGS(L, n, M). Apply step V5 in the VCSGS algorithm (from section 3), using tests of zero partial correlations, and record which non-adjacencies are confirmed.
E3. Apply the Estimation Algorithm to CSGS(L, n, M), the confirmed non-adjacencies, and the sample of size n.

Theorem: Given causal sufficiency of the measured variables V, the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the Edge Estimation I algorithm is uniformly consistent in the sense that for every δ > 0
$$\lim_{n \to \infty} \sup_{M \in \psi_{k,J,C}} P_M^n(d[\hat{O}(M), M] > \delta) = 0$$
For a large enough and dense enough graph, this still allows for the possibility of large manipulation errors (due to many small edge errors).

[Figure: models over X1, X2, X3 with the values 1.0, 0.01, 0.7877781, 1.0, 0.612157, 1.0.]

If k > 0.014, then the k-Triangle-Faithfulness Assumption is violated for models M2 and M3, but not for M1.
If 0.008 < k < 0.014, then the k-Triangle-Faithfulness Assumption is violated for model M3, but not for M1 or M2.

E1. Run Edge Estimation Algorithm I.
E2. Set ForbiddenOrientations = {}.
E3. For each maximal clique in CSGS(L, n, M) such that if a vertex in the clique is not adjacent to some vertex not in the clique, it is definitely non-adjacent:
(i) For each possible orientation O of all of the unoriented edges in the maximal clique:
Apply the orientation O to each of the unoriented edges.
Apply Meek’s orientation rules.
If application of the rules produces a cycle or a new unshielded collider, add O to ForbiddenOrientations.
Add O to ForbiddenOrientations if, for any Y and W such that Y is a non-collider on the path <X, Y, Z>, and W ⊆ V and does contain Y,
$$\varphi_n\big( k\,|\hat{e}_O(X - Z)| - |\hat{\rho}(X, Z \mid W)| > L \big) = 0$$
E4. For each unoriented edge X – Y in CSGS(L, n, M), if there is only one orientation X → Y that does not occur in ForbiddenOrientations, and Y is definitely non-adjacent to every vertex that it is not adjacent to, orient the edge as X → Y.
E5. For each vertex V such that some edge containing V in CSGS(L, n, M) is not oriented, if there is only one orientation of all of the edges containing V that is not in ForbiddenOrientations, and V is definitely non-adjacent to every vertex that it is not adjacent to, let the estimate of each edge be the sample regression coefficient of V on its parents in the non-forbidden orientation.

Theorem: Given causal sufficiency of the measured variables V, the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the Edge Estimation II algorithm is uniformly consistent in the sense that for every δ > 0
$$\lim_{n \to \infty} \sup_{M \in \psi_{k,J,C}} P_M^n(O(L, n, M) \text{ errs}) = 0$$
$$\lim_{n \to \infty} \sup_{M \in \psi_{k,J,C}} P_M^n(d[\hat{O}(M), M] > \delta) = 0$$
where O(L, n, M) is the graphical output of the Edge Estimation II algorithm, and Ô(M) is the output of the Edge Estimation II algorithm.
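The structural coefficient distance d used in the consistency statements can be sketched as follows. This is a minimal illustration, assuming edge coefficients are stored in dictionaries keyed by (cause, effect) pairs and that an edge absent from a dictionary has coefficient 0. The function name, the data layout, and the numbers are assumptions of this sketch.

```python
UNKNOWN = "Unknown"

def coefficient_distance(estimated, true_coef):
    """Largest absolute gap between estimated and true edge coefficients.

    estimated maps (cause, effect) to a number or to UNKNOWN; true_coef maps
    (cause, effect) to a number. An edge missing from a dictionary is treated
    as having coefficient 0 (an assumption of this sketch). By the convention
    in the definition, an UNKNOWN estimate contributes 0 to the maximum.
    """
    worst = 0.0
    for edge in set(estimated) | set(true_coef):
        est = estimated.get(edge, 0.0)
        if est == UNKNOWN:
            continue  # "don't know" is never counted as an error
        worst = max(worst, abs(est - true_coef.get(edge, 0.0)))
    return worst

# Hypothetical estimate and model, for illustration only.
estimate = {("X", "Y"): 0.5, ("Y", "Z"): UNKNOWN}
model = {("X", "Y"): 0.45, ("Y", "Z"): 2.0, ("X", "Z"): 0.01}

print(coefficient_distance(estimate, model))
```

Here the missed weak edge X → Z costs only its own small coefficient, and the “Unknown” answer for Y → Z costs nothing, which is how the distance tolerates both missing weak edges and abstentions.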
We weakened the assumption of faithfulness so that fewer inferences from conditional independence to d-separation need to be made. We strengthened the assumption so that it allows one to make inferences from “almost independence” in a probability distribution to d-separation in a causal graph, allowing for the existence of uniformly consistent estimation algorithms.

We changed the concept of correctness to allow for missing weak edges and for saying “don’t know” about some features of Markov equivalence classes. The new simplicity assumption broke up the Markov equivalence class in the sense that it considers some models in a Markov equivalence class simpler than other models in the same Markov equivalence class. This allowed for uniformly consistent estimates of linear coefficients in a causal model, as well as of causal structure.

Can we get similar results for PC, for FCI, for non-linear models, and for increasing numbers of variables and vertex degree and decreasing k (analogous to Kalisch and Bühlmann)?
If parameter values are randomly assigned, how often is k-Triangle-Faithfulness violated as a function of sample size, clique size, parameter distribution, and k?

Kalisch, M., and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8, 613–636.
Spirtes, P., and Zhang, J. (forthcoming). A Uniformly Consistent Estimator of Causal Effects Under the k-Triangle-Faithfulness Assumption. Statistical Science.
Spirtes, P., and Zhang, J. (submitted). Three Faces of Faithfulness. Synthese.