CAUSAL SEARCH IN THE REAL WORLD
A menu of topics

Some real-world challenges:
 Convergence & error bounds
 Sample selection bias
 Simpson’s paradox

Some real-world successes:
 Learning based on more than just independence
 Learning about latents & their structure
Short-run causal search

Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies
 Of course, this is a problem for all inference: we might just be really unlucky
 Note: This is not (really) the problem of unrepresentative samples (e.g., black swans)
Convergence in search

In search, we would like to bound our possible error as we acquire data
 I.e., we want search procedures that have uniform convergence

Without uniform convergence,
 Cannot set confidence intervals for inference
 Not every Bayesian, regardless of priors over hypotheses, agrees on probable bounds, no matter how loose
Pointwise convergence


Assume hypothesis H is true
Then:
 For any standard of “closeness” to H, and
 For any standard of “successful refutation,”
 For every hypothesis that is not “close” to H, there is a sample size for which that hypothesis is refuted
Uniform convergence


Assume hypothesis H is true
Then:
 For any standard of “closeness” to H, and
 For any standard of “successful refutation,”
 There is a sample size such that for all hypotheses H* that are not “close” to H, H* is refuted at that sample size
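The only difference between the two notions is the order of the quantifiers. Schematically (with δ standing for the closeness standard and α for the refutation standard; these symbols are introduced here for illustration, not taken from the slides):

```latex
% Pointwise: the sufficient sample size may depend on the alternative H*
\forall \delta, \alpha \;\; \forall H^{*} \text{ not } \delta\text{-close to } H \;\; \exists n :\;
  H^{*} \text{ is refuted at level } \alpha \text{ by samples of size } \geq n

% Uniform: a single sample size works for every alternative simultaneously
\forall \delta, \alpha \;\; \exists n \;\; \forall H^{*} \text{ not } \delta\text{-close to } H :\;
  H^{*} \text{ is refuted at level } \alpha \text{ by samples of size } \geq n
```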
Two theorems about convergence

There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993)

There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)
Two theorems about convergence


What if we didn’t care about “small” causes?
 ε-Faithfulness: If X & Y are d-connected given S, then |ρ_XY.S| > ε
 Every association predicted by d-connection is at least ε in magnitude
 For any ε, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness
 So we have error bounds, confidence intervals, etc.
Sample selection bias

Sometimes, a variable of interest is a cause of whether people get in the sample
 E.g., measuring various skills or knowledge in college students
 Or measuring joblessness by a phone survey during the middle of the day

Simple problem: You might get a skewed picture of the population
Sample selection bias

If two variables matter, then we have the graph: Factor A → Sample ← Factor B
 Sample = 1 for everyone we measure
 That is equivalent to conditioning on Sample
 ⇒ Induces an association between A and B! (see the simulation sketch below)
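A minimal simulation sketch of this effect (the variable names and the logistic selection rule are illustrative assumptions, not from the slides): A and B are independent in the population, but conditioning on being sampled induces an association between them.

```python
# Sketch: A and B are independent, but both raise the chance of being
# sampled. Restricting attention to the sampled units (conditioning on
# "Sample") induces a spurious association between A and B.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

A = rng.normal(size=n)
B = rng.normal(size=n)                     # independent of A by construction
p_sample = 1 / (1 + np.exp(-(A + B)))      # both A and B cause selection
sampled = rng.random(n) < p_sample

print(f"corr(A, B), full population: {np.corrcoef(A, B)[0, 1]: .3f}")  # ~ 0
print(f"corr(A, B), sampled only:    "
      f"{np.corrcoef(A[sampled], B[sampled])[0, 1]: .3f}")             # clearly negative
```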
Simpson’s Paradox

Consider the following data:

Men
            Treated   Untreated
  Alive         3        20
  Dead          3        24
  P(A | T) = 0.5,   P(A | U) = 0.45…

Women
            Treated   Untreated
  Alive        16         3
  Dead         25         6
  P(A | T) = 0.39,  P(A | U) = 0.333

Treatment is superior in both groups!
Simpson’s Paradox

Consider the following data:

Pooled
            Treated   Untreated
  Alive        19        23
  Dead         28        30
  P(A | T) = 0.404,  P(A | U) = 0.434

In the “full” population, you’re better off not being Treated!
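The reversal can be checked directly from the counts on the slides (a quick sketch):

```python
# Recompute the slides' conditional probabilities from the raw counts.
# counts[group][treatment] = (alive, dead)
counts = {
    "men":   {"treated": (3, 3),   "untreated": (20, 24)},
    "women": {"treated": (16, 25), "untreated": (3, 6)},
}

def p_alive(alive, dead):
    return alive / (alive + dead)

# Within each group, treatment looks better...
for group, by_tx in counts.items():
    for tx, (alive, dead) in by_tx.items():
        print(f"{group:6s} {tx:9s} P(Alive) = {p_alive(alive, dead):.3f}")

# ...but pooling the groups flips the comparison (Simpson's paradox).
for tx in ("treated", "untreated"):
    alive = sum(counts[g][tx][0] for g in counts)
    dead = sum(counts[g][tx][1] for g in counts)
    print(f"pooled {tx:9s} P(Alive) = {p_alive(alive, dead):.3f}")
```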
Simpson’s Paradox

Berkeley Graduate Admissions case
More than independence

Independence & association can reveal only the Markov equivalence class
 But our data contain more statistical information!
 Algorithms that exploit this additional info can sometimes learn more (including unique graphs)
 Example: LiNGAM algorithm for non-Gaussian data
Non-Gaussian data


Assume linearity & independent non-Gaussian noise
Linear causal DAG functions are:
  D = BD + ε
 where B is permutable to lower triangular (because the graph is acyclic)
Non-Gaussian data


Assume linearity & independent non-Gaussian noise
Linear causal DAG functions are:
  D = Aε
 where A = (I – B)⁻¹
Non-Gaussian data


Assume linearity & independent non-Gaussian noise
Linear causal DAG functions are:
  D = Aε
 where A = (I – B)⁻¹
 ICA is an efficient estimator for A
 ⇒ Efficient causal search that reveals direction!
 C ⟶ E iff non-zero entry in A
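A small numerical illustration (the chain and its coefficients are chosen here for the example, not taken from the slides): for X1 → X2 → X3, B is strictly lower triangular in the causal order, and the non-zero entries of A = (I – B)⁻¹ pick out exactly the cause-to-effect (possibly indirect) pairs.

```python
# Sketch: B holds the direct linear effects for the chain X1 -> X2 -> X3.
# Under the causal order, B is strictly lower triangular, and
# A = (I - B)^{-1} has a non-zero (i, j) entry exactly when X_j is a
# (possibly indirect) cause of X_i.
import numpy as np

B = np.array([
    [0.0, 0.0, 0.0],   # X1 has no parents
    [0.7, 0.0, 0.0],   # X2 <- 0.7 * X1
    [0.0, 0.5, 0.0],   # X3 <- 0.5 * X2
])

A = np.linalg.inv(np.eye(3) - B)
print(A)
# prints (approximately):
# [[1.   0.   0.  ]
#  [0.7  1.   0.  ]
#  [0.35 0.5  1.  ]]
# The 0.35 entry = 0.7 * 0.5 reflects the indirect X1 -> X3 effect.
```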
Non-Gaussian data

Why can we learn the directions in this case?
[Scatter plots of A vs. B under Gaussian noise and under uniform noise]
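A minimal sketch of why the non-Gaussian case is identifiable (a pairwise check with a crude dependence score, not the full ICA-based LiNGAM procedure; names and the score are illustrative assumptions): with uniform noise, only the true direction leaves residuals independent of the regressor.

```python
# Sketch: for a linear model C -> E with non-Gaussian (uniform) noise,
# regressing in the correct direction leaves residuals independent of the
# regressor; regressing in the wrong direction does not. With Gaussian
# noise, both directions would pass this check, so the direction could not
# be identified from the joint distribution alone.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

C = rng.uniform(-1, 1, n)
E = 0.8 * C + rng.uniform(-1, 1, n)          # true model: C -> E

def residual(y, x):
    """Residual of the least-squares regression of y on x."""
    return y - (np.cov(x, y, bias=True)[0, 1] / np.var(x)) * x

def dep_score(u, v):
    """Crude nonlinear dependence check: correlation of squared values."""
    return abs(np.corrcoef(u ** 2, v ** 2)[0, 1])

print(f"C -> E model: {dep_score(C, residual(E, C)):.3f}")   # near 0
print(f"E -> C model: {dep_score(E, residual(C, E)):.3f}")   # clearly > 0
```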
Non-Gaussian data

Case study: European electricity cost
Learning about latents

Sometimes, our real interest…
 is in variables that are only indirectly observed
 or observed by their effects
 or unknown altogether but influencing things behind the scenes

[Example graph with latent factors (General IQ, Sociability, Other factors) and observed measures (Test score, Math skills, Reading level, Size of social network)]
Factor analysis


Assume linear equations
Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error
Factor analysis
If we have one factor, then we find coefficients to minimize error in:
  F_i = a_i + b_i U
 where U is the unobserved variable (with fixed mean and variance)

Two factors ⇒ Minimize error in:
  F_i = a_i + b_i,1 U_1 + b_i,2 U_2
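A minimal one-factor sketch on simulated data (the loadings and the scikit-learn call are illustrative assumptions, not from the slides): simulate four features driven by a single latent, fit a one-factor model, and compare the estimated coefficients to the true b_i.

```python
# Sketch: simulate F_i = b_i * U + error for one latent U, then estimate
# the loadings b_i with a one-factor model.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 50_000

U = rng.normal(size=n)                       # unobserved factor (mean 0, var 1)
true_b = np.array([0.9, 0.7, 0.5, 0.3])      # loadings chosen for the example
noise = rng.normal(scale=0.4, size=(n, 4))
F = U[:, None] * true_b + noise              # observed features (a_i = 0 here)

fa = FactorAnalysis(n_components=1).fit(F)
print("estimated loadings (up to sign):", fa.components_.ravel().round(2))
print("true loadings:                  ", true_b)
```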
Factor analysis


Decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff
Also, the interpretation of the unobserved factors must be provided by the scientist
 The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”)
Factor analysis as graph search

One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:
  [Graph: U → F1, U → F2, …, U → Fn]
Factor analysis as graph search

Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:
  [Graph: U1 and U2 each pointing into F1, F2, …, Fn]
Better methods for latents

Two different types of algorithms:
 1. Determine which observed variables are caused by shared latents
     BPC, FOFC, FTFC, …
 2. Determine the causal structure among the latents
     MIMBuild

Note: need additional parametric assumptions
 Usually linearity, but can do it with weaker info
Discovering latents

Key idea: For many parameterizations, association between X & Y can be decomposed
 Linearity ⇒ can use patterns in the precise associations to discover the number of latents
 Using the ranks of different sub-matrices of the covariance matrix
Discovering latents
[Graph: a single latent U with observed indicators A, B, C, D]
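For the one-latent graph above, the constraint pattern is concrete: every covariance factors as cov(X, Y) = b_X·b_Y·var(U), so products such as cov(A,B)·cov(C,D) and cov(A,C)·cov(B,D) must agree (vanishing tetrads). A small simulated check (loadings chosen here for illustration):

```python
# Sketch: with a single latent U driving A, B, C, D, the "tetrad"
# differences vanish, e.g. cov(A,B)*cov(C,D) - cov(A,C)*cov(B,D) = 0.
# Additional latent structure breaks some of these constraints, which is
# what rank-based latent discovery exploits.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

U = rng.normal(size=n)
loadings = {"A": 1.0, "B": 0.8, "C": 0.6, "D": 0.9}   # illustrative values
X = {v: lam * U + rng.normal(scale=0.5, size=n) for v, lam in loadings.items()}

def c(x, y):
    return np.cov(X[x], X[y])[0, 1]

t1 = c("A", "B") * c("C", "D") - c("A", "C") * c("B", "D")
t2 = c("A", "B") * c("C", "D") - c("A", "D") * c("B", "C")
print(f"tetrad differences: {t1:.4f}, {t2:.4f}")      # both near 0
```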
Discovering latents
[Graph: observed A, B, C, D with a second latent L in addition to U]
Discovering latents


Many instantiations of this type of search for
different parametric knowledge, # of observed
variables (⇒ # of discoverable latents), etc.
And once we have one of these “clean” models, can
use “traditional” search algorithms (with modifications) to
learn structure between the latents
Other Algorithms

 CCD: Learn DCG (with non-obvious semantics)
 ION: Learn global features from overlapping local sets (including between not co-measured variables)
 SAT-solver: Learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints
 LoSST: Learn causal structure while that structure potentially changes over time
 And lots of other ongoing research!
Tetrad project

http://www.phil.cmu.edu/projects/tetrad/current.html
Download