Reasoning about Uncertainty in High-dimensional Data Analysis
Adel Javanmard, Stanford University

What is high-dimensional data?
• Modern data sets are both massive and fine-grained: # features (variables) > # observations (samples).
• A trend in modern data analysis.

High-Dimensional Data: an example
• Medical images
• Transcript records: age, gender, BMI, heart rate, billing information, …
• Allergies: type, reaction, severity, start year, stop year, …
• Diagnosis info: ICD9 codes, description, start year, stop year, …
• Medications: name, strength, schedule, …
• Lab results: HL7 text, value, abnormality, observation year, …

What can we do with such data?
• Extract useful, actionable information (Health Care Reform, Heritage Health Prize).
• Predictive models for clinical outcomes, patient evolution, readmission rate, HITECH, …
• Design (or advise) treatment, clinical interventions and trials.
• More than 71 million persons are admitted to hospitals each year.
• Over $30 billion was spent on unnecessary hospital readmissions (2006).

Diabetes Example
• n = 500 (patients)
• p = 805 (variables): medical information such as medications, lab results, diagnoses, … [Data from Practice Fusion, posted on Kaggle]
• Find significant variables in predicting type 2 diabetes.
• "People with higher bilirubin are more susceptible to diabetes." How certain are we about this claim?

Problem of Uncertainty Assessment
[Figure: estimated parameters plotted against their indices.]
• How stable are these estimates? What can we say about the true parameters?

Confidence intervals
[Figure: confidence intervals for the parameters plotted against their index; the blood-pressure coefficient is highlighted.]

Why is it hard?
• Low-dimensional regime (p fixed, n → ∞): large-sample theory applies.
• The situation in the high-dimensional regime is very different!

• Much progress has been achieved for high-dimensional parameter estimation, high-dimensional variable/feature selection (recovering Supp(θ0)), and high-dimensional prediction.
[Tibshirani, Donoho, Cai, Zhou, Candès, Tao, Bickel, van de Geer, Ritov, Bühlmann, Meinshausen, Zhao, Yu, Wainwright, …]

How can we assign measures of uncertainty to each single parameter?

Other examples
• Targeted online advertising
• Personalized medicine
• Genomics
• Social networks
• Collaborative filtering

Overview of Regularized Estimators

Regularized Estimators
• Investigate low-dimensional structures in data: minimize Loss + λ · Model Complexity.
• Mitigates spurious correlations, noise accumulation, and instability (to noise and sampling).
• This comes at a price: the estimator is biased (towards small complexity), nonlinear, and non-explicit.

Diabetes Example
• y_i = 1 if patient i gets type-2 diabetes; x_i: variables of patient i; θ_{0,j}: contribution of feature j.
• θ̂ = argmin_θ { logistic loss(θ) + λ ‖θ‖_1 }
• Convex optimization.
• Variable selection (some of the θ̂_j equal 0).

Selects 62 interesting features. We want to construct confidence intervals for each parameter. (See the code sketch below.)

What is a confidence interval?
• We observe data y generated by a distribution with true parameter θ0.
• A confidence interval J_i(y) with confidence level 1 − α satisfies P(θ_{0,i} ∈ J_i(y)) ≥ 1 − α.
• Confidence intervals are random objects.
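To make the "Diabetes Example" fit above concrete, here is a minimal sketch of an ℓ1-penalized logistic regression used for variable selection. It is an illustration under stated assumptions, not the talk's actual analysis: the design matrix, the sparse θ_true, and the regularization level C are synthetic stand-ins (the Practice Fusion data are not reproduced here).

```python
# Minimal sketch of an l1-penalized logistic regression for variable
# selection, in the spirit of the "Diabetes Example" slide. All data below
# are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 805                       # n patients, p features (as in the talk)
X = rng.standard_normal((n, p))       # placeholder design matrix
theta_true = np.zeros(p)
theta_true[:10] = 1.0                 # a few truly relevant features
prob = 1 / (1 + np.exp(-X @ theta_true))
y = (rng.random(n) < prob).astype(int)

# argmin { logistic loss + lambda * ||theta||_1 }; in scikit-learn, C ~ 1/lambda
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
theta_hat = fit.coef_.ravel()

selected = np.flatnonzero(theta_hat != 0)
print(f"{selected.size} features selected out of {p}")
```

Tuning the penalty (here the inverse parameter C) changes how many features are selected; the talk's fit on the real data selected 62.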
Why uncertainty assessment? Scientific discoveries
• Curry increases the cognitive capacity of the brain. [Tze-Pin Ng, 2006]
• Beautiful parents have more daughters than ugly parents. [Kanazawa 2006]
• Left-handedness in males has a significant effect on wage level. [Ruebeck 2006]
• "Why most published research findings are false." [John P. A. Ioannidis]
[Figure: the claims above, annotated with confidence levels such as 99% and 70%.]

Why uncertainty assessment? Decision making
[Figure: a state space split into a normal zone and an abnormal zone; we take measurements of the system state and must decide which zone it lies in.]

Why uncertainty assessment? Optimization / stopping rules
• First-order methods for large-scale data (coordinate descent, mirror descent, Nesterov's method, …).
• Optimization is a tool, not the goal!
[Figure: objective value versus iteration, with a stopping point marked.]

Reasoning about Uncertainty

Setup
• y = Xθ0 + W, with y ∈ R^n, X ∈ R^{n×p}, θ0 ∈ R^p, and W Gaussian noise with mean zero and covariance σ² I.

Lasso
• θ̂ = argmin_θ { (1/2n) ‖y − Xθ‖² + λ ‖θ‖_1 } [Tibshirani 1996; Chen, Donoho 1996]
• θ̂ is a deterministic, nonlinear function of the random data: what is its distribution?

Approach 1: Sample splitting
• Split the samples into X^(1) and X^(2).
• Run the Lasso on X^(1) to select a subset S of variables.
• Run least squares on X_S^(2); the distribution of the resulting estimator is known explicitly.
[Wasserman, Roeder 2009; Bühlmann, Meier, Meinshausen 2009]

Problems with sample splitting
• Have to cut half of the data.
• Assumes the Lasso on X^(1) selects all relevant features (plus possibly some others).
• The result depends on the particular split.

Approach 2: Bootstrap
• Resample the data and refit.
• Fails because of the bias!

Our approach: de-biasing the Lasso
• Classical setting n > p: least squares is unbiased, has Gaussian error, and admits a precise distributional characterization.
• Problem in high dimension (n < p): X^T X is not invertible!

Our approach: de-biasing the Lasso
• Use your favorite M and (try to) subtract the bias: θ̂^d = θ̂ + (1/n) M X^T (y − Xθ̂).
• With Σ̂ = X^T X / n, this gives √n (θ̂^d − θ0) = √n (I − M Σ̂)(θ̂ − θ0) + (1/√n) M X^T W: a bias term plus a Gaussian error term.

Geometric interpretation
[Figure: contours of (1/2n) ‖y − Xθ‖² in the (θ1, θ2) plane, the ℓ1 ball, and a subgradient of the ℓ1 norm at the Lasso solution.]

How should we choose M?
• The error splits into a bias term and a Gaussian error term; we want both small.

Choosing M?
• For each i (with m_i the i-th row of M and e_i = (0, 0, …, 1, 0, …, 0)): minimize Var(Error_i) subject to |Bias_i| ≤ ξ.
[Figure: variance versus bias; feasible and infeasible regions, with ξ = ξ* marked.]

What does it look like?
• M is not sparse!

Distribution of our estimator?
• Neglecting the bias term, √n (θ̂^d_i − θ_{0,i}) is approximately N(0, σ² [M Σ̂ M^T]_{ii}).

Distribution of our estimator?
[Figure: histogram of the standardized de-biased estimates and a Q-Q plot of their sample quantiles against standard normal quantiles; the "ground truth" comes from n_tot = 10000 records.]

Confidence intervals
[Figure: confidence intervals for the coefficients plotted against their indices; the blood-pressure coefficient is highlighted. Coverage: 93.6%.]

Main Theorem
Theorem (Javanmard, Montanari 2013). Assume X has i.i.d. subgaussian rows with covariance Σ whose eigenvalues stay bounded (away from zero and infinity) as the sample size grows. Then, asymptotically as n, p → ∞ with s = o(√n / log p), the de-biased estimator is asymptotically Gaussian: √n (θ̂^d_i − θ_{0,i}) ≈ N(0, σ² [M Σ̂ M^T]_{ii}).
What is s? The number of truly significant variables (the number of nonzero parameters).

Consequences
• A confidence interval for each individual parameter.
• The length of the confidence intervals does not depend on p.
• This is optimal.

Summary (so far)
• High dimensionality and regularized estimators.
• Uncertainty assessment for parameter estimates.
• Optimality.
• An R package will be available soon! (A code sketch of the basic construction appears just before the related-work slides.)

Further insights and related work

Two questions
• How general is this?
• What about smaller sample sizes?

Question 1: How to generalize it?
• Regularized estimators: θ̂ = argmin_θ { loss(θ) + regularizer(θ) }.
• Suppose that the loss decomposes over samples.

Question 1: How to generalize it?
• De-bias the regularized estimator in the same way.
• Find M by solving the same optimization problem (minimize the variance subject to a bias constraint ≤ ξ), with the sample covariance replaced by the Fisher information.

Question 2: How about smaller sample size?
• Estimation, prediction: n ≳ s log p samples suffice. [Candès, Tao 2007; Bickel et al. 2009]
• Uncertainty assessment, confidence intervals: the theorem above needs n ≫ (s log p)². [This talk]
• Can we match the optimal sample size n ≍ s log p?

Can we match the optimal sample size n ≍ s log p?
• Javanmard, Montanari, 2013: Gaussian designs; smaller sample size, with an exact asymptotic characterization.
• Javanmard, Montanari, 2013: nearly optimal sample size; confidence intervals have (nearly) optimal average length.
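To make the de-biasing construction concrete, below is a minimal sketch for the linear model y = Xθ0 + W: fit the Lasso, form θ̂^d = θ̂ + (1/n) M X^T (y − Xθ̂), and build per-coordinate confidence intervals from the Gaussian approximation with variance σ² [M Σ̂ M^T]_{ii} / n. For simplicity M is taken to be a ridge-regularized inverse of the sample covariance and σ is estimated crudely from the Lasso residuals; these are illustrative shortcuts, not the talk's row-by-row convex program (minimize variance subject to a bias constraint), and the constants (λ, the ridge level, the dimensions) are assumptions.

```python
# Minimal sketch of the de-biased Lasso and per-coordinate confidence
# intervals on synthetic data. M is a simple ridge-type surrogate, not the
# optimized M from the talk.
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s, sigma = 200, 400, 5, 1.0
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:s] = 2.0
y = X @ theta0 + sigma * rng.standard_normal(n)

# Lasso: argmin (1/2n) ||y - X theta||^2 + lam ||theta||_1
lam = 2 * sigma * np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_

# De-biased estimator: theta_d = theta_hat + (1/n) M X^T (y - X theta_hat)
Sigma_hat = X.T @ X / n
M = np.linalg.inv(Sigma_hat + 0.1 * np.eye(p))   # crude surrogate for M
theta_d = theta_hat + M @ X.T @ (y - X @ theta_hat) / n

# Plug-in noise estimate and 95% confidence intervals from the
# Gaussian approximation N(0, sigma^2 [M Sigma_hat M^T]_{ii} / n)
resid = y - X @ theta_hat
sigma_hat = np.sqrt(resid @ resid / max(n - np.count_nonzero(theta_hat), 1))
se = sigma_hat * np.sqrt(np.diag(M @ Sigma_hat @ M.T) / n)
z = stats.norm.ppf(0.975)
lower, upper = theta_d - z * se, theta_d + z * se
print("empirical coverage of true parameters:",
      np.mean((theta0 >= lower) & (theta0 <= upper)))
```

Swapping in an M obtained from the talk's optimization only changes the line that builds M; the de-biasing step and the interval construction stay the same.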
Related work
• Lockhart, Taylor, Tibshirani, Tibshirani, 2012: test significance along the Lasso path.
• Zhang, Zhang, 2012; van de Geer, Bühlmann, Ritov, 2013: assume structure on X; for random designs, the precision matrix Σ^{-1} is assumed to be sparse; optimality in terms of semiparametric efficiency.
• Bühlmann, 2012: tests are overly conservative.

Future directions

Two directions
• Uncertainty assessment for predictions.
• Other applications.

Thank you!