Reasoning about Uncertainty in
High-dimensional Data Analysis
Adel Javanmard
Stanford University
What is high-dimensional data?
• Modern data sets are both massive and fine-grained.
# features (variables) > # observations (samples)
• A trend in modern data analysis.
High-Dimensional Data: an example
[Figure: fields of an electronic medical record]
• Medical images
• Transcript records: age, gender, BMI, heart rate, billing information, …
• Allergies: type, reaction, severity, start year, stop year, …
• Diagnosis info: ICD9 codes, description, start year, stop year, …
• Medications: name, strength, schedule, …
• Lab results: HL7 text, value, abnormality, observation year, …
What can we do with such data?
• Extract useful, actionable information.
Health Care Reform: Heritage Health Prize, HITECH
• Predictive models for:
 clinical outcomes
 patient evolution
 readmission rate
• Design (or advise):
 treatment
 clinical interventions and trials
• More than 71 million persons are admitted to hospitals each year.
• Over $30 billion was spent on unnecessary hospital readmissions (2006).
Diabetes Example
• n = 500 (patients)
• p = 805 (variables): medical information such as medications, lab results, diagnosis, …
[Data from Practice Fusion, posted on Kaggle]
• Find significant variables in predicting Type 2-Diabetes
• “People with higher bilirubin are more susceptible to diabetes.”
 How certain are we about this claim?
Problem of Uncertainty Assessment
[Figure: estimated parameters plotted against variable indices]
How stable are these estimates?
What can we say about the true parameters?
Confidence intervals
[Figure: confidence intervals for the parameters plotted against variable index; the blood-pressure coefficient is highlighted]
Why is it hard?
• Low-dimensional regime (p fixed, n → ∞): Large Sample Theory applies.
• The situation in the high-dimensional regime is very different!
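As a reminder, classical large-sample theory (standard MLE asymptotics, stated here only for context) gives a precise Gaussian limit in the fixed-p regime:

\[
\sqrt{n}\,\bigl(\hat{\theta} - \theta_0\bigr) \;\Longrightarrow\; \mathcal{N}\bigl(0,\; I(\theta_0)^{-1}\bigr), \qquad p \text{ fixed},\ n \to \infty,
\]

where \(I(\theta_0)\) denotes the Fisher information.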
• Much progress has been achieved for
 high-dimensional parameter estimation
 high-dimensional variable/feature selection
 high-dimensional prediction
Supp(θ0)
[Tibshirani, Donoho, Cai, Zhou, Candès, Tao, Bickel, van de Geer, Ritov, Bühlmann, Meinshausen, Zhao, Yu, Wainwright, …]
How to assign measures of uncertainty to each single parameter?
Other examples
Targeted online advertising
Personalized medicine
Genomics
Social Networks
Collaborative filtering
Overview of Regularized Estimators
Regularized Estimators
• Exploit low-dimensional structure in the data:
minimize Loss + λ · Model Complexity
• Mitigates
 spurious correlations
 noise accumulation
 instability (to noise and sampling)
• This comes at a price.
 biased (towards small complexity)
 nonlinear and non-explicit
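In symbols, the generic template is (notation assumed here; L is the loss and R the complexity measure):

\[
\hat{\theta} \;=\; \arg\min_{\theta}\; \mathcal{L}(\theta;\, \text{data}) \;+\; \lambda\, R(\theta).
\]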
Diabetes Example
• y_i ∈ {0, 1}: patient i gets type-2 diabetes
• x_i ∈ R^p: variables of patient i
• θ_{0,j}: contribution of feature j

\[
\hat{\theta} = \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n}\Big[\log\big(1 + e^{x_i^{\top}\theta}\big) - y_i\, x_i^{\top}\theta\Big] + \lambda\,\|\theta\|_1 \qquad \text{(logistic loss + } \ell_1 \text{ penalty)}
\]

• Convex optimization
• Variable selection (some of θ̂_j = 0)
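As a rough illustration (not the talk’s exact pipeline), an ℓ1-penalized logistic fit of this form can be run with scikit-learn; the data below are synthetic stand-ins for the diabetes records:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the diabetes data: n = 500 patients, p = 805 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 805))
y = rng.integers(0, 2, size=500)

# L1-penalized logistic regression: logistic loss + lambda * ||theta||_1.
# sklearn parameterizes the penalty by C, roughly 1/lambda.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

# Variable selection: many coefficients come out exactly zero.
selected = np.flatnonzero(clf.coef_)
print(f"{selected.size} features selected")
```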
Selects 62 interesting features.
We want to construct confidence intervals for each parameter.
What is a confidence interval?
• We observe data y generated by a distribution with parameter θ0.
• Confidence intervals: random intervals [θ̂_i − δ_i, θ̂_i + δ_i] such that
P(θ_{0,i} ∈ [θ̂_i − δ_i, θ̂_i + δ_i]) ≥ 1 − α  (confidence level 1 − α).
Confidence intervals are random objects.
Why uncertainty assessment?
 Scientific discoveries (reported at, e.g., 99% or 70% confidence)
• Curry increases the cognitive capacity of the brain. [Tze-Pin Ng, 2006]
• Beautiful parents have more daughters than ugly parents. [Kanazawa, 2006]
• Left-handedness in males has a significant effect on wage level. [Ruebeck, 2006]
“Why most published research findings are false” [John P. A. Ioannidis]
Why uncertainty assessment?
 Decision making
System state θ0; we take measurements y.
[Figure: state space partitioned into a normal zone and an abnormal zone]
Why uncertainty assessment?
 Optimization / stopping rules
First-order methods for large-scale data (coordinate descent, mirror descent, Nesterov’s method, …)
Optimization is a tool, not the goal!
[Figure: error vs. iteration, with the stopping point marked]
Reasoning about Uncertainty
Setup
y = Xθ0 + W,  y ∈ R^n,  X ∈ R^{n×p},  θ0 ∈ R^p
W: Gaussian noise with mean zero and covariance σ² I.
Lasso
\[
\hat{\theta} = \arg\min_{\theta} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda \|\theta\|_1 \Big\}
\]
[Tibshirani 1996; Chen, Donoho 1996]
θ̂ is a deterministic function of the random data (y, X). What is the distribution of θ̂?
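A minimal numerical sketch of this estimator (synthetic data; the dimensions and the choice of λ are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic instance of y = X theta_0 + W.
n, p, s = 200, 500, 10
rng = np.random.default_rng(1)
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:s] = 1.0                          # s truly significant variables
sigma = 0.5
y = X @ theta0 + sigma * rng.standard_normal(n)

# sklearn's Lasso minimizes (1/2n)||y - X theta||^2 + alpha * ||theta||_1,
# matching the objective above with alpha = lambda.
lam = sigma * np.sqrt(2 * np.log(p) / n)  # a common theory-guided choice
theta_hat = Lasso(alpha=lam).fit(X, y).coef_
print("nonzeros:", np.count_nonzero(theta_hat))
```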
Approach 1: Sample splitting
• Split the rows of X into two halves, X^(1) and X^(2).
• Run the Lasso on the first half to select a subset S of variables.
• Run least squares on (y^(2), X_S^(2)); the distribution of this estimator is known explicitly.
[Wasserman, Roeder 2009; Bühlmann, Meier, Meinshausen 2009]
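A sketch of the two-step procedure on synthetic data (all dimensions and tuning values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic instance of y = X theta_0 + W.
rng = np.random.default_rng(2)
n, p = 200, 500
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:10] = 1.0
y = X @ theta0 + 0.5 * rng.standard_normal(n)

# Split the samples into two halves.
X1, y1 = X[:n // 2], y[:n // 2]
X2, y2 = X[n // 2:], y[n // 2:]

# Step 1: Lasso on the first half selects a subset S of variables.
S = np.flatnonzero(Lasso(alpha=0.1).fit(X1, y1).coef_)

# Step 2: least squares on the second half restricted to S; classical
# low-dimensional theory gives its distribution explicitly.
theta_S = LinearRegression().fit(X2[:, S], y2).coef_
```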
Problems with sample splitting
• Have to throw away half of the data.
• Assumes the Lasso on X^(1) selects all relevant features (plus possibly some others).
• The result depends on the particular split.
Approach 2: Bootstrap
Resample the data, recompute the estimator on each bootstrap sample, and read off the variability.
This fails because of the bias!
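For concreteness, a naive bootstrap for the Lasso might look like the sketch below (illustrative only; as the slide notes, the resulting intervals are invalidated by the Lasso’s bias):

```python
import numpy as np
from sklearn.linear_model import Lasso

def bootstrap_lasso(X, y, alpha, B=200, seed=0):
    """Return a B x p array of Lasso estimates on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        draws.append(Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_)
    return np.asarray(draws)
```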
Our approach: de-biasing Lasso
Classical setting (n > p): least squares θ̂ = (XᵀX)⁻¹ Xᵀy has Gaussian error θ̂ − θ0 ~ N(0, σ² (XᵀX)⁻¹).
 Unbiased estimator
 Precise distributional characterization
Problem in high dimension (n < p): XᵀX is not invertible!
Our approach: de-biasing Lasso
Use your favorite M ∈ R^{p×p} and let’s (try to) subtract the bias:

\[
\hat{\theta}^{u} = \hat{\theta} + \frac{1}{n} M X^{\top}(y - X\hat{\theta})
= \theta_0 + \underbrace{\frac{1}{n} M X^{\top} W}_{\text{Gaussian error}} + \underbrace{(I - M\widehat{\Sigma})(\hat{\theta} - \theta_0)}_{\text{Bias}},
\qquad \widehat{\Sigma} = \frac{X^{\top}X}{n}.
\]
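The correction itself is one line of linear algebra; a minimal sketch (M is a free input here, chosen on the following slides):

```python
import numpy as np

def debias(theta_hat, X, y, M):
    """De-biased estimator: theta_u = theta_hat + (1/n) M X^T (y - X theta_hat)."""
    n = X.shape[0]
    return theta_hat + (M @ X.T @ (y - X @ theta_hat)) / n
```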
Geometric interpretation
[Figure: contours of (1/2n)‖y − Xθ‖₂² and the ℓ1 ball in the (θ1, θ2) plane; at the Lasso solution, the gradient of the quadratic term is a subgradient of λ‖θ‖1]
How should we choose M?
Recall the decomposition into Bias, (I − MΣ̂)(θ̂ − θ0), and Gaussian error, (1/n) M XᵀW.
We want small bias and small error.
Choosing M?
For each i, solve:

minimize  mᵢᵀ Σ̂ mᵢ   (variance of Errorᵢ)
subject to  ‖Σ̂ mᵢ − eᵢ‖∞ ≤ ξ   (controls |Biasᵢ|)

• mᵢ: i-th row of M
• eᵢ: i-th standard basis vector (0, 0, …, 1, 0, …, 0)
[Figure: variance vs. bias tradeoff; ξ = ξ* separates the infeasible and feasible sets]
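A sketch of one row’s program, using cvxpy as a generic convex solver (Sigma_hat = XᵀX/n; the function name and interface are mine):

```python
import numpy as np
import cvxpy as cp

def choose_row(Sigma_hat, i, xi):
    """Solve: min m' Sigma_hat m  s.t.  ||Sigma_hat m - e_i||_inf <= xi."""
    p = Sigma_hat.shape[0]
    m = cp.Variable(p)
    e_i = np.zeros(p)
    e_i[i] = 1.0
    # psd_wrap tells cvxpy to treat Sigma_hat as PSD (it is, being X'X/n).
    objective = cp.Minimize(cp.quad_form(m, cp.psd_wrap(Sigma_hat)))
    constraints = [cp.norm_inf(Sigma_hat @ m - e_i) <= xi]
    cp.Problem(objective, constraints).solve()
    return m.value  # i-th row of M
```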
What does M look like?
M is not sparse!
Distribution of our estimator?
Neglecting the bias: θ̂ᵘ ≈ θ0 + (1/n) M XᵀW, so θ̂ᵘᵢ is approximately N(θ0,i, (σ²/n)[MΣ̂Mᵀ]ᵢᵢ).
Distribution of our estimator?
[Figure: histogram of the standardized estimator Z, and a Q-Q plot of the sample quantiles of Z against standard normal quantiles; both closely match the standard normal]
‘Ground truth’ from n_tot = 10000 records.
Confidence intervals
[Figure: confidence intervals for the coefficients plotted against variable indices; the blood-pressure coefficient is highlighted]
Coverage: 93.6%
Main Theorem
Theorem (Javanmard, Montanari 2013)
Assume X has i.i.d. subgaussian rows with covariance Σ, and that the eigenvalues of Σ stay bounded (away from 0 and ∞) as the sample size grows. Then, asymptotically as n → ∞, with s = o(√n / log p),

\[
\sqrt{n}\;\frac{\hat{\theta}^{u}_i - \theta_{0,i}}{\sigma\,\big[M \widehat{\Sigma} M^{\top}\big]_{ii}^{1/2}} \;\Longrightarrow\; \mathcal{N}(0, 1).
\]
 What is s?
number of truly significant variables (number of nonzero parameters).
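Putting the pieces together, coordinate-wise confidence intervals follow directly from this Gaussian limit. A sketch (assumes theta_u, M, Sigma_hat, sigma, n as in the earlier snippets; in practice σ must itself be estimated):

```python
import numpy as np
from scipy.stats import norm

def confidence_intervals(theta_u, M, Sigma_hat, sigma, n, alpha=0.05):
    """Coordinate-wise (1 - alpha) confidence intervals from the Gaussian limit."""
    # Std. error of theta_u_i: sigma * sqrt([M Sigma_hat M^T]_{ii} / n).
    se = sigma * np.sqrt(np.diag(M @ Sigma_hat @ M.T) / n)
    z = norm.ppf(1 - alpha / 2)
    return theta_u - z * se, theta_u + z * se
```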
Consequences
• Confidence interval for each individual parameter
• The length of the confidence intervals does not depend on p.
• This is optimal.
Summary (so far)
• High dimensionality and regularized estimators
• Uncertainty assessment for parameter estimates
• Optimality
R-package will be available soon!
Further insights and related work
Two questions
• How general?
• What about smaller sample size?
Question 1: How to generalize it?
Regularized estimators:

\[
\hat{\theta} = \arg\min_{\theta}\; \big\{ \text{loss}(\theta) + \lambda \cdot \text{regularizer}(\theta) \big\}
\]

Suppose that the loss decomposes over samples: loss(θ) = (1/n) Σᵢ ℓ(θ; yᵢ, xᵢ).
Question 1: How to generalize it?
• De-bias the regularized estimator: θ̂ᵘ = θ̂ − M ∇loss(θ̂)
• Find M by solving the same optimization problem, with Σ̂ replaced by the (empirical) Fisher information Î of the loss:
minimize  mᵢᵀ Î mᵢ   subject to  ‖Î mᵢ − eᵢ‖∞ ≤ ξ
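As an illustration for the logistic loss (the formulas are the standard gradient and Hessian of logistic regression; the function names are mine):

```python
import numpy as np

def debias_logistic(theta_hat, X, y, M):
    """theta_u = theta_hat - M grad_loss(theta_hat), logistic loss, y in {0,1}."""
    n = X.shape[0]
    mu = 1.0 / (1.0 + np.exp(-X @ theta_hat))  # predicted probabilities
    grad = X.T @ (mu - y) / n                  # gradient of the logistic loss
    return theta_hat - M @ grad

def empirical_fisher(theta_hat, X):
    """I_hat = (1/n) X' D X with D_ii = mu_i (1 - mu_i): Hessian of the loss."""
    n = X.shape[0]
    mu = 1.0 / (1.0 + np.exp(-X @ theta_hat))
    return (X.T * (mu * (1.0 - mu))) @ X / n
```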
Question 2: How about smaller sample size?
• Estimation, prediction: sample size n ≳ s log p suffices. [Candès, Tao 2007; Bickel et al. 2009]
• Uncertainty assessment, confidence intervals: need n ≳ (s log p)². [This talk]
Can we match the optimal sample size, n = O(s log p)?
Can we match the optimal sample size, n = O(s log p)?
• Javanmard, Montanari, 2013
 Sample size n = O(s log p).
 Standard Gaussian designs.
 Exact asymptotic characterization.
• Javanmard, Montanari, 2013
 Sample size n = O(s log p).
 Confidence intervals have (nearly) optimal average length.
Related work
• Lockhart, Taylor, Tibshirani, Tibshirani, 2012
 Test significance along the Lasso path.
• Zhang, Zhang, 2012; van de Geer, Bühlmann, Ritov, 2013
 Assume structure on X.
 For random designs, Σ⁻¹ is assumed to be sparse.
 Optimality in terms of semiparametric efficiency.
• Bühlmann, 2012
 Tests are overly conservative.
Future directions
Two directions
• Uncertainty assessment for predictions
• Other applications
Thank you!