Grouped and Hierarchical Model Selection through Composite Absolute Penalties (CAP)

Bin Yu
Department of Statistics, UC Berkeley
Joint work with
Peng Zhao and Guilherme Rocha
Outline

• Motivation
• Background
  – Penalization methods
  – L1-penalty
• Composite Absolute Penalties (CAP)
  – Building Blocks: Lγ-norm Regularization
  – Definition
  – Interpretation
  – Algorithms
  – Examples and Results
IT Is Providing a Golden Time for Statistics

Advances in computer technology in the IT age have tremendously increased
our data collection abilities.

Diverse origins of IT data:
• IT core areas (the IT phenomenon):
  information retrieval, information extraction, natural language
  processing, web search
• IT systems areas (hard-core CS):
  chip design, program debugging, network tomography
• IT influence areas (impacted by IT):
  remote sensing, astronomy, neuroscience, finance, …
Characteristics of Modern Data Set Problems

• Goal is efficient use of data for:
  – Prediction
  – Interpretation
• Large number of variables:
  – The number of variables (p) in data sets is large
  – Sample sizes (n) have not increased at the same pace
• Scientific opportunities:
  – New findings in different scientific fields
Cyberinfrastructure:
challenges/opportunities for statistics

Cyberinfrastructure (NSF CS Div. report, 2003): integrating computer
technology into the very fabric of science to extract useful information
from the data deluge.

For statistics to be part of building this cyberinfrastructure and to solve
problems outside statistics, we must integrate into our statistical
framework the considerations or constraints of:
• storage (databases)
• communication (streaming data, sensor networks)
• computation (including memory usage)

Feature selection (data reduction) is useful for the first two areas and
for interpretation, and it needs fast computation.
Automatic Feature Selection

Exhaustive model selection is too expensive:
Jornsten and Yu (2003) used the Minimum Description Length (MDL) principle
for simultaneous gene selection and sample classification, with competitive
results. They pre-selected about 100 genes from 6,000, which still left
2^100 ≈ 1.3 × 10^30 possible subsets. Combinatorial search over all subsets
is too expensive.

Recent alternative:
continuous embedding into a convex optimization problem through the Lasso --
a third-generation computational method in statistics.
Computation for Statistical Inference

First-generation computation in statistics (before computers): use
parametric models with closed-form solutions for maximum likelihood or
Bayes estimators.

Second-generation computation (with computers): design statistically
optimal procedures and worry about computation later, calling optimization
routines.

Third-generation computation: form statistical goals with computation in
mind and take advantage of special features of statistical computation.
Lasso: L1-norm as a penalty

• The L1 penalty on the coefficients β: T(β) = ||β||_1 = Σ_j |β_j|
• Used initially with L2 loss:
  – Signal processing: Basis Pursuit (Chen & Donoho, 1994)
  – Statistics: LASSO (Tibshirani, 1996)
• Properties:
  – Sparsity (variable selection)
  – Convexity (convex relaxation of the L0 penalty)
  – “Stability” (cf. the non-negative garrote, Breiman, 1995)
Lasso: L1-norm as a penalty

Computation:
the “right” tuning parameter λ is unknown, so the whole regularization path
is needed (discretized or continuous).
  – Initially: a quadratic program (QP) solved for each λ on a grid.
  – Later: path-following algorithms (see the sketch below)
    • homotopy by Osborne et al. (2000)
    • LARS by Efron et al. (2004)

Theoretical studies: much recent work…
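To make the path-following idea concrete, here is a minimal sketch (not the homotopy or LARS code from the references) that traces the full Lasso path with scikit-learn's lars_path on synthetic placeholder data.

```python
# Minimal sketch: trace the piecewise-linear Lasso path with LARS
# (illustrative only; not the code behind the talk's results).
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # sparse truth
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# alphas: breakpoints of the path (decreasing lambda)
# coefs:  p x len(alphas) matrix of coefficients along the path
alphas, _, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)   # one column per breakpoint of the path
```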
General Penalization Methods

• Given data Z = (X_i, Y_i), i = 1, …, n:
  – X_i: a p-dimensional predictor
  – Y_i: response variable
• The parameter estimate β̂(λ) is defined by the penalized problem
  (sketched below):
    β̂(λ) = argmin_β  L(Z; β) + λ T(β)
  where
  – L(Z; β) is the empirical loss function
  – T(β) is a penalty function
  – λ ≥ 0 is a tuning parameter
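The penalized-estimation template on this slide can be written generically; the sketch below is my illustration (not code from the talk), using squared-error loss and a smooth ridge penalty as the plug-in T(β), minimized numerically with scipy.

```python
# Generic penalized estimation: beta_hat(lambda) = argmin L(Z; beta) + lambda * T(beta).
# Illustrative sketch only; the loss and the penalty are interchangeable plug-ins.
import numpy as np
from scipy.optimize import minimize

def empirical_loss(beta, X, y):
    # squared-error loss L(Z; beta)
    return np.mean((y - X @ beta) ** 2)

def penalty(beta):
    # here a smooth ridge penalty; any T(beta) can be substituted
    return np.sum(beta ** 2)

def penalized_fit(X, y, lam):
    obj = lambda b: empirical_loss(b, X, y) + lam * penalty(b)
    return minimize(obj, x0=np.zeros(X.shape[1]), method="BFGS").x

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, 0.0, 2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(50)
print(penalized_fit(X, y, lam=0.5))
```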
Beyond Sparsity of Individual Predictors:
Natural Structures among Predictors

Rationale: side information may be available, and/or additional
regularization beyond the Lasso is needed when p >> n.

• Groups:
  – Genes belonging to the same pathway;
  – Categorical variables represented by “dummies”;
  – Polynomial terms from the same variable;
  – Noisy measurements of the same variable.
• Hierarchy:
  – Multi-resolution/wavelet models;
  – Interaction terms in factorial analysis (ANOVA);
  – Order selection in Markov chain models.
Composite Absolute Penalties (CAP)
Overview

The CAP family of penalties:
• Highly customizable:
  – ability to perform grouped selection
  – ability to perform hierarchical selection
• Computational considerations:
  – Feasibility: convexity
  – Efficiency: piecewise linearity in some cases
• Define groups according to structure
• Combine properties of Lγ-norm penalties
• Encompass and go beyond existing work:
  – Elastic Net (Zou & Hastie, 2005)
  – GLASSO (Yuan & Lin, 2006)
  – Blockwise Sparse Regression (Kim, Kim & Kim, 2006)
Composite Absolute Penalties
Review of Lγ Regularization

Given data Z = (X_i, Y_i), i = 1, …, n, and a loss function L(Z; β):

• Lγ regularization (sketched below):
  – Penalty: T_γ(β) = Σ_j |β_j|^γ
  – Estimate: β̂(λ) = argmin_β  L(Z; β) + λ T_γ(β),
    where λ > 0 is a tuning parameter
• For the squared-error loss function:
  – Hoerl & Kennard (1970): Ridge (γ = 2)
  – Frank & Friedman (1993): Bridge (general γ)
  – Tibshirani (1996): LASSO (γ = 1)
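A one-line evaluation of the Lγ penalty makes the Ridge/Bridge/LASSO correspondences concrete; this is only an illustrative numpy sketch of T_γ(β) = Σ_j |β_j|^γ.

```python
# L_gamma penalty T(beta) = sum_j |beta_j|^gamma  (illustrative sketch)
import numpy as np

def l_gamma_penalty(beta, gamma):
    return np.sum(np.abs(beta) ** gamma)

beta = np.array([2.0, -1.0, 0.0, 0.5])
print(l_gamma_penalty(beta, 1.0))   # gamma = 1: LASSO penalty
print(l_gamma_penalty(beta, 2.0))   # gamma = 2: ridge penalty
print(l_gamma_penalty(beta, 1.5))   # general gamma: bridge penalty
```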
Composite Absolute Penalties
Definition

• The CAP parameter estimate is given by (sketched below):
    β̂(λ) = argmin_β  L(Z; β) + λ T(β),
    with group norms N_k = ||β_{G_k}||_{γ_k} and T(β) = ||N||_{γ_0}^{γ_0} = Σ_k N_k^{γ_0}
  – G_k, k = 1, …, K: indices of the k-th pre-defined group
  – β_{G_k}: the corresponding vector of coefficients
  – || · ||_{γ_k}: the group Lγ_k norm
  – || · ||_{γ_0}: the overall norm combining the group norms N_k
  – groups may overlap (hierarchical selection)
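As a concrete reading of the definition, here is a hedged numpy sketch of the CAP penalty T(β) for given groups G_k and norms γ_k, γ_0; the function name and the group encoding (lists of coefficient indices, possibly overlapping) are my own, not from the paper.

```python
# CAP penalty: N_k = ||beta_{G_k}||_{gamma_k},  T(beta) = ||N||_{gamma_0}^{gamma_0}
# Sketch only; groups are lists of coefficient indices and may overlap.
import numpy as np

def cap_penalty(beta, groups, gammas, gamma0):
    beta = np.asarray(beta, dtype=float)
    # group norms N_k (np.inf gives the L-infinity norm)
    N = np.array([np.linalg.norm(beta[g], ord=gk) for g, gk in zip(groups, gammas)])
    if np.isinf(gamma0):
        return np.max(N)
    return np.sum(N ** gamma0)

beta = [1.0, -2.0, 0.0, 3.0]
groups = [[0, 1], [2, 3]]            # two non-overlapping groups
print(cap_penalty(beta, groups, gammas=[2, 2], gamma0=1))   # grouped-lasso-style penalty
```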
Composite Absolute Penalties
A Bayesian interpretation

• For non-overlapping groups:
  – Prior on group norms:
  – Prior on individual coefficients:
Composite Absolute Penalties
Group selection

• Tailoring T(β) for group selection (see the sketch below):
  – Define non-overlapping groups
  – Set γ_k > 1 for all k ≠ 0:
    • the group norm γ_k tunes similarity within its group
    • γ_k > 1 causes all variables in a group to be included/excluded together
  – Set γ_0 = 1:
    • this yields grouped sparsity
  – γ_k = 2 has been studied by Yuan and Lin (Grouped Lasso, 2005).

[Contour plot for γ_0 = 1, γ_1 = 2, γ_2 = 2]
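For the grouped-selection configuration above (γ_0 = 1, γ_k = 2, non-overlapping groups), the CAP penalty reduces to a sum of groupwise L2 norms, i.e. the Group Lasso penalty; a standalone sketch:

```python
# Group-selection special case of CAP: gamma_0 = 1, gamma_k = 2
# T(beta) = sum_k ||beta_{G_k}||_2  (the Group Lasso penalty) -- sketch only.
import numpy as np

def grouped_l2_penalty(beta, groups):
    beta = np.asarray(beta, dtype=float)
    return sum(np.linalg.norm(beta[g], ord=2) for g in groups)

beta = np.array([0.0, 0.0, 1.0, -1.0, 2.0])
groups = [[0, 1], [2, 3], [4]]            # non-overlapping groups
print(grouped_l2_penalty(beta, groups))   # first group contributes 0: dropped as a unit
```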
Composite Absolute Penalties
Hierarchical Structures

• Tailoring T(β) for hierarchical structure:
  – Set γ_0 = 1
  – Set γ_k > 1 for all k
  – Let the groups overlap:
    • if X2 appears in every group in which X1 appears,
      then X2 enters the model only after X1
• As an example, two variables with X1 before X2:
  groups G_1 = {1, 2} and G_2 = {2}, giving T(β) = ||(β_1, β_2)||_{γ_1} + |β_2|
Composite Absolute Penalties
Hierarchical Structures

• Represent the hierarchy by a directed graph:  X1 → X2
• Then construct the penalty by taking, for each node, a group consisting of
  that node and all of its descendants (sketched below)
• For the graph above, with γ_0 = 1 and γ_r = ∞:
    T(β) = ||(β_1, β_2)||_∞ + |β_2|
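A small sketch of this graph-based construction (the node/children encoding is mine): each node's group is the node plus all of its descendants, and with γ_0 = 1, γ_r = ∞ the penalty sums the L∞ norms of these overlapping groups.

```python
# Hierarchical CAP penalty from a directed graph (gamma_0 = 1, gamma_r = inf).
# Each node's group = {node} + its descendants; groups overlap -- sketch only.
import numpy as np

def descendants(node, children):
    out, stack = set(), list(children.get(node, []))
    while stack:
        v = stack.pop()
        if v not in out:
            out.add(v)
            stack.extend(children.get(v, []))
    return out

def hierarchical_penalty(beta, children):
    beta = np.asarray(beta, dtype=float)
    total = 0.0
    for node in range(len(beta)):
        group = [node] + sorted(descendants(node, children))
        total += np.max(np.abs(beta[group]))     # L-infinity norm of the group
    return total

# Graph X1 -> X2 (0-indexed): T(beta) = ||(beta1, beta2)||_inf + |beta2|
children = {0: [1], 1: []}
print(hierarchical_penalty(np.array([2.0, 0.5]), children))   # 2.0 + 0.5 = 2.5
```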
Composite Absolute Penalties
Computation

• CAP with general Lγ norms
  – Approximate algorithms are available for tracing the regularization path
  – Two examples:
    • Rosset (2004)
    • Boosted Lasso (Zhao and Yu, 2004): BLASSO
• CAP with L1–L∞ norms
  – Exact algorithms for tracing the regularization path
  – Some applications:
    • Grouped selection: iCAP
    • Hierarchical selection: hiCAP for ANOVA and wavelets
iCAP:
Degrees of Freedom (DFs) for tuning parameter selection

Two ways of selecting the tuning parameter λ in iCAP (sketched below):
1. Cross-validation
2. The model selection criterion AICc,
   where the DF used is a generalization to iCAP of Zou et al. (2004)'s
   DF for the Lasso.
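As one hedged illustration of DF-based tuning-parameter selection: the sketch below scores points along a path with the standard Gaussian small-sample AICc, taking the RSS values and the (generalized) DF estimates as inputs; the exact criterion and DF formula used for iCAP are in the paper, not here.

```python
# Selecting lambda along a path with a DF-based AICc (illustrative only;
# standard Gaussian AICc form, with DF supplied externally, e.g. from a
# generalized-DF estimate as mentioned on the slide).
import numpy as np

def aicc(rss, n, df):
    # AIC = n*log(RSS/n) + 2*df, plus the small-sample correction term
    return n * np.log(rss / n) + 2 * df + 2 * df * (df + 1) / (n - df - 1)

def select_lambda(lambdas, rss_path, df_path, n):
    scores = [aicc(r, n, d) for r, d in zip(rss_path, df_path)]
    return lambdas[int(np.argmin(scores))]

# toy usage: pick the lambda minimizing AICc along a 3-point path
print(select_lambda(np.array([1.0, 0.5, 0.1]),
                    rss_path=[40.0, 25.0, 22.0],
                    df_path=[2.0, 5.0, 9.0], n=50))
```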
Simulation Studies
Summary of Results

• Good prediction accuracy:
  – extra structure results in lower model errors
• Sparsity/parsimony:
  – less sparse models in the ℓ0 sense
  – sparser in terms of degrees of freedom
• Estimated degrees of freedom (grouped selection, iCAP only):
  – good choices of the regularization parameter
  – AICc: model errors close to CV
Grouping examples
Case 1: Settings

• Goals:
  – Coefficient profile
  – Comparison of different group norms
  – Comparison of CV against AICc
• Model: Y = Xβ + ε
• Settings:
Grouping example
Case 1: LASSO vs. iCAP sample paths

[Figure: normalized coefficients vs. number of steps, for the LASSO path and the iCAP path]
Grouping example
Case 1: Comparison of norms and clusterings

[Figure: 10-fold CV model error for LASSO and for iCAP with γ_k = 2, 4, ∞,
in the 0.5K, 1.0K, and 1.5K cluster settings]
Grouping example
Case 1: Comparison of λ selection

[Figure: model error for LASSO (CV) and for iCAP with CV and AICc selection,
in the 0.5K, 1.0K, and 1.5K cluster settings]
Grouping examples
Case 2: Settings

• Goals:
  – Comparison of performance as the number of predictors (p) grows
  – Comparison of performance as the number of groups (K) grows
• Model: Y = Xβ + ε
• Settings:
  – Coefficients are randomly generated (sketched below):
    • Grouped Laplacian:
      – K coefficients drawn independently from a double exponential with parameter λ_G
      – coefficients are repeated within groups
    • Individual Laplacian:
      – p coefficients drawn independently from a double exponential with parameter λ_I
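A minimal sketch of the two coefficient-generation schemes described above; lam_G and lam_I stand in for the scale parameters whose symbols were lost from the slide.

```python
# Coefficient generation for the Case 2 simulation (sketch only; lam_G / lam_I
# stand in for the scale parameters on the slide).
import numpy as np

def grouped_laplacian(K, group_size, lam_G, rng):
    # K draws from a double exponential, each repeated within its group
    group_coefs = rng.laplace(loc=0.0, scale=lam_G, size=K)
    return np.repeat(group_coefs, group_size)

def individual_laplacian(p, lam_I, rng):
    # p independent double-exponential draws
    return rng.laplace(loc=0.0, scale=lam_I, size=p)

rng = np.random.default_rng(2)
print(grouped_laplacian(K=5, group_size=4, lam_G=1.0, rng=rng))   # p = 20 coefficients
print(individual_laplacian(p=20, lam_I=1.0, rng=rng))
```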
Grouping examples
Case 2: Comparison of model errors

[Figure: model error under AICc selection]
Grouping examples
Case 2: Comparison of “group sparsity”

[Figure: number of selected groups under AICc selection]
Grouping examples
Case 2: Comparison of degrees of freedom

[Figure: degrees of freedom under AICc selection]
Grouping 2: 250 predictors, 25 groups, Ind. Exp.
Paired t-test statistics

[Table: paired t-test statistics comparing methods across the 0.5K, 1.0K, and 1.5K settings]
ANOVA Hierarchical Selection
Simulation Setup (sketched below)

• 55 variables (10 main effects, 45 interactions)
• 121 observations
• 200 replications in the results that follow
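To make the setup concrete, a sketch of assembling a design with 10 main effects and all 45 pairwise interactions (only the counts come from the slide; the distributional details are placeholders):

```python
# Building an ANOVA-style design: 10 main effects plus all 45 pairwise
# interactions = 55 columns (sketch; matches only the counts on the slide).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p_main = 121, 10
X_main = rng.standard_normal((n, p_main))

# interaction columns X_i * X_j for all i < j  (10 choose 2 = 45 columns)
X_inter = np.column_stack([X_main[:, i] * X_main[:, j]
                           for i, j in combinations(range(p_main), 2)])
X = np.column_stack([X_main, X_inter])
print(X.shape)   # (121, 55)
```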
ANOVA Hierarchical Selection
[Figure: Model Error]

ANOVA Hierarchical Selection
[Figure: Number of Selected Terms]

ANOVA Hierarchical Selection
[Figure: Number of Terms in Complete Graph]
Wavelet Tree Hierarchical Selection
Simulation Setup

• 15 basis functions
• 80 observations, 5 in each of the 16 time slots
• 5-fold cross-validation (“balanced”)
Hierarchical Tree Selection
[Figure: Model Error Comparisons]

Hierarchical Tree Selection
[Figure: Number of Selected Variables]

Hierarchical Tree Selection
[Figure: Number of Terms in Complete Tree]
Simulation Studies
Summary of Results

• Good prediction accuracy:
  – extra structure results in lower model errors
• Sparsity/parsimony:
  – less sparse models in the ℓ0 sense
  – sparser in terms of degrees of freedom
• Estimated degrees of freedom (grouped selection, iCAP only):
  – good choices of the regularization parameter
  – AICc: model errors close to CV
CAP: Group and Hierarchical Sparsity

• CAP penalties:
  – are built from Lγ “blocks”
  – allow incorporating different structures into the fitted model:
    • groups of variables
    • hierarchy among predictors
• Algorithms:
  – approximation using BLASSO for general CAP penalties
  – exact and efficient for particular cases (L2 loss, L1 and L∞ norms)
• Choice of the regularization parameter λ:
  – cross-validation
  – AICc for particular cases (L2 loss, L1 and L∞ norms)
The Road Ahead

• Extension of algorithms:
  – GLMs: Park and Hastie (2006)’s algorithm for iCAP
  – Boosted version of the algorithms:
    • steps in “group” or “hierarchical” directions
• Model selection consistency for CAP (γ_0 = 1):
  – Can CAP select groups consistently?
  – Is an irrepresentable condition at the “group” level sufficient/necessary?
• More general CAP penalties:
  – The “Rigid Net”: two groups containing all variables, γ_0 = 1, γ_1 = 1, γ_2 = ∞
  – One example, γ_0 = ∞, γ_k = 1:
    • sparsity within groups?
    • similarity across groups?
Codes: www.stat.berkeley.edu/~yugroup
Paper: www.stat.berkeley.edu/~binyu
Funding acknowledgements:
NSF
ARO
Guggenheim Foundation