Grouped and Hierarchical Model Selection through Composite Absolute Penalties (CAP)
Bin Yu
Department of Statistics, UC Berkeley
Joint work with Peng Zhao and Guilherme Rocha
Thursday, June 30, 2016

Outline
- Motivation
- Background
  - Penalization methods
  - L1-penalty
- Composite Absolute Penalties (CAP)
  - Building blocks: $L_\gamma$-norm regularization
  - Definition
  - Interpretation
  - Algorithms
  - Examples and results

IT is Providing a Golden Time for Statistics
Computer technology advances in the IT age have tremendously increased our data collection abilities.
Diverse origins of IT data:
- IT core areas (IT phenomenon): information retrieval, information extraction, natural language processing, web search
- IT systems areas (CS hard-core): chip design, program debugging, network tomography
- IT influence areas (impacted by IT): remote sensing, astronomy, neuroscience, finance, …

Characteristics of Modern Data Set Problems
Goal is efficient use of data for:
- Prediction
- Interpretation
Larger number of variables:
- Number of variables (p) in data sets is large
- Sample sizes (n) have not increased at the same pace
Scientific opportunities:
- New findings in different scientific fields

Cyberinfrastructure challenges/opportunities for statistics
Cyberinfrastructure (NSF CS Div. Report in 2003): integrating computer technology into the very fabric of science to extract useful information from the data deluge.
For statistics to be part of the making of this cyberinfrastructure to solve problems outside statistics, it is necessary to integrate into our statistical framework the considerations or constraints of
- storage (databases)
- communication (streaming data, sensor networks)
- computation (including memory usage)
Feature selection (data reduction) is useful for the first two areas and interpretation, and it needs fast computation.

Automatic Feature Selection
Model selection is too expensive:
- Jornsten and Yu (2003) use the Minimum Description Length (MDL) principle for simultaneous gene selection and sample classification, with competitive results. They pre-selected about 100 genes from 6000.
- That still left $2^{100} \approx 1.3 \times 10^{30}$ possible subsets, so a combinatorial search over all subsets is too expensive.
Recent alternatives: continuous embedding into a convex optimization problem through Lasso, an instance of third generation computational methods in statistics.

Computation for Statistical Inference
- First generation computation in statistics, before computers: use parametric models with closed form solutions for maximum likelihood estimators or Bayes estimators.
- Second generation computation, with computers: design statistically optimal procedures and worry about computation later. Call optimization routines.
- Third generation computation: form statistical goals with computation in mind and take advantage of special features of statistical computation.

Lasso: L1-norm as a penalty
The L1 penalty is defined for coefficients $\beta = (\beta_1, \dots, \beta_p)$ as $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$.
Used initially with L2 loss:
- Signal processing: Basis Pursuit (Chen & Donoho, 1994)
- Statistics: LASSO (Tibshirani, 1996)
Properties:
- Sparsity (variable selection)
- Convexity (convex relaxation of the L0-penalty)
- "Stability" (non-negative garrote, Breiman, 1995)

Lasso: L1-norm as a penalty
Computation: the "right" tuning parameter $\lambda$ is unknown, so a "path" of solutions is needed (discretized or continuous).
- Initially: a quadratic program (QP) is solved for each value of $\lambda$ on a grid.
- Later: path following algorithms
  - homotopy by Osborne et al (2000)
  - LARS by Efron et al (2004)
Theoretical studies: much work recently…

General Penalization Methods
Given data $Z = \{(X_i, Y_i)\}_{i=1}^{n}$:
- $X_i$: a p-dimensional predictor
- $Y_i$: response variable
The parameters are defined by the penalized problem
$$\hat\beta(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda\, T(\beta) \right]$$
where
- $L(Z, \beta)$ is the empirical loss function
- $T(\beta)$ is a penalty function
- $\lambda$ is a tuning parameter

Beyond Sparsity of Individual Predictors: Natural Structures among Predictors
Rationale: side information might be available and/or additional regularization is needed beyond Lasso for p >> n.
Groups:
- Genes belonging to the same pathway
- Categorical variables represented by "dummies"
- Polynomial terms from the same variable
- Noisy measurements of the same variable
Hierarchy:
- Multi-resolution/wavelet models
- Interaction terms in factorial analysis (ANOVA)
- Order selection in Markov chain models

Composite Absolute Penalties (CAP): Overview
The CAP family of penalties:
- Highly customizable:
  - ability to perform grouped selection
  - ability to perform hierarchical selection
- Computational considerations:
  - Feasibility: convexity
  - Efficiency: piecewise linearity in some cases
- Define groups according to structure
- Combine properties of $L_\gamma$-norm penalties
- Encompass and go beyond existing works:
  - Elastic Net (Zou & Hastie, 2005)
  - GLASSO (Yuan & Lin, 2006)
  - Blockwise Sparse Regression (Kim, Kim & Kim, 2006)

Composite Absolute Penalties: Review of $L_\gamma$ Regularization
Given data $Z$ and loss function $L(Z, \beta)$:
$L_\gamma$ regularization:
- Penalty: $T(\beta) = \|\beta\|_\gamma^\gamma = \sum_{j=1}^{p} |\beta_j|^\gamma$
- Estimate: $\hat\beta(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda\, T(\beta) \right]$, where $\lambda > 0$ is a tuning parameter
For the squared error loss function:
- Hoerl & Kennard (1970): Ridge ($\gamma = 2$)
- Frank & Friedman (1993): Bridge (general $\gamma$)
- Tibshirani (1996): LASSO ($\gamma = 1$)

Composite Absolute Penalties: Definition
The CAP parameter estimate is given by
$$\hat\beta(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda\, T(\beta) \right], \qquad N_k = \|\beta_{G_k}\|_{\gamma_k}, \qquad T(\beta) = \|N\|_{\gamma_0}$$
- $G_k$, $k = 1, \dots, K$: indices of the k-th pre-defined group
- $\beta_{G_k}$: corresponding vector of coefficients
- $\|\cdot\|_{\gamma_k}$: group $L_{\gamma_k}$ norm, giving $N_k = \|\beta_{G_k}\|_{\gamma_k}$
- $\|\cdot\|_{\gamma_0}$: overall norm, giving $T(\beta) = \|N\|_{\gamma_0}$ with $N = (N_1, \dots, N_K)$
- groups may overlap (hierarchical selection)

Composite Absolute Penalties: A Bayesian Interpretation
For non-overlapping groups:
- Prior on group norms: [formula missing]
- Prior on individual coefficients: [formula missing]

Composite Absolute Penalties: Group Selection
Tailoring $T(\beta)$ for group selection:
- Define non-overlapping groups
- Set $\gamma_k > 1$ for all $k \neq 0$:
  - The group norm $\gamma_k$ tunes similarity within its group
  - $\gamma_k > 1$ causes all variables in group $k$ to be included/excluded together
- Set $\gamma_0 = 1$:
  - This yields grouped sparsity
- $\gamma_k = 2$ has been studied by Yuan and Lin (Grouped Lasso, 2005).
[Figure: contour plot for $\gamma_0 = 1$, $\gamma_1 = 2$, $\gamma_2 = 2$]

Composite Absolute Penalties: Hierarchical Structures
Tailoring $T(\beta)$ for hierarchical structure:
- Set $\gamma_0 = 1$
- Set $\gamma_i > 1$ for all $i$
- Groups overlap:
  - If $\beta_2$ appears in all groups where $\beta_1$ is included, then $X_2$ enters the model after $X_1$
As an example: [formula missing]

Composite Absolute Penalties: Hierarchical Structures
Represent the hierarchy by a directed graph: $X_1 \to X_2$
Then construct the penalty by: [formula missing]
For the graph above, with $\gamma_0 = 1$ and $\gamma_r = \infty$: [formula missing]

Composite Absolute Penalties: Computation
CAP with general $L_\gamma$ norms:
- Approximate algorithms available for tracing the regularization path
- Two examples:
  - Rosset (2004)
  - Boosted Lasso (Zhao and Yu, 2004): BLASSO
CAP with $L_1$ and $L_\infty$ norms:
- Exact algorithms for tracing the regularization path
- Some applications:
  - Grouped selection: iCAP
  - Hierarchical selection: hiCAP for ANOVA and wavelets

iCAP: Degrees of Freedom (DFs) for tuning parameter selection
Two ways of selecting the tuning parameter in iCAP:
1. Cross-validation
2. Model selection criterion AICc, where the DF used is a generalization to iCAP of the Lasso df of Zou et al (2004).
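
As a rough illustration of the CAP definition, with $N_k = \|\beta_{G_k}\|_{\gamma_k}$ and $T(\beta) = \|N\|_{\gamma_0}$, the penalty can be evaluated directly from a coefficient vector and a list of index groups. This is a minimal sketch and not the authors' code: the function names `lp_norm` and `cap_penalty`, the example coefficients, and the particular overlapping groups used for the hierarchical case are assumptions made for illustration only.

```python
import math


def lp_norm(values, gamma):
    """L_gamma norm of a sequence of numbers; gamma may be math.inf."""
    abs_vals = [abs(v) for v in values]
    if not abs_vals:
        return 0.0
    if math.isinf(gamma):
        return max(abs_vals)
    return sum(a ** gamma for a in abs_vals) ** (1.0 / gamma)


def cap_penalty(beta, groups, group_gammas, gamma0=1.0):
    """Evaluate T(beta) = ||(N_1, ..., N_K)||_{gamma0},
    where N_k = ||beta[G_k]||_{gamma_k}. Groups may overlap."""
    group_norms = [
        lp_norm([beta[j] for j in idx], g)
        for idx, g in zip(groups, group_gammas)
    ]
    return lp_norm(group_norms, gamma0)


# Hypothetical coefficient vector, chosen only for illustration.
beta = [3.0, -4.0, 0.0, 2.0]

# Grouped selection: non-overlapping groups, gamma_k = 2, gamma_0 = 1.
grouped = cap_penalty(beta, groups=[[0, 1], [2, 3]], group_gammas=[2, 2])

# Hierarchical selection (assumed grouping): the second coefficient appears
# in every group that contains the first; L-infinity group norms, gamma_0 = 1.
hier = cap_penalty(beta, groups=[[0, 1], [1]],
                   group_gammas=[math.inf, math.inf])

# Singleton groups: the penalty reduces to the Lasso L1 norm.
lasso = cap_penalty(beta, groups=[[0], [1], [2], [3]],
                    group_gammas=[2, 2, 2, 2])
```

With $\gamma_0 = 1$ the overall norm is simply the sum of the group norms, so the three calls differ only in how the groups are drawn and which within-group norm is applied.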

Simulation Studies: Summary of Results
Good prediction accuracy:
- Extra structure results in lower model errors
Sparsity/Parsimony:
- Less sparse models in the $l_0$ sense
- Sparser in terms of degrees of freedom
Estimated degrees of freedom (Group, iCAP only):
- Good choices for regularization parameter
- AICc: model errors close to CV

Grouping examples, Case 1: Settings
Goals:
- Comparison of different group norms
- Comparison of CV against AICc
Model: $Y = X\beta + \varepsilon$
Settings: [missing]
[Figure: coefficient profile]

Grouping example, Case 1: LASSO vs. iCAP sample paths
[Figure: two panels, "LASSO path" and "iCAP path"; x-axis: number of steps; y-axis: normalized coefficients]

Grouping example, Case 1: Comparison of norms and clusterings
[Figure: model error under 10-fold CV for LASSO and for CAP with $\gamma_k = \infty$, $\gamma_k = 4$ and $\gamma_k = 2$; panels for 0.5 K, 1.0 K and 1.5 K clusters]

Grouping example, Case 1: Comparison of selection
[Figure: model error under 10-fold CV, comparing CV and AICc selection for LASSO and for iCAP with 0.5 K, 1.0 K and 1.5 K clusters]

Grouping examples, Case 2: Settings
Goals:
- Comparison of performance when the number of predictors (p) grows
- Comparison of performance when the number of groups (K) grows
Model: $Y = X\beta + \varepsilon$
Settings:
- Coefficients are randomly selected:
  - Grouped Laplacian:
    - K coefficients drawn independently from a double exponential with a group-level parameter (subscript G)
    - Coefficients are repeated within groups
  - Individual Laplacian:
    - p coefficients drawn independently from a double exponential with an individual-level parameter (subscript I)

Grouping examples, Case 2: Comparison of model errors
[Figure: model error, AICc selection]

Grouping examples, Case 2: Comparison of "group sparsity"
[Figure: number of selected groups, AICc selection]

Grouping examples, Case 2: Comparison of degrees of freedom
[Figure: degrees of freedom, AICc selection]

Grouping 2: 250 predictors, 25 groups, Ind. Exp.
[Figure: paired t-test statistics; panels for 0.5 K, 1.0 K and 1.5 K]

ANOVA Hierarchical Selection: Simulation Setup
- 55 variables (10 main effects, 45 interactions)
- 121 observations
- 200 replications in the results that follow

ANOVA Hierarchical Selection: Model Error
[Figure]

ANOVA Hierarchical Selection: Number of Selected Terms
[Figure]

ANOVA Hierarchical Selection: Number of Terms in Complete Graph
[Figure]

Wavelet Tree Hierarchical Selection: Simulation Setup
- 15 basis functions
- 80 observations, 5 in each of the 16 time slots
- 5-fold cross-validation ("balanced")

Hierarchical Tree Selection: Model Error Comparisons
[Figure]

Hierarchical Tree Selection: Number of Selected Variables
[Figure]

Hierarchical Tree Selection: Number of Terms in Complete Tree
[Figure]

CAP: Group and Hierarchical Sparsity
CAP penalties:
- Are built from $L_\gamma$ "blocks"
- Allow incorporation of different structures into the fitted model:
  - Groups of variables
  - Hierarchy among predictors
Algorithms:
- Approximation using BLASSO for general CAP penalties
- Exact and efficient for particular cases (L2 loss, $L_1$ and $L_\infty$ norms)
Choice of regularization parameter $\lambda$:
- Cross-validation
- AICc for particular cases (L2 loss, $L_1$ and $L_\infty$ norms)

The Road Ahead
Extension of algorithms:
- GLMs: Park and Hastie (2006)'s algorithm for iCAP
- Boosted version of algorithms:
  - Steps in "groups" or "hierarchical" directions
Model selection consistency for CAP ($\gamma_0 = 1$):
- Can CAP select groups consistently?
- Is the irrepresentable condition on the "group" level sufficient/necessary?
More general CAP penalties:
- The "Rigid Net":
  - Two groups containing all variables, $\gamma_0 = 1$, $\gamma_1 = 1$, $\gamma_2 = \infty$
- One example: $\gamma_0 = \infty$, $\gamma_k = 1$:
  - Sparsity within groups?
  - Similarity across groups?

Codes: www.stat.berkeley.edu/~yugroup
Paper: www.stat.berkeley.edu/~binyu
Funding acknowledgements: NSF, ARO, Guggenheim Foundation