SLAC REPORT-355
STAN-LCS-106
UC-405 (M)

INTERPRETABLE PROJECTION PURSUIT*

SALLY CLAIRE MORTON

Stanford Linear Accelerator Center
Stanford University, Stanford, California 94309

OCTOBER 1989

Prepared for the Department of Energy under contract number DE-AC03-76SF00515.

Printed in the United States of America. Available from the National Technical Information Service, U.S. Department of Commerce, 5285 Port Royal Road, Springfield, Virginia 22161. Price: Printed Copy A06; Microfiche A01.

*Ph.D. Dissertation

Abstract

The goal of this thesis is to modify projection pursuit by trading accuracy for interpretability. The modification produces a more parsimonious and understandable model without sacrificing the structure which projection pursuit seeks. The method retains the nonlinear versatility of projection pursuit while clarifying the results.

Following an introduction which outlines the dissertation, the first and second chapters contain a description of the technique as applied to exploratory projection pursuit and projection pursuit regression respectively. The interpretability of a projection is measured as the simplicity of the coefficients which define its linear projections. Several interpretability indices for a set of vectors are defined based on the ideas of rotation in factor analysis and entropy. A roughness penalty approach is used to search for a more parsimonious description. The two methods require slightly different interpretability indices due to their contrary goals. In the former case, a rotationally invariant projection index is needed and defined. In the latter, alterations to the original algorithm are required. The computational algorithms for both interpretable exploratory projection pursuit and interpretable projection pursuit regression are described. Examples of real data are considered in each situation.

The third chapter deals with the connections between the proposed modification and other ideas which seek to produce more interpretable models. The modification as applied to linear regression is shown to be analogous to a nonlinear continuous method of variable selection. It is compared with other variable selection techniques and is analyzed in a Bayesian context. Possible extensions to other data analysis methods are cited and avenues for future research are identified.

The conclusion addresses the issue of sacrificing accuracy for parsimony in general. An example of calculating the tradeoff between accuracy and interpretability due to a common simplifying action, namely rounding the binwidth for a histogram, illustrates the applicability of the approach.

Acknowledgments

I am grateful to my principal advisor Jerry Friedman for his guidance and enthusiasm. I also thank my secondary advisors and examiners: Persi Diaconis, Brad Efron, Joe Oliger, Art Owen; my teachers and colleagues: Ani Adhikari, Kirk Cameron, Tom DiCiccio, David Draper, Jim Hodges, Iain Johnstone, Mark Knowles, Michael Martin, Daryl Pregibon, John Rolph, Joe Romano, Anne Sheehy, David Siegmund, Hal Stern; and my friends: Mark Barnett, Ginger Brower, Renata Byl, Ray Cowan, Marty Dart, Judi Davis, Glen Diener, Heather Gordon, Arla Haggerty, Curt Lasher, Holly LeCount, Alice Lundin, Michele Marincovich, Mike Strange and Joan Winters.

This work was supported in part by the Department of Energy, Grant DE-AC03-76SF00515.

I dedicate this thesis to my parents, sister, and brothers, who inspire me by example.

Table of Contents
Abstract
Acknowledgments
Introduction
1. Interpretable Exploratory Projection Pursuit
   1.1 The Original Exploratory Projection Pursuit Technique
       1.1.1 Introduction
       1.1.2 The Algorithm
       1.1.3 The Legendre Projection Index
       1.1.4 The Automobile Example
   1.2 The Interpretable Exploratory Projection Pursuit Approach
       1.2.1 A Combinatorial Strategy
       1.2.2 A Numerical Optimization Strategy
   1.3 The Interpretability Index
       1.3.1 Factor Analysis Background
       1.3.2 The Varimax Index For a Single Vector
       1.3.3 The Entropy Index For a Single Vector
       1.3.4 The Distance Index For a Single Vector
       1.3.5 The Varimax Index For Two Vectors
   1.4 The Algorithm
       1.4.1 Rotational Invariance of the Projection Index
       1.4.2 The Fourier Projection Index
       1.4.3 Projection Axes Restriction
       1.4.4 Comparison With Factor Analysis
       1.4.5 The Optimization Procedure
   1.5 Examples
       1.5.1 An Easy Example
       1.5.2 The Automobile Example
2. Interpretable Projection Pursuit Regression
   2.1 The Original Projection Pursuit Regression Technique
       2.1.1 Introduction
       2.1.2 The Algorithm
       2.1.3 Model Selection Strategy
   2.2 The Interpretable Projection Pursuit Regression Approach
       2.2.1 The Interpretability Index
       2.2.2 Attempts to Include the Number of Terms
       2.2.3 The Optimization Procedure
   2.3 The Air Pollution Example
3. Connections and Conclusions
   3.1 Interpretable Linear Regression
   3.2 Comparison With Ridge Regression
   3.3 Interpretability as a Prior
   3.4 Future Work
       3.4.1 Further Connections
       3.4.2 Extensions
       3.4.3 Algorithmic Improvements
   3.5 A General Framework
       3.5.1 The Histogram Example
       3.5.2 Conclusion
Appendix A
   A.1 Interpretable Exploratory Projection Pursuit Gradients
   A.2 Interpretable Projection Pursuit Regression Gradients
References

Figure Captions

[1.1] Most structured projection scatterplot of the automobile data according to the Legendre index.
[1.2] Varimax interpretability index for q = 1, p = 2.
[1.3] Varimax interpretability index for q = 1, p = 3.
[1.4] Varimax interpretability index contours for q = 1, p = 3.
[1.5] Simulated data with n = 200 and p = 2.
[1.6] Projection and interpretability indices versus λ for the simulated data.
[1.7] Projected simulated data histograms for various values of λ.
[1.8] Most structured projection scatterplot of the automobile data according to the Fourier index.
[1.9] Most structured projection scatterplot of the automobile data according to the Legendre index.
[1.10] Projection and interpretability indices versus λ for the automobile data.
[1.11] Projected automobile data scatterplots for various values of λ.
[1.12] Parameter trace plots for the automobile data.
[1.13] Country of origin projection scatterplot of the automobile data.
[2.1] Fraction of unexplained variance U versus number of terms m for the air pollution data.
[2.2] Model paths for the air pollution data for models with number of terms m = 1, ..., 6.
[2.3] Model paths for the air pollution data for models with number of terms m = 7, 8, 9.
[2.4] Draftsman's display for the air pollution data linear regression.
[3.1] Interpretable linear regression.
[3.2] Interpretability prior density for p = 2.
[3.3] Percent change in IMSE versus multiplying fraction e in the binwidth example.

List of Tables

[1.1] Most structured Legendre plane index values for the automobile data.
[1.2] Linear combinations for the automobile data.
[1.3] Abbreviated linear combinations for the automobile data.

Introduction

The goal of this thesis is to modify projection pursuit by trading accuracy for more interpretability in the results. The two techniques examined are exploratory projection pursuit (Friedman 1987) and projection pursuit regression (Friedman and Stuetzle 1981). The former is an exploratory data analysis method which produces a description of a group of variables. The latter is a formal modeling procedure which determines the relationship of a dependent variable to a set of predictors.

The common outcome of all projection pursuit methods is a collection of vectors which define the directions of the linear projections, and a nonlinear component which is represented pictorially rather than mathematically. Given a dataset of n observations of p variables each, suppose q projections are made. The resulting projection matrix A is q × p, with each row corresponding to a projection. In exploratory projection pursuit, the description contains the linear combinations and the projected data points. In projection pursuit regression, the model contains the linear combinations of the predictors and the smooths of the dependent variable versus the projected predictors.
The statistician who is faced with such a description must try to understand each of these vectors, both singly and as a group, in the context of the original p variables and in relation to the nonlinear components. The purpose of this thesis is to illustrate a method for trading some of the accuracy in the description or model in return for more interpretability in the matrix A. The object is to retain the versatility and flexibility of this promising technique while increasing the clarity and parsimony of the results.

In this dissertation, interpretability is used in a similar yet broader sense than parsimony, or simplicity, which may be thought of as a special case. The principle of parsimony is that as few parameters as possible should be used in a description or model. Tukey stated the concept in 1961 as 'It may pay not to try to describe in the analysis the complexities that are really present in the situation.' Several methods exist which choose more parsimonious models, such as Mallows' (1973) Cp in linear regression, which balances the number of parameters and prediction error. Another example is the work of Dawes (1979), who restricts the parameters in linear models based on standardized data to be 0 or ±1, calling the resulting descriptions 'improper' linear models. His conclusion is that these models predict almost as well as ordinary linear regressions and do better than clinical intuition based on experience.

Throughout this thesis, accuracy is not measured in terms of the prediction of future observations but rather refers to the goodness-of-fit of the model or description to the particular set of data. Parsimony, while considering solely the number of parameters in a model, shares the general philosophical goals of interpretability. These goals are to produce results which are

(i) easier to understand,
(ii) easier to compare,
(iii) easier to remember,
(iv) easier to explain.

The adjective 'simple' is used interchangeably with 'interpretable' throughout this thesis, as is 'complex' with 'uninterpretable'. However, the new term interpretability is included in part to distinguish this notion from that of simplicity, which receives extensive treatment in the philosophical and Bayesian literature.
The quantification of interpretability is a difficult mathematical problem. The concept is not easy to define or measure. As Sober (1975) comments, 'the diversity of our intuitions about simplicity ... presents a veritable chaos of opinion.' Fortunately, some situations are easier than others. In particular, linear combinations as produced by projection pursuit and many other data analysis methods lend themselves readily to the development of an interpretability index. This index can serve as a 'cognostic' (Tukey 1983), or a diagnostic to be interpreted by a computer rather than a human, which can search for more interpretable results in an automatic manner.

Exploratory projection pursuit is considered first in Chapter 1. A short review of the original method motivates the simplifying modification. The approach chosen is supported versus alternative strategies, and various interpretability indices are developed. The modification requires changes to the original exploratory projection pursuit algorithm. An example of the resulting interpretable method is examined. Projection pursuit regression is considered in a similar manner in Chapter 2. The differing goals of this second procedure compel changes in the interpretability index. The chapter concludes with an example.

Chapter 3 connects the new approach with established techniques of trading accuracy for interpretability. Extensions of this modification to other data analysis methods that might benefit are proposed. The thesis closes with a general application of the tradeoff between accuracy and interpretability. This work is an example of an approach which simplifies the complex results of a novel statistical method. The hope is that the framework described will be used by others in similar circumstances.

Chapter 1
Interpretable Exploratory Projection Pursuit

In this chapter, interpretable exploratory projection pursuit is demonstrated. Section 1.1 presents the basic concepts and goals of the original exploratory projection pursuit technique and outlines the algorithm. An example which provides the motivation for the new approach is included. The modification is described and support for the strategy chosen is given in the next section. The new method requires that a simplicity index be defined, which is discussed in Section 1.3. Section 1.4 details the algorithm, and its application to the example is described in the final section.

1.1 The Original Exploratory Projection Pursuit Technique

Exploratory projection pursuit (Friedman 1987) is an extension and improvement of the algorithm presented by Friedman and Tukey in 1974. It is a computer intensive data analysis tool for understanding high dimensional datasets. The method helps the statistician look for structure in the data without requiring any initial assumptions, while providing the basis for future formal modeling.

1.1.1 Introduction

Classical multivariate methods such as principal components analysis or discriminant analysis can be used successfully when the data is elliptical or normal in nature and well-described by its first few moments. Exploratory projection pursuit is designed to deal with the type of nonlinear structure these older techniques are ill-equipped to handle. The method linearly projects the data cloud onto a line (one dimensional exploratory projection pursuit) or onto a plane (two dimensional exploratory projection pursuit). By reducing the dimensionality of the problem while maintaining the same number of datapoints, the technique overcomes the 'curse of dimensionality' (Bellman 1961). This malady is due to the fact that a huge number of points is required before structure is revealed in high dimensional space.
A linear projection is chosen for two reasons (Friedman 1987). First, a linear projection is easier to understand as it consists of one or two linear combinations of the original variables. Second, a linear projection does not exaggerate structure in the data as it is a smoothed shadow of the actual datapoints.

The goal of exploratory projection pursuit is to find projections which exhibit structure. Initially, the idea was to let the statistician interactively choose interesting views by eye (McDonald 1982, Asimov 1985). Because the time required for an exhaustive search was prohibitive, the method was automated. Structure is no longer measured on a multidimensional human pattern recognition scale but rather on a univariate one: a projection index measures the structure of a view, and the space of views is searched via a computer numerical optimization. This simplification means that numerous possible indices may be defined. Thus the scheme, while making the analysis feasible for large datasets, requires the careful choice of a projection index.

As Huber (1985) points out, many classical techniques are forms of exploratory projection pursuit for specific projection index choices. For example, consider principal components analysis. Let Y be a random vector in R^p. The definition of the jth principal component is the solution of the maximization problem

    max over β_j of  Var[β_j^T Y]
    subject to  β_j^T β_j = 1  and  β_j^T Y uncorrelated with all previous principal components.

Thus, principal components analysis is an example of one dimensional projection pursuit with variance (Var) as the projection index. Variance is a global measure of how structured a view is. In contrast, the novelty and applicability of exploratory projection pursuit lies in its ability to recognize nonlinear or local structure. As remarked previously, many definitions of structure and corresponding projection index choices exist.

All present indices, however, depend on the same basic premise. The idea is that though 'interesting' is difficult to define or agree on, 'uninteresting' is clearly normality. Friedman (1987) and Huber (1985) provide theoretical support for this choice. Effectively, a normal distribution can be explored adequately using traditional methods which examine covariance structure; exploratory projection pursuit attempts to address situations for which these methods are not applicable. The statistician must choose a method by which to measure distance from this normal origin, and it is this choice that leads the algorithm to views she holds interesting. The choice is thus based on individual preference for the type of structure to be found. Desirable computing properties, which have so far been neglected in this discussion, such as metric and invariance considerations, also affect the decision. These index properties are discussed in the concrete context of a particular choice as the original algorithm is detailed in the next subsection.
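The principal components example above can be rendered concretely. The sketch below (Python, written for this presentation and not part of the thesis) performs one dimensional projection pursuit with G = Var by a coarse search over random unit directions, the same first stage a projection pursuit optimizer uses; the data, seed, and all names are illustrative, and the winning direction should approximately match the leading eigenvector of the covariance.

    import numpy as np

    rng = np.random.default_rng(0)
    # Illustrative data: independent columns with standard deviations 3, 2, 1, 0.5.
    Y = rng.standard_normal((500, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

    def G_var(beta):
        # Projection index G = Var[beta^T Y]: the principal components choice.
        return np.var(Y @ beta)

    # Coarse search over random unit directions.
    cands = rng.standard_normal((20000, 4))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    best = cands[np.argmax([G_var(b) for b in cands])]

    # The winner should be close to the leading eigenvector of the covariance.
    lead = np.linalg.eigh(np.cov(Y, rowvar=False))[1][:, -1]
    print(abs(best @ lead))    # approximately 1: same direction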
1.1.2 The Algorithm

Two dimensional exploratory projection pursuit is more interesting and useful than one dimensional, so the former is discussed solely. The two dimensional version also raises intriguing problems which need not be addressed in the one dimensional case. For the present, the goal is to find one structured plane. Actually, the data may have several interesting views or local projection index optima, and the interpretable exploratory projection pursuit algorithm should find as many as possible; Section 1.5.3 addresses this point.

The original algorithm presented by Friedman (1987) is reviewed in the abstract version due to Huber (1985), thereby simplifying the notation. Thus, though the data consists of n observations of length p, consider first a random variable Y ∈ R^p. The goal is to find linear combinations (β1, β2) which solve

    max G(β1^T Y, β2^T Y)
    subject to  Var[β1^T Y] = Var[β2^T Y] = 1  and  Cov[β1^T Y, β2^T Y] = 0          [1.1]

where G is the projection index which measures the structure of the projection density. The constraints ensure that the structure seen in the plane is not due to covariance (Cov) effects, which can be dealt with by classical methods.

Initially, the original data Y is sphered (Tukey and Tukey 1981). The sphered variable Z ∈ R^p is defined as

    Z ≡ D^{-1/2} U^T (Y - E[Y])          [1.2]

with matrices U and D resulting from an eigensystem decomposition of the covariance of Y. That is,

    Σ = E[(Y - E[Y])(Y - E[Y])^T] = U D U^T          [1.3]

with U the orthonormal matrix of eigenvectors of Σ, and D the diagonal matrix of associated eigenvalues. The axes in the sphered space are the principal component directions of Y. The previous optimization problem [1.1] involving linear combinations of Y can be translated to one involving linear combinations of Z:

    max over α1, α2 of  G(α1^T Z, α2^T Z)
    subject to  α1^T α1 = α2^T α2 = 1  and  α1^T α2 = 0.          [1.4]

The standardization imposed excludes covariance structure, which the technique is not concerned with, and reduces the computational work required since the constraints are now geometric conditions (Friedman 1987). Thus, all numerical calculations are actually performed on the sphered data Z. In subsequent notation, the parameters of a projection index G are (β1^T Y, β2^T Y), though the value of the index is calculated for the sphered data [1.2]. The sphering, however, is merely a computational convenience and is invisible to the statistician. She associates the value of the index with the visual projection of the original data space. After the maximizing sphered data plane is found, the vectors α1 and α2 are translated to the unsphered original variable space via

    β1 = U D^{-1/2} α1,    β2 = U D^{-1/2} α2.          [1.5]

Since only the directions of the vectors matter, these combinations are usually normed as a final step. The variance and correlation constraints on β1 and β2 can be written as

    β1^T Σ β1 = β2^T Σ β2 = 1,    β1^T Σ β2 = 0.          [1.6]

Thus, β1 and β2 are orthogonal in the covariance metric while α1 and α2 are orthogonal geometrically.

The optimization method used to solve the maximization problem [1.4] is a coarse search followed by an application of steepest ascent, a derivative-based optimization procedure. The initial survey of the index space via a coarse stepping approach helps the algorithm avoid deception by a small local maximum. The second stage employs the derivatives of the index to make an accurate search in the vicinity of a good starting point.

The numerical optimization procedure requires that the projection index possess certain computational properties. The index should be fast and stable to compute, and must be differentiable. These criteria surface again with respect to the interpretability index defined in Section 1.3.

1.1.3 The Legendre Projection Index

Friedman's (1987) Legendre index exhibits these properties. He begins by transforming the sphered projections to a square with the definition

    R1 ≡ 2Φ(α1^T Z) - 1,    R2 ≡ 2Φ(α2^T Z) - 1

where Φ is the cumulative distribution function of a standard normal random variable. Under the null hypothesis that the projections are normal and uninteresting, the density p(R1, R2) is uniform on the square [-1, 1] × [-1, 1]. As a measure of nonnormality, he takes the integral of the squared distance of the density from the uniform

    G_L(β1^T Y, β2^T Y) ≡ ∫∫ [p(R1, R2) - 1/4]^2 dR1 dR2          [1.7]

where the integrals run over [-1, 1]. He expands the density p(R1, R2) using Legendre polynomials, which are orthogonal on the square with respect to a uniform weight function. This action, along with subsequent integration taking advantage of the orthogonality relationships, yields an infinite sum involving polynomials in R1 and R2, which are functions of the random variable Y. In application, the expansion is truncated at a fixed number of terms, and sample moments replace theoretical moments. Thus, instead of E[f(Y)] for a function f, the sample mean over the n observations y1, y2, ..., yn is calculated.
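The uniformity of (R1, R2) under the null hypothesis is easy to confirm by simulation. The following quick check (an illustration written for this presentation, not Friedman's code) applies the square transform to a standard normal projection and verifies that each quadrant of the square holds about a quarter of the points.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    Z = rng.standard_normal((2000, 2))         # a sphered null projection
    R1 = 2 * norm.cdf(Z[:, 0]) - 1             # marginals mapped onto [-1, 1]
    R2 = 2 * norm.cdf(Z[:, 1]) - 1

    # Under the null, (R1, R2) is uniform on the square.
    counts, _, _ = np.histogram2d(R1, R2, bins=2, range=[[-1, 1], [-1, 1]])
    print(counts / len(Z))                     # all four entries near 0.25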
This index has been shown to find structure in the center of the distribution rather than in the tails. Thus, it identifies clustering rather than heavy-tailed densities. G_L also has the property of 'affine invariance' (Huber 1985), which means that it is invariant under scale changes and location shifts of Y. The index has this characteristic as it is based on the sphered variables Z. Since exploratory projection pursuit should not be drawn in by covariance structure, this property is desirable for any projection index. The Legendre index is also stable and fast to compute. Research continues in the area of projection indices. Other indices use different sets of polynomials to take advantage of different weight functions, or use alternate methods of measuring distance from normality, but so far alternatives have proved less computationally feasible, as discussed in Section 1.4.2.

In the following section, an example of the original algorithm using G_L is discussed in order to provide the motivation for the proposed modification. After considering how to modify the method in order to make the results more interpretable, it becomes clear that the projection index needs a new theoretical property which G_L does not have. Consequently, a new index is defined in Section 1.4.2.

1.1.4 The Automobile Example

The automobile dataset (Friedman 1987) consists of ten variables collected on 392 car models and reported in 'Consumer Reports' from 1972 to 1982:

    Y1  : gallons per mile (fuel inefficiency)
    Y2  : number of cylinders in engine
    Y3  : size of engine (cubic inches)
    Y4  : engine power (horsepower)
    Y5  : automobile weight
    Y6  : acceleration (time from 0 to 60 mph)
    Y7  : model year
    Y8  : American (0, 1)
    Y9  : European (0, 1)
    Y10 : Japanese (0, 1)

The second variable has 5 possible values while the last three are binary, indicating the car's country of origin. As Friedman suggests, these variables are gaussianized, which means the discrete values are replaced by normal scores after any repeated observations are randomly ordered. The object is to ensure that their discrete marginals do not overly bias the search for structure.
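The gaussianizing step can be sketched as follows. The plotting-position constant used here (Blom's 0.375) is an assumption made for illustration, since the exact normal-score convention is not specified above; ties are broken by a random secondary sort key, matching the random ordering of repeated observations.

    import numpy as np
    from scipy.stats import norm

    def gaussianize(x, seed=0):
        """Replace discrete values by normal scores, breaking ties randomly
        so a discrete marginal does not bias the structure search."""
        rng = np.random.default_rng(seed)
        n = len(x)
        # Sort primarily by value, with random tie-breaking, then rank 1..n.
        order = np.lexsort((rng.random(n), x))
        ranks = np.empty(n)
        ranks[order] = np.arange(1, n + 1)
        # Normal scores via an approximate plotting position (Blom's constant,
        # an illustrative choice).
        return norm.ppf((ranks - 0.375) / (n + 0.25))

    cylinders = np.array([4, 4, 6, 8, 4, 6, 8, 8, 4, 4])
    print(np.round(gaussianize(cylinders), 2))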
All the variables are standardized to have zero mean and unit variance before the analysis. In addition, the linear combinations which define the maximizing plane (β1, β2) are normed to length one. As a result, the coefficient for a variable in a combination represents its relative importance. The definition of the solution plane is

    β1 = (-0.21, -0.03, -0.91, 0.16, 0.30, -0.05, -0.01, 0.03, 0.00, -0.02)^T
    β2 = (-0.75, -0.13, 0.43, 0.45, -0.07, 0.04, -0.15, -0.03, 0.02, -0.01)^T.

That is, the horizontal coordinate of each observation is -0.21 times fuel inefficiency (gaussianized and standardized), -0.03 times the number of cylinders (gaussianized and standardized), and so on. The scatterplot of the points is shown in Figure 1.1.

[Fig. 1.1: Most structured projection scatterplot of the automobile data according to the Legendre index.]

In the projection scatterplot, the combinations (β1, β2) are the orthogonal horizontal and vertical axes, and each observation is plotted as (β1^T Y, β2^T Y). These vectors are orthogonal in the covariance metric due to the constraint [1.6]. However, in the usual reference system, the orthogonal axes correspond to the variables. For example, the x axis is Y1, the y axis is Y2, the z axis is Y3, and so on. The combinations are not orthogonal in this reference frame. This departure from common graphic convention is discussed further in Section 1.4.3.

The first step is to look at the structure of the projected points. In this case, the points are clustered along the vertical axis at low values and straggle out to the upper right corner with a few outliers to the left. The obvious concern is whether this structure actually exists in the data or is due to sampling fluctuation. The value of the index G_L for this particular view is 0.35, and the question is whether this value is significant. Friedman (1987) approximates the answer to this question by generating values of the index for the same number of observations and dimensions (n and p) under the null hypothesis of normality. Comparison of the observed value with these generated ones gives an idea of how unusual the former is. Sun (1989) delves deeper into the problem, providing an analytical approximation for a critical value given the data size and chosen significance level. The structure found in this example is significant.

Given the clustering exhibited along the vertical axis, the second step is to attempt to interpret the linear combinations which define the projection plane. Though by projecting the data from ten to two dimensions exploratory projection pursuit has reduced the visual dimension of the problem, the linear projections must still be interpreted in the original number of variables. The structure is represented by a set of points in two dimensions, but understanding what these points represent in terms of the original data requires considering all ten variables.

An infinite number of pairs of vectors (β1, β2) exist which satisfy the constraints [1.6] and define the most structured plane. In other words, the two vectors can be spun rigidly around the origin in the plane via an orthogonal rotation, and they still satisfy the constraints and maintain the structure found. The orientation of the scatterplot is inconsequential to its visual representation of the structure. These facts lead to the principle that a plane should be defined in the simplest or most interpretable way possible. Given that the data is standardized and the linear combinations have length one, the coefficients represent the individual contribution of each variable to the combination. Friedman (1987) attempts to find the simplest representation of the plane by spinning the vectors until the variance of the squared coefficients of the vertical combination β2 is maximized. This action forces the coefficients of the second combination to differ as much as possible. As a consequence, variables are forced out of the combination and the vector has fewer variables in it. Such a parsimonious vector is easier to understand, compare, remember and explain.
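Friedman's within-plane simplification amounts to a one parameter search. The sketch below is an illustrative reimplementation using a simple grid over the rotation angle (not necessarily the original optimizer): it spins the reported pair rigidly and keeps the orientation whose second vector has the most uneven normed squared coefficients.

    import numpy as np

    def spin_for_simplicity(b1, b2, n_grid=360):
        """Rigidly rotate the pair (b1, b2) in its plane, returning the
        orientation whose second vector maximizes the variance of the
        normed squared coefficients, as in Friedman (1987)."""
        best, best_val = (b1, b2), -np.inf
        for theta in np.linspace(0.0, np.pi, n_grid, endpoint=False):
            c, s = np.cos(theta), np.sin(theta)
            e1, e2 = c * b1 + s * b2, -s * b1 + c * b2
            u = e2 ** 2 / (e2 @ e2)          # normed squared coefficients
            if np.var(u) > best_val:
                best, best_val = (e1, e2), np.var(u)
        return best

    b1 = np.array([-0.21, -0.03, -0.91, 0.16, 0.30, -0.05, -0.01, 0.03, 0.00, -0.02])
    b2 = np.array([-0.75, -0.13, 0.43, 0.45, -0.07, 0.04, -0.15, -0.03, 0.02, -0.01])
    s1, s2 = spin_for_simplicity(b1, b2)
    print(np.round(s2, 2))    # a sparser-looking second combination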
The goal of the present work is to expand this approach. A precise definition of interpretability is discussed in Section 1.3. Criteria which involve both combinations in the plane are considered. More importantly, not only is rotation of the vectors in the plane allowed, but the solution plane may also be rocked in p dimensions slightly away from the most structured plane. The resulting loss in structure is acceptable only if the gain in interpretability is deemed sufficient.

1.2 The Interpretable Exploratory Projection Pursuit Approach

As shown in the preceding example, exploratory projection pursuit produces the scatterplot (β1^T Y, β2^T Y), the value of the index G_L(β1^T Y, β2^T Y), and the linear combinations (β1, β2). The scatterplot's nonlinear structure may be visually assessed, but attaching much meaning to the actual numerical value of the index is difficult. As remarked earlier, the linear combinations must still be interpreted in terms of the original number of variables. In fact, a mental attempt to understand these vectors may involve 'rounding them by eye' and dropping variables which have 'too small' coefficients.

The question to be considered is whether the linear combinations can be made more interpretable without losing too much of the observed structure found in the scatterplot. The idea of the present modification is to trade some of the structure in return for more comprehensibility or parsimony in (β1, β2).

1.2.1 A Combinatorial Strategy

If initially interpretability is linked with parsimony, variable selection in the projection might be considered. The analogous approach in linear regression is all subsets regression. A similar idea, principal variables, has been applied to principal components (Krzanowski 1987, McCabe 1984). This method restricts the linear transformation matrix to a specific form, for example each row has two ones and all other entries zero. The result is a variable selection method with each component formed from two of the original variables. Both variable selection methods, all subsets and principal variables, are discrete in nature and only consider whether a variable is in or out of the model.

Applying a combinatorial strategy to exploratory projection pursuit results in the following number of solutions, each of which consists of a pair of variable subsets. Given p variables and the fact that each variable is either in or out of a combination, 2^p possible single subsets of variables exist. The combinations are symmetric, meaning that the (β1, β2) plane is the same as the (β2, β1) plane. Thus, the number of pairs of subsets with unequal members is the binomial coefficient (2^p choose 2). However, this count does not include the 2^p pairs with both subsets identical, which are permissible solutions. It does include 2^p + p degenerate pairs in which one subset is empty, or both subsets consist of the same single variable. These degenerate pairs do not define planes. The total number of solutions is

    (2^p choose 2) + 2^p - 2^p - p = 2^{2p-1} - 2^{p-1} - p.          [1.8]

Some planes counted may not be permissible due to the correlation constraint, as discussed in Section 1.4.3, so the actual count may be slightly less. However, the principal term remains of the order of 2^{2p-1}.

The number of subsets grows exponentially with p, and for each pair of subsets the optimization must be completely redone. Unlike all subsets regression, no smart search methods exist for eliminating poor contenders which do not produce interesting projections. Due to the time required to produce a single exploratory projection pursuit solution, this combinatorial approach is not feasible.
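The count [1.8] can be verified by brute-force enumeration for small p. The snippet below (a quick check written for this presentation) represents subsets as bitmasks, excludes the empty subset, drops identical single-variable pairs, and agrees with 2^{2p-1} - 2^{p-1} - p.

    from itertools import combinations_with_replacement

    def n_plane_subset_pairs(p):
        """Count unordered pairs of nonempty variable subsets that can
        define a plane (identical pairs allowed unless both are the same
        single variable)."""
        subsets = range(1, 2 ** p)            # nonempty subsets as bitmasks
        count = 0
        for a, b in combinations_with_replacement(subsets, 2):
            if a == b and bin(a).count("1") == 1:
                continue                      # same single variable: a line, not a plane
            count += 1
        return count

    for p in range(2, 8):
        print(p, n_plane_subset_pairs(p), 2 ** (2 * p - 1) - 2 ** (p - 1) - p)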
1.2.2 A Numerical Optimization Strategy

Given that a numerical optimization is already being done, an alternative approach is to consider whether a variable selection or interpretability criterion can be included in the optimization. The objective function used is

    W(β1, β2) ≡ (1 - λ) G(β1^T Y, β2^T Y) / max G + λ S(β1, β2)          [1.9]

for λ ∈ [0, 1]. This function is the weighted sum of a projection index G contribution and an interpretability index S contribution, indexed by the interpretability or simplicity parameter λ. The value of the projection index is divided by its maximum possible value in order that its contribution have values ∈ [0, 1]. Analogously, the interpretability index S is defined to have values ∈ [0, 1].

The interpretable exploratory projection pursuit algorithm is applied in an iterative manner. First, find the best plane using the original exploratory projection pursuit algorithm, and set max G equal to the index value for this most structured plane. Then, for a succession of values (λ1, λ2, ..., λl), such as (0.1, 0.2, ..., 1.0), solve [1.9] with λ = λi. In each case, use the previous λ_{i-1} solution as a starting point.

One way to envision the procedure is to imagine beginning at the most structured solution and being equipped with an interpretability dial. As the dial is turned, the value of λ increases as the weight on simplicity, or the cost of complexity, is increased. The plane rocks smoothly away from the initial solution. If the projection index is relatively flat near the maximum (Friedman 1987), the loss in structure is gradual. When to stop turning the dial is discussed in Section 1.5.3.

The additive functional form choice for the objective function [1.9] is made by drawing a parallel with roughness penalty methods for curve-fitting (Silverman 1984). In that context, the problem is a minimization. The first term is a goodness-of-fit criterion, such as the squared distance between the observed and fitted values, and the second is a measure of the roughness of the curve, such as the integrated square of the curve's second derivative. As the fit improves, the curve becomes more rough. The negative of roughness, or smoothness, is comparable to interpretability.

An alternate idea is to think of solving a series of constrained maximization subproblems

    max G(β1^T Y, β2^T Y)  subject to  S(β1, β2) ≥ c_i          [1.10]

for values of c_i such as (0.0, 0.1, ..., 1.0). This problem may be rewritten as an unconstrained maximization using the method of Lagrangian multipliers. Reinsch (1967) describes the relationship between [1.9] and [1.10], and consequently the relationship between c_i and λ_i.

The computational savings for this numerical method versus the combinatorial one can be substantial. The inner loop, namely finding an exploratory projection pursuit solution, is the same for either. The outer loop, however, is reduced from the count [1.8] to l, and the amount of work required by the functional approach [1.9] is linear in l.
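The dial procedure can be sketched schematically. In the following outline, the callables objective_G and S, the precomputed maximum maxG, the packing of (β1, β2) into a single parameter vector x, and the use of a generic unconstrained optimizer are all stand-ins; constraint handling and the coarse search are omitted, so this is a sketch of the continuation strategy rather than the thesis' implementation.

    import numpy as np
    from scipy.optimize import minimize

    def interpretable_path(objective_G, S, maxG, x0,
                           lambdas=np.arange(0.1, 1.01, 0.1)):
        """Solve [1.9] for a succession of lambda values, warm-starting each
        optimization at the previous solution (the 'interpretability dial').
        x packs the pair (beta1, beta2) in the caller's parameterization."""
        path, x = [], np.asarray(x0, dtype=float)
        for lam in lambdas:
            def negW(x, lam=lam):
                # Negative of [1.9], since the library optimizer minimizes.
                return -((1 - lam) * objective_G(x) / maxG + lam * S(x))
            x = minimize(negW, x, method="BFGS").x
            path.append((lam, x.copy()))
        return path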
1.3 The Interpretability Index

The interpretability index S measures the simplicity of the pair of vectors (β1, β2). It has a minimum value of zero at the least simple pair and a maximum value of one at the most simple. Like the projection index G, it needs to be differentiable and fast to compute.

1.3.1 Factor Analysis Background

For two dimensional exploratory projection pursuit, the object is to simplify a 2 × p matrix. Consider a general q × p matrix Ω with entries ω_ij, corresponding to q dimensional exploratory projection pursuit. What characteristics does an interpretable matrix have? When is one matrix more simple than another? Researchers have considered such questions with respect to factor loading matrices. The goal in factor analysis is to explain the correlation structure in the variables via the factor model. The solution is not unique, and the factor matrix often is rotated to make it more interpretable. Comments are made in Section 1.4.4 regarding the difference between this rotation and the interpretable rocking of the most structured plane. Though the two situations are different, the philosophical goal is the same, and so factor analysis rotation research is used as a starting point in the development of a simplicity index S.

Intuitively, the interpretability of a matrix may be thought of in two ways. 'Local' interpretability measures how simple a combination, or row, is individually. In a general and discrete sense, the more zeros a vector has, the more interpretable it is, as fewer variables are involved. 'Global' interpretability measures how simple a collection of vectors is. Given that the vectors are defining a plane and should not collapse on each other, a simple set of vectors is one in which each row clearly contains its own small set of variables and has zeros elsewhere.

Thurstone (1935) advocated 'simple structure' in the factor loading matrix, defined by a list of desirable properties which were discrete in nature. For example, each combination (row) should have at least one zero, and for each pair of rows, only a few columns should have nonzero entries in both rows. In summary, his requirements correspond to a matrix which involves, or has nonzero entries for, only a subset of variables (columns). Combinations should not overlap too much, or have nonzero coefficients for the same variables, and those that do should clearly divide into subgroups.

These discrete notions of interpretability must be translated into a continuous measure which is tractable for computer optimization. Local simplicity for a single vector is discussed first, and the results are extended to a set of two vectors.

1.3.2 The Varimax Index For a Single Vector

Consider a single vector ω = (ω1, ω2, ..., ωp)^T. In a discrete sense, interpretability translates into as few variables in the combination, or as many zero entries, as possible. The goal is to smooth the discrete count interpretability index

    D(ω) ≡ Σ_{i=1}^p I{ω_i = 0}          [1.11]

where I{·} is the indicator function.

Since the exploratory projection pursuit linear combinations represent directions and are usually normed as a final step, the index should involve the normed coefficients. In addition, in order to smooth the notion that a variable is in or out of the vector, the general unevenness of the coefficients should be measured. The sign of the coefficients is inconsequential. In conclusion, the index should measure the relative mass of the coefficients irrespective of sign, and thus should involve the normed squared quantities

    ω_i^2 / (ω^T ω),    i = 1, ..., p.

In the 1950s, several factor analysis researchers arrived separately at the same criterion (Gorsuch 1983, Harman 1976), which is known as 'varimax' and is the variance of the normed squared coefficients. This is the criterion which Friedman (1987) used, as discussed in Section 1.1.4. The corresponding interpretability index is denoted by S_v and is defined as

    S_v(ω) ≡ (p/(p-1)) Σ_{i=1}^p ( ω_i^2/(ω^T ω) - 1/p )^2.          [1.12]

The leading constant is added to make the index value be ∈ [0, 1].
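In code, the varimax index [1.12] is short. The sketch below (an illustration, not the thesis' implementation) also evaluates its extreme values: one at a unit axis vector, zero at an equally weighted average.

    import numpy as np

    def S_varimax(w):
        """Varimax interpretability index [1.12]: the scaled variance of the
        normed squared coefficients, lying in [0, 1]."""
        p = len(w)
        u = w ** 2 / (w @ w)
        return p / (p - 1) * np.sum((u - 1.0 / p) ** 2)

    p = 5
    print(S_varimax(np.eye(p)[0]))                     # 1.0: one variable only
    print(S_varimax(np.ones(p)))                       # 0.0: equally weighted
    print(S_varimax(np.array([3., 1., 0., 0., 0.])))   # intermediate value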
Fig. 1.2 shows the value of the varimax index for a linear combination ω in two dimensions (p = 2) versus the angle of the direction arctan(ω2/ω1) of the linear combination ω.

[Fig. 1.2: Varimax interpretability index for q = 1, p = 2. The value of the index for a linear combination ω in two dimensions is plotted versus the angle of the direction in radians over the range [0, π].]

Fig. 1.3 shows the varimax index for vectors ω of length one in three dimensions (p = 3). Only vectors with all positive components are plotted due to symmetry. Fig. 1.3 shows the value of the index as the vertical coordinate versus the values of (ω1, ω2). The value of ω3 is then determined and does not need to be graphed.

[Fig. 1.3: Varimax interpretability index for q = 1, p = 3. The surface of the index is plotted as the vertical coordinate versus the first two coordinates (ω1, ω2) of vectors of length one in the first quadrant.]

Fig. 1.4 shows contours for the surface in Fig. 1.3. These contours are just the curves of points ω which satisfy the equation formed when the left side of [1.12] is set equal to a constant value. As the value of the interpretability index is increased, the contours move away from (1/√3, 1/√3, 1/√3) toward the three points e1 = (1, 0, 0), e2 = (0, 1, 0) and e3 = (0, 0, 1). The centerpoint is the contour for S_v(ω) = 0.0, the next three joined curves are contours for S_v(ω) = 0.01, 0.05, 0.2, and the next three sets of lines going outward toward the e_i's are contours corresponding to S_v(ω) = 0.3, 0.6, 0.8.

[Fig. 1.4: Varimax interpretability index contours for q = 1, p = 3. The axes are the components (ω1, ω2, ω3). The contours, from the center point outward, are those points which have varimax values S_v(ω) = 0.0, 0.01, 0.05, 0.2, 0.3, 0.6, 0.8.]

Since ω is normed, the varimax criterion S_v is equivalent to the 'quartimax' criterion, which derives its name from the fact that it involves the sum of the fourth powers of the coefficients. S_v is also the squared coefficient of variation of the squared vector components.

1.3.3 The Entropy Index For a Single Vector

The vector of normed squared coefficients sums to one and all entries are nonnegative, so it is similar to a multinomial set of probabilities. The negative entropy of a probability vector measures how nonuniform the distribution is (Renyi 1961). The more simple a vector is, the more uneven or distinguishable from one another its entries are. Thus a second possible interpretability index is the negative entropy of the normed squared coefficients, or

    S_e(ω) ≡ 1 + (1/ln p) Σ_{i=1}^p ( ω_i^2/(ω^T ω) ) ln( ω_i^2/(ω^T ω) ).

The usual entropy measure is slightly altered to have values ∈ [0, 1].
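A sketch of the entropy index (illustration only); zero coefficients are dropped from the sum since u ln u tends to 0 as u tends to 0.

    import numpy as np

    def S_entropy(w):
        """Entropy interpretability index: one plus the scaled negative
        entropy of the normed squared coefficients, lying in [0, 1]."""
        p = len(w)
        u = w ** 2 / (w @ w)
        u = u[u > 0]                     # u * ln(u) -> 0 as u -> 0
        return 1.0 + np.sum(u * np.log(u)) / np.log(p)

    print(S_entropy(np.array([1., 0., 0., 0.])))   # 1.0: maximally simple
    print(S_entropy(np.ones(4)))                   # 0.0: equally weighted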
The two measures S_v and S_e share four common properties.

Property 1. Both are maximized when

    ω = ±e_j,    j = 1, ..., p

where the e_j, j = 1, ..., p, are the unit axis vectors. Thus, the maximum value occurs when only one variable is in the combination.

Property 2. Both are minimized when

    ω_i^2/(ω^T ω) = 1/p,    i = 1, ..., p

or when the projection is an equally weighted average. The argument could be made that an equally weighted average is in fact simple. However, in terms of deciding which variable most clearly affects the projection, it is the most difficult to interpret.

Property 3. Both are symmetric in the coefficients ω_i. No variable counts more than any other.

Property 4. Both are strictly Schur-convex as defined below. The following explanation follows Marshall and Olkin (1979).

Definition. Let ζ = (ζ1, ζ2, ..., ζp) and γ = (γ1, γ2, ..., γp) be any two vectors ∈ R^p. Let ζ[1] ≥ ζ[2] ≥ ... ≥ ζ[p] and γ[1] ≥ γ[2] ≥ ... ≥ γ[p] denote their components in decreasing order. The vector ζ majorizes γ (ζ ≻ γ) if

    Σ_{i=1}^k ζ[i] ≥ Σ_{i=1}^k γ[i],  k = 1, ..., p-1,  and  Σ_{i=1}^p ζ[i] = Σ_{i=1}^p γ[i].

The above definition holds if and only if γ = ζP, where P is a doubly stochastic matrix, that is, P has nonnegative entries and column and row sums of one. In other words, if γ is a smoothed or averaged version of ζ, it is majorized by ζ. An example of a set of majorizing vectors is

    (1, 0, ..., 0) ≻ (1/2, 1/2, 0, ..., 0) ≻ ... ≻ (1/(p-1), ..., 1/(p-1), 0) ≻ (1/p, ..., 1/p).

Definition. A function f : R^p → R is strictly Schur-convex if ζ ≻ γ implies f(ζ) ≥ f(γ), with strict inequality if γ is not a permutation of ζ.

This type of convexity is an extension of the usual idea of Jensen's Inequality. Basically, if a vector ζ is more spread out or uneven than γ, then S(ζ) > S(γ). This intuitive idea of interpretability now has an explicit mathematical meaning. The two indices S_v and S_e rank all majorizable vectors in the same order. Using the theory of Schur-convexity, a general class of simplicity indices could be defined.

1.3.4 The Distance Index For a Single Vector

Besides the variance interpretation, S_v measures the squared distance from the normed squared vector to the point (1/p, 1/p, ..., 1/p), which might thus be called the least simple or most complex point. Let the notation for the Euclidean norm of any vector δ be

    ||δ|| ≡ ( Σ_{i=1}^p δ_i^2 )^{1/2}.

If u_ω is defined to be the squared and normed version of ω,

    u_ω ≡ ( ω1^2/(ω^T ω), ..., ωp^2/(ω^T ω) )^T,

and u_c ≡ (1/p, ..., 1/p)^T is the most complex point, then

    S_v(ω) = (p/(p-1)) || u_ω - u_c ||^2.          [1.13]

This index can be generalized. In contrast to having one most complex point, an alternate index can be defined by considering a set V = {υ1, ..., υJ} of J simple points. This set must have the properties

    υ_ji ≥ 0,  i = 1, ..., p,  j = 1, ..., J
    Σ_{i=1}^p υ_ji = 1,  j = 1, ..., J.          [1.14]

The υ_j's are the squares of vectors on the unit sphere ∈ R^p, as are u_ω and u_c. An example is V = {e_j : j = 1, ..., p}, with J = p. In the event that this index is used with exploratory projection pursuit, the statistician could define her own set of simple points rather than be restricted to this choice.

The distance from u_ω to the nearest simple point would be large when ω is not simple, so the interpretability index should involve the negative of it. An example is

    S_d(ω) ≡ 1 - c min_{j=1,...,J} || u_ω - υ_j ||^2.          [1.15]

The constant c is calculated so that the values are ∈ [0, 1]. Any distance norm can be used, and an average or total distance could replace the minimum. If V is chosen to be the e_j's and the minimum Euclidean norm is used, the distance index becomes

    S_d(ω) = 1 - (p/(p-1)) [ Σ_{i=1}^p ( ω_i^2/(ω^T ω) )^2 - 2 ω_k^2/(ω^T ω) + 1 ]

where k corresponds to the maximum |ω_i|, i = 1, ..., p. The minimization does not need to be done at each step, though the largest absolute coefficient in the vector must be found. Analogous results are obtainable for similar choices of V, such as all permutations of two 1/2 entries and p-2 zeros ((1/2, 1/2, 0, ..., 0), (1/2, 0, 1/2, 0, ..., 0), ...) corresponding to simple solutions of two variables each.
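With V chosen as the unit axis vectors, the distance index needs only the largest absolute coefficient. A sketch, assuming the Euclidean norm and the same p/(p-1) scaling used for the varimax index:

    import numpy as np

    def S_distance(w):
        """Distance interpretability index [1.15] with simple points e_j:
        one minus the scaled squared distance from the normed squared
        coefficients to the nearest unit axis vector."""
        p = len(w)
        u = w ** 2 / (w @ w)
        k = np.argmax(u)                         # largest absolute coefficient
        d2 = np.sum(u ** 2) - 2 * u[k] + 1.0     # ||u - e_k||^2
        return 1.0 - p / (p - 1) * d2

    print(S_distance(np.array([0., 2., 0.])))    # 1.0 at a unit axis vector
    print(S_distance(np.ones(3)))                # 0.0 at the most complex point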
Since the e_j's maximize S_v and are the simple points associated with S_d, the relationship between the two indices proves interesting. Algebra reveals

    S_d(ω) = -S_v(ω) + (2/(p-1)) ( p ω_k^2/(ω^T ω) - 1 ).

The minimum of the second term occurs when

    ω_i^2/(ω^T ω) = 1/p

or all coefficients are equal, and then S_d(ω) = S_v(ω) = 0. The maximum occurs when

    ω_k^2/(ω^T ω) = 1

and S_d(ω) = S_v(ω) = 1. The relationship between the interim values varies. The difficulty with the distance index S_d is that its derivatives are not smooth and present problems when a derivative-based optimization procedure is used. Given that the entropy index S_e and the varimax index S_v share common properties, and the latter is easier to deal with computationally, the varimax approach is generalized to two dimensions.

1.3.5 The Varimax Index For Two Vectors

The varimax index S_v can be extended to measure the simplicity of a set of q vectors ω_j = (ω_j1, ω_j2, ..., ω_jp), j = 1, ..., q. In the following, the one vector varimax index [1.12] is called S_1. In order to force orthogonality between the squared normed vectors, the variance is taken across the vectors and summed over the variables to produce, up to norming,

    S_v(ω1, ..., ω_q) ∝ Σ_{i=1}^p (1/q) Σ_{j=1}^q ( ω_ji^2/(ω_j^T ω_j) - (1/q) Σ_{l=1}^q ω_li^2/(ω_l^T ω_l) )^2.          [1.16]

If the sums were reversed and the variance was taken across the variables and summed over the vectors, the unnormed index would equal

    S_1(ω1) + S_1(ω2) + ... + S_1(ω_q)          [1.17]

with each element the one dimensional simplicity [1.12] of the corresponding vector. The previous approach [1.16] instead results in a cross-product term. For two dimensional exploratory projection pursuit, q = 2 and the index, with appropriate norming, reduces to

    S_v(ω1, ω2) = (1/(2p)) [ (p-1) S_1(ω1) + (p-1) S_1(ω2) + 2 ( 1 - p u_{ω1}^T u_{ω2} ) ]          [1.18]

where u_ω denotes the normed squared version of ω. The first term measures the local simplicities of the vectors while the second is a cross-product term measuring the orthogonality of the two normed squared vectors. This cross-product forces the vectors to 'squared orthogonality', so that different groups of variables appear in each vector. S_v is maximized when the normed squared versions of ω1 and ω2 are e_k and e_l, k ≠ l, and minimized when both are equal to (1/p, 1/p, ..., 1/p).
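A sketch of the two vector index [1.18] (illustration only), reusing the single vector index S_1; the printed values confirm the maximizing and minimizing configurations noted above.

    import numpy as np

    def S1(w):
        """Single vector varimax index [1.12]."""
        p = len(w)
        u = w ** 2 / (w @ w)
        return p / (p - 1) * np.sum((u - 1.0 / p) ** 2)

    def S_varimax2(w1, w2):
        """Two vector varimax index [1.18]: local simplicities plus a
        cross-product term forcing 'squared orthogonality'."""
        p = len(w1)
        u1 = w1 ** 2 / (w1 @ w1)
        u2 = w2 ** 2 / (w2 @ w2)
        return ((p - 1) * S1(w1) + (p - 1) * S1(w2)
                + 2 * (1 - p * (u1 @ u2))) / (2 * p)

    e = np.eye(4)
    print(S_varimax2(e[0], e[1]))              # 1.0: two distinct variables
    print(S_varimax2(np.ones(4), np.ones(4)))  # 0.0: least simple pair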
1.4 The Algorithm

In this section, the algorithm used to solve the interpretable exploratory projection pursuit problem [1.9] with interpretability index S_v is discussed. The general approach of the original exploratory projection pursuit algorithm is followed, but two changes are required, as described in the first three subsections.

1.4.1 Rotational Invariance of the Projection Index

As discussed in Section 1.1.4, the orientation of a scatterplot is immaterial, and a particular observed structure should have the same value of the projection index G no matter from which direction it is viewed. An alternative way of stating this is that the projection index should be a function of the plane, not of the way the plane is represented (Jones 1983). The interpretable projection pursuit algorithm should always describe a plane in the simplest way possible as measured by S_v. Given these two facts, any projection index used by the algorithm should have the property of rotational invariance.

Definition. A projection index G is rotationally invariant if

    G(β1^T Y, β2^T Y) = G(η1^T Y, η2^T Y)

where (η1, η2) is (β1, β2) rotated through the angle θ, or

    (η1, η2)^T = Q (β1, β2)^T          [1.19]

with Q the orthogonal rotation matrix associated with the angle θ.

Rotational invariance should not be confused with affine invariance. As remarked in Section 1.1.3, the latter property is a welcome byproduct of sphering. Friedman's Legendre index G_L [1.7] is not rotationally invariant. This property is not required for his algorithm. As described in Section 1.1.4, he simplifies the axes after finding the most structured plane by maximizing the single vector varimax index S_1 for the second combination. He does not allow the solution to rock away from the most structured plane in order to increase interpretability.

Recall that the first step in calculating G_L is to transform the marginals of the projected sphered data to a square. Under the null hypothesis that the projection is N(0, I), the projected scatterplot looks like a disk and the orientation of the axes is immaterial. Intuitively, however, if the null hypothesis is not true, the placement of the axes affects the marginals and thus the index value. Empirically, this lack of rotational invariance is evident. In the automobile example from Section 1.1.4, the most structured plane was found to have G_L(β1^T Y, β2^T Y) = 0.35. The projection index values for the scatterplot as the axes are spun through a series of angles are shown in Table 1.1.

Table 1.1. Most structured Legendre plane index values for the automobile data. Values of the Legendre index for different orientations of the axes in the most structured plane are given.

    θ                   0     π/20  2π/20  3π/20  4π/20  5π/20  6π/20  7π/20  8π/20  9π/20  π/2
    G_L(β1^T Y, β2^T Y)  0.35  0.36  0.34   0.32   0.30   0.29   0.29   0.29   0.30   0.32   0.35

A new index, which seeks to maintain the computational properties of G_L and to find the same structure, is developed in the next subsection.

1.4.2 The Fourier Projection Index

The Legendre index G_L is based on the Cartesian coordinates of the projected sphered data (α1^T Z, α2^T Z). The index relies on knowing the distribution of these coordinates under the null hypothesis of normality. Polar coordinates are the natural alternative given that rotational invariance is desired, and the distribution of these coordinates is also known under the null hypothesis. The expansion of the density via orthogonal polynomials is done in a manner similar to that of G_L.

The polar coordinates of a projected sphered point (α1^T Z, α2^T Z) are

    R ≡ (α1^T Z)^2 + (α2^T Z)^2,    Θ ≡ arctan( (α2^T Z)/(α1^T Z) ).

Actually, the usual polar coordinate definition involves the radius of the point rather than its square, but the notation is easier given the above definition of R. Under the null hypothesis that the projection is N(0, I), R and Θ are independent and

    R ~ exponential with mean 2;    Θ ~ Unif[-π, π].

The proposed index is the integral of the squared distance between the density of (R, Θ) and the null hypothesis density, which is the product of the exponential and uniform densities,

    G(β1^T Y, β2^T Y) ≡ ∫_0^∞ ∫_{-π}^{π} [ p_{R,Θ}(u, v) - f_R(u) f_Θ(v) ]^2 dv du.          [1.20]

The density p_{R,Θ} is expanded as the tensor product of two sets of orthogonal polynomials chosen specifically for their weight functions. The weight functions must match the densities f_R and f_Θ in order to utilize the orthogonality relationships, and the polynomials chosen for the Θ portion of the expansion must result in rotational invariance. By definition, R is not affected by rotation. Friedman (1987) uses Legendre polynomials for both Cartesian coordinates as his two density functions are identical (Unif[-1, 1]), the Legendre weight function is Lebesgue measure, and his algorithm does not require rotational invariance. Hall (1989) considered Hermite polynomials and developed a one dimensional index. The following discussion combines aspects of the two authors' approaches. Throughout the discussion, i, j, and k are integers.
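The null distribution of (R, Θ) is easy to confirm by simulation (illustration only): the squared radius of a sphered standard normal projection behaves like an exponential with mean two, and the angle like a uniform on [-π, π].

    import numpy as np

    rng = np.random.default_rng(2)
    Z = rng.standard_normal((100000, 2))       # a sphered null projection
    R = Z[:, 0] ** 2 + Z[:, 1] ** 2            # squared radius
    Theta = np.arctan2(Z[:, 1], Z[:, 0])       # angle in [-pi, pi]

    print(R.mean(), R.var())          # exponential, mean 2: approx 2 and 4
    print(Theta.mean(), Theta.var())  # Unif[-pi, pi]: approx 0 and pi^2/3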
The set of polynomials for the R portion is the Laguerre polynomials, which are defined on the interval [0, ∞) with weight function W(u) = e^{-u}. The polynomials are

    L_0(u) = 1
    L_1(u) = u - 1
    L_i(u) = (u - 2i + 1) L_{i-1}(u) - (i-1)^2 L_{i-2}(u).          [1.21]

The associated Laguerre functions are defined as

    l_i(u) ≡ L_i(u) e^{-u/2}.

The orthogonality relationships between the polynomials are

    ∫_0^∞ l_i(u) l_j(u) du = δ_ij,    i, j ≥ 0

where δ_ij is the Kronecker delta function. Any piecewise smooth function f : R → R may be expanded in terms of these polynomials as

    f(u) = Σ_{i=0}^∞ a_i l_i(u)

where the a_i are the Laguerre coefficients

    a_i ≡ ∫_0^∞ L_i(u) e^{-u/2} f(u) du.

The smoothness property of f means the function has piecewise continuous first derivatives and the series converges pointwise. If a random variable W has density f, the coefficients can be written as

    a_i = E_f[ l_i(W) ].

The Θ portion of the index is expanded in terms of sines and cosines. Any piecewise smooth function f : R → R may be written as

    f(v) = a_0/2 + Σ_{k=1}^∞ [ a_k cos(kv) + b_k sin(kv) ]

with pointwise convergence. The orthogonality relationships between these trigonometric functions are

    ∫_{-π}^{π} cos^2(kv) dv = ∫_{-π}^{π} sin^2(kv) dv = π,    k ≥ 1
    ∫_{-π}^{π} cos(kv) sin(jv) dv = 0,    k ≥ 0 and j ≥ 1
    ∫_{-π}^{π} dv = 2π.

The Fourier coefficients are

    a_k ≡ (1/π) ∫_{-π}^{π} cos(kv) f(v) dv = (1/π) E_f[cos(kW)]
    b_k ≡ (1/π) ∫_{-π}^{π} sin(kv) f(v) dv = (1/π) E_f[sin(kW)].

The density p_{R,Θ} can be expanded via a tensor product of the two sets of polynomials as

    p_{R,Θ}(u, v) = Σ_{i=0}^∞ l_i(u) { a_{i0}/2 + Σ_{k=1}^∞ [ a_{ik} cos(kv) + b_{ik} sin(kv) ] }.

The a_{ik} and b_{ik} are the coefficients defined as

    a_{ik} ≡ (1/π) E_p[ l_i(R) cos(kΘ) ],    i, k ≥ 0
    b_{ik} ≡ (1/π) E_p[ l_i(R) sin(kΘ) ],    i ≥ 0 and k ≥ 1.

The null distribution, which is the product of the exponential density and the uniform density over [-π, π], is

    f_R(u) f_Θ(v) = ( (1/2) e^{-u/2} ) ( 1/(2π) ) = ( 1/(4π) ) l_0(u).

The index [1.20] becomes

    G(β1^T Y, β2^T Y) = ∫_0^∞ ∫_{-π}^{π} [ p_{R,Θ}(u, v) - (1/(4π)) l_0(u) ]^2 dv du.

The further condition that p_{R,Θ} is square-integrable, and subsequent multiplication, integration, and use of the orthogonality relationships, show that the index G(β1^T Y, β2^T Y) equals

    (π/2) Σ_{i=0}^∞ a_{i0}^2 + π Σ_{i=0}^∞ Σ_{k=1}^∞ ( a_{ik}^2 + b_{ik}^2 ) - a_{00}/2 + 1/(8π).          [1.22]

Maximizing [1.22] is equivalent to maximizing the Fourier index defined as

    G_F(β1^T Y, β2^T Y) ≡ π G(β1^T Y, β2^T Y) - 1/8.

The definitions of the coefficients in G_F yield

    G_F(β1^T Y, β2^T Y) = (1/2) Σ_{i=0}^∞ E_p^2[l_i(R)]
        + Σ_{i=0}^∞ Σ_{k=1}^∞ ( E_p^2[l_i(R) cos(kΘ)] + E_p^2[l_i(R) sin(kΘ)] )
        - (1/2) E_p[l_0(R)].          [1.23]

This index closely resembles the form of G_L in Friedman (1987). The extra i = 0 term in the first sum and the last subtracted term appear since the weight function is the exponential instead of Lebesgue measure, as it is for Legendre polynomials. In application, each sum is truncated at the same fixed value and the expected values are approximated by the sample moments taken over the data points. For example, E_p[l_i(R) cos(kΘ)] is approximated by

    (1/n) Σ_{j=1}^n l_i(r_j) cos(kθ_j)

where r_j and θ_j are the radius squared and angle for the projected jth observation.
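A sample version of the truncated index [1.23] can be sketched as follows. The truncation level I = K = 4 is an arbitrary illustrative choice, and scipy's standard Laguerre polynomials (orthonormal under the weight e^{-u}) are used in place of the recurrence [1.21], whose normalization differs; under the null the value should be near -1/8, since G_F = πG - 1/8 with G = 0.

    import numpy as np
    from scipy.special import eval_laguerre

    def fourier_index(x, y, I=4, K=4):
        """Sample truncated Fourier index [1.23] for a sphered projection
        (x, y); expectations are replaced by sample means."""
        r = x ** 2 + y ** 2                     # squared radius
        theta = np.arctan2(y, x)                # angle in [-pi, pi]
        # Associated Laguerre functions l_i(r) = L_i(r) exp(-r/2), using
        # scipy's standard (orthonormal) Laguerre polynomials.
        l = np.array([eval_laguerre(i, r) * np.exp(-r / 2) for i in range(I + 1)])
        G = 0.5 * np.sum(l.mean(axis=1) ** 2)   # first sum in [1.23]
        for i in range(I + 1):
            for k in range(1, K + 1):
                G += np.mean(l[i] * np.cos(k * theta)) ** 2
                G += np.mean(l[i] * np.sin(k * theta)) ** 2
        return G - 0.5 * l[0].mean()            # last term: -E[l_0(R)]/2

    rng = np.random.default_rng(3)
    print(fourier_index(*rng.standard_normal((2, 5000))))  # near -1/8 under the null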
The Fourier index is rotationally invariant. Suppose the projected points are spun by an angle of \tau. The radius squared R is unaffected by the shift, so the first sum and final term in [1.23] do not change. The sine and cosine of the new angle are

    \cos(\Theta + \tau) = \cos\Theta \cos\tau - \sin\Theta \sin\tau
    \sin(\Theta + \tau) = \sin\Theta \cos\tau + \cos\Theta \sin\tau .

Each component in the second term of [1.23] is

    E_p^2[\ell_i(R) \cos(k(\Theta + \tau))] + E_p^2[\ell_i(R) \sin(k(\Theta + \tau))]
      = \cos^2(k\tau) E_p^2[\ell_i(R) \cos(k\Theta)] + \sin^2(k\tau) E_p^2[\ell_i(R) \sin(k\Theta)]
        - 2 \sin(k\tau) \cos(k\tau) E_p[\ell_i(R) \cos(k\Theta)] \, E_p[\ell_i(R) \sin(k\Theta)]
        + \sin^2(k\tau) E_p^2[\ell_i(R) \cos(k\Theta)] + \cos^2(k\tau) E_p^2[\ell_i(R) \sin(k\Theta)]
        + 2 \sin(k\tau) \cos(k\tau) E_p[\ell_i(R) \cos(k\Theta)] \, E_p[\ell_i(R) \sin(k\Theta)]    [1.24]
      = E_p^2[\ell_i(R) \cos(k\Theta)] + E_p^2[\ell_i(R) \sin(k\Theta)]

and the index value is not affected by the rotation. The truncated version of the index also has this property as it is true for each component. Moreover, replacing the expected value by the sample mean does not affect [1.24], so the sample truncated version of the index is rotationally invariant.

Hall (1989) proposes a one dimensional Hermite projection index. The Hermite weight function of e^{-u^2/2} helps bound his index for heavy-tailed densities. In addition, he addresses the question of how many terms to include in the truncated version of the index. Similarly, the asymptotic behavior of G_L is being investigated at present. The Fourier index consists of the bounded sine and cosine Fourier contribution, and of the Laguerre function portion, which is weighted by e^{-u/2}. The class of densities for which this index is finite must be determined, as well as the number of terms needed in the sample version.

The Fourier G_F and Legendre G_L indices are based on comparing the density of the projection with the null hypothesis density. Jones and Sibson (1987) define a rotationally invariant index based on comparing cumulants. Their index tends to equate structure with outliers, while the density indices G_F and G_L tend to find clusters. A possible future index could be based on Radon transforms (Donoho and Johnstone 1989).

1.4.3 Projection Axes Restriction

The algorithm searches through the possible planes with the weighted objective function [1.9] as a criterion. Every plane has a single projection index value associated with it since G_F is rotationally invariant. Ideally, each plane would also have a single possible representation: the most interpretable one as measured by S_v. Unfortunately, the optimal representation of a given plane cannot be solved for analytically. However, as the weight on simplicity (\lambda) is increased, the algorithm tends to represent each plane most simply.

In order to help the algorithm find the most interpretable representation of a plane, the constraints on the linear combinations (\beta_1, \beta_2) must be changed. Recall that in the original algorithm, the correlation constraint [1.6] is imposed on the linear combinations (\beta_1, \beta_2) which define the solution plane. This constraint translates into an orthogonality constraint for the linear combinations (\alpha_1, \alpha_2) which define the solution plane in the sphered data space. However, simplicity is measured for the two unsphered combinations (\beta_1, \beta_2) and is maximized when the two vectors are (e_k, e_l), k \ne l. These maximizing combinations are orthogonal in the original data space and correspond to the variable k and variable l axes. Unless the two variables are uncorrelated, the maximum cannot be achieved by a pair of uncorrelated combinations.
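The cancellation in [1.24] is easy to confirm numerically. In the sketch below, arbitrary simulated values stand in for \ell_i(R) and \Theta; the paired squared sample moments are unchanged by any shift \tau:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.uniform(-np.pi, np.pi, 1000)   # stand-in angles
    w = rng.exponential(2.0, 1000)             # stand-in values of l_i(R)

    for tau in (0.3, 1.1, 2.5):
        for k in (1, 2, 3):
            before = (np.mean(w * np.cos(k * theta))**2
                      + np.mean(w * np.sin(k * theta))**2)
            after = (np.mean(w * np.cos(k * (theta + tau)))**2
                     + np.mean(w * np.sin(k * (theta + tau)))**2)
            assert np.isclose(before, after)   # [1.24] holds for the sample version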
To ensure that the algorithm can find a maximum, the interpretability of a plane is calculated after the linear combinations have been translated to orthogonality: given any plane defined by a pair (\beta_1, \beta_2) which satisfies the correlation constraint, the simplicity of the plane is calculated for an orthogonal pair. Unfortunately, the optimal translation is not known, so it is done in the following manner. Without loss of generality, \beta_1 is fixed and \beta_2 is spun in the plane until the two vectors are orthogonal. The spinning is done by projecting \beta_2 onto \beta_1 and taking the remainder. That is, \beta_2 is decomposed into the sum of a component which is parallel to \beta_1 and a component which is orthogonal to \beta_1. The new \beta_2, which is called \beta_2', is the latter component. Mathematically,

    \beta_2' = \beta_2 - \frac{\beta_2^T \beta_1}{\beta_1^T \beta_1} \, \beta_1 .    [1.25]

Whether S_v(\beta_1, \beta_2') is always greater than or equal to S_v(\beta_1, \beta_2) is not clear. However, as noted above, the maximum value of the index can be achieved given this translation. As an added bonus, the two combinations (\beta_1, \beta_2') are orthogonal in the original variable space, which is the usual graphic reference frame. This situation is in contrast to the original exploratory projection pursuit algorithm, which graphs with respect to the covariance metric as noted in Section 1.1.4. In effect, a further subjective simplification has been made, as the visual representation of the solution is more interpretable.

The final solution for any particular \lambda value is reported as the normed, translated set of vectors

    \left( \frac{\beta_1}{\|\beta_1\|} , \frac{\beta_2'}{\|\beta_2'\|} \right) .    [1.26]

Throughout the rest of this thesis, the orthogonal translation is assumed: whenever (\beta_1, \beta_2) is written, the actual pair is (\beta_1, \beta_2'). As a result, whenever S_v(\beta_1, \beta_2) is referred to, the actual value is S_v(\beta_1, \beta_2'), and [1.18] becomes [1.27], the same expression evaluated with \beta_2 replaced by \beta_2'.
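[1.25] is the familiar projection-and-remainder step; a sketch follows (function names are mine) that also produces the normed pair reported in [1.26]:

    import numpy as np

    def orthogonal_translation(beta1, beta2):
        """Spin beta2 in the plane to the component orthogonal to beta1 [1.25]."""
        parallel = (beta2 @ beta1) / (beta1 @ beta1) * beta1
        return beta1, beta2 - parallel

    def reported_solution(beta1, beta2):
        """Normed, translated set of vectors reported for a given lambda [1.26]."""
        b1, b2 = orthogonal_translation(beta1, beta2)
        return b1 / np.linalg.norm(b1), b2 / np.linalg.norm(b2)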
1.4.4 Comparison With Factor Analysis

Given that factor analysis is used to motivate the interpretability index, a comparison with this method is warranted. The two dimensional projection is defined as

    X = B Y ,

where B is the 2 x p linear combination matrix and the observed variables are Y = (Y_1, Y_2, ..., Y_p)^T. This definition is similar to the one for principal components in Section 1.1.1, except that in the latter case usually all p principal components are found, so that the dimension of B is p x p. In addition, principal components maximize a different index. The first attempt to simplify projection pursuit, due to Friedman (1987) and discussed in Section 1.1.4, involved rigidly spinning the projected points (X_1, X_2) in the solution plane. The new set of points is Q X = Q B Y as in [1.19], where Q is a two dimensional rotation matrix. Since the rotation is rigid, it maintains the correlation constraint. Thus simplification through spinning in the plane is achieved by multiplying the linear combination matrix B by Q.

The analogous p variable factor analysis model is

    Y = \Lambda f + \epsilon .

In this model, there are assumed to be two unknown underlying factors f = (f_1, f_2)^T, and \epsilon is a p x 1 error vector. The factor loading matrix \Lambda is p x 2 and is found by seeking to explain the covariance structure of the variables (Y_1, Y_2, ..., Y_p) given certain distributional assumptions. Due to the non-uniqueness of the model, the factors can be orthogonally rotated without changing the explanatory power of the model. A rotation is made in order to simplify the factor-loading matrix to a more interpretable solution, producing \Lambda Q^T. Taking the transpose of the new factor-loading matrix produces Q \Lambda^T, a 2 x p matrix which is comparable in dimensionality to the new exploratory projection pursuit linear combination matrix Q B. Thus, spinning the linear combinations is analogous to simplifying the factor-loading matrix in a two factor model.

A transpose is taken since the factor analysis linear combinations are of the underlying factors, while the exploratory projection pursuit combinations are of the observed variables. This comparison is analogous to that between factor analysis and principal components analysis. Interpretable exploratory projection pursuit involves rocking the solution plane. In this case, the linear combinations (\beta_1, \beta_2) are moved in R^p subject only to the correlation constraint. They are then translated via [1.25] to further increase interpretability. The more interpretable coefficients may not be linear combinations of the original ones. Such a move is not allowed in the factor analysis setting. The interpretable data analysis method may be thought of as a looser, less restrictive version of factor analysis rotation.

1.4.5 The Optimization Procedure

The interpretable exploratory projection pursuit objective function is

    \max_{(\beta_1, \beta_2)} \; (1 - \lambda) \, \frac{G_F(\beta_1^T Y, \beta_2^T Y)}{\max G_F} + \lambda \, S_v(\beta_1, \beta_2) ,    [1.28]

where S_v(\beta_1, \beta_2) is calculated after the translation [1.25]. The computer algorithm employed is similar to that of the original exploratory projection pursuit algorithm described in Section 1.1.2. The algorithm is outlined below and then comments are made on the specific steps.

0. Sphere the data [1.2].
1. Conduct a coarse search to find a starting point to solve the original problem [1.1] (or [1.28] with \lambda = 0).
2. Use an accurate derivative based optimization procedure to find the most structured plane.
3. Spin the solution vectors in the optimal plane to the most interpretable representation. Call this solution P_0.
4. Decide on a sequence (\lambda_1, \lambda_2, ..., \lambda_I) of interpretability parameter values. Let i = 1.
5. Use a derivative based optimization procedure to solve [1.28] with \lambda = \lambda_i and starting plane P_{i-1}. Call the new solution plane P_i.
6. If i = I, EXIT. Otherwise, set i = i + 1 and GOTO 5.

The search for the best plane is performed in the sphered data space, as discussed in Section 1.1.2. However, an important note is that the interpretability of the plane must always be calculated in terms of the original variables. The \beta_i combinations, not the \alpha_i combinations, are the ones the statistician sees. In fact, she is unaware of the sphering, which is just a computational shortcut behind the scenes.

The modification does require one important difference in the sphering. In Friedman (1987), the suggestion is to consider only the first q sphered variables, where q < p and a considerable amount of the variance is explained. The dropping of the unimportant sphered variables is the same as the dropping of unimportant components in principal components analysis and reduces the computational work involved. In Step 5, the interpretability gradients are calculated for (\beta_1, \beta_2) and then translated to the sphered space via the inverse of [1.5],

    \alpha_1 = D^{1/2} U^T \beta_1 ,  \qquad  \alpha_2 = D^{1/2} U^T \beta_2 .    [1.29]

If the gradient components are nonzero only in the p - q dropped dimensions, they become zero after translation. The derivative based optimization procedure then assumes it is at the maximum and stops, even though the maximum has not been reached. Thus, no reduction in dimension during sphering should be made.
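The sphering step and the translation [1.29] can be sketched as below; the spectral form cov(Y) = U D U^T is an assumption consistent with the notation, and, per the warning above, every eigenvalue is retained (q = p):

    import numpy as np

    def sphere(Y):
        """Step 0: center, rotate to principal axes, and scale to unit variance."""
        Yc = Y - Y.mean(axis=0)
        D, U = np.linalg.eigh(np.cov(Yc, rowvar=False))  # cov(Y) = U D U'
        Z = Yc @ U @ np.diag(D**-0.5)                    # sphered data
        return Z, U, D

    def beta_to_alpha(beta, U, D):
        """Translate an unsphered combination to the sphered space [1.29]."""
        return np.diag(D**0.5) @ U.T @ beta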
The coarse search in Step 1 is done to ensure that the algorithm starts in the vicinity of a large local maximum. The procedure which Friedman (1987) employs is based on the axes in the sphered data space. He finds the most structured pair of axes and then takes large steps through the sphered space. Since the interpretability measure S_v is calculated in the original variable space, a feasible alternative might be to coarse step through the original rather than the sphered data space. For example, the starting point could be the most structured pair of original axes; this pair of combinations is in fact the simplest possible. On the other hand, stepping evenly through the unsphered data space might not cover the data adequately, as the points could be concentrated in some subspace due to covariance structure. Sphering solves this problem. In all data examples tried so far, the starting point did not have an effect on the final solution.

In Steps 2 and 5, the accurate optimization procedure used is the program NPSOL (Gill et al. 1986). This package solves nonlinear constrained optimization problems. The search direction at any step is the solution to a quadratic programming problem. The package is employed to solve for the sphered combinations (\alpha_1, \alpha_2) subject to the length and orthogonality constraints [1.4]. The gradients for the projection index G_F are straightforward. The interpretability index S_v derivatives are more difficult, as they involve translations from the uncorrelated to orthogonal combinations [1.25] and from the sphered to unsphered space [1.5]. These gradients are given in Appendix A. At present, work continues to design a steepest descent algorithm which maintains the constraints. However, given the complicated translations between the sphered and unsphered spaces, this problem is a difficult one. The package NPSOL is extremely powerful.

Step 3 can be performed in two ways. The initial pair of vectors can be spun discretely in the plane to the most interpretable representation, or Steps 5 and 6 can be run with \lambda equal to a very small value, say 0.01. This slight weight on simplicity does not overpower the desire for structure. The plane is not permitted to rock, but spinning is allowed. The result is the most interpretable representation of the most structured plane. As noted previously, this spinning is similar to Friedman's (1987) simplification, except that a two vector varimax interpretability index is used instead of a single vector one.

The initial value of the projection index G_F is used as max G_F in the denominator of the first term in [1.28]. However, the algorithm may be caught in a local maximum, and as the weight on simplicity is increased, the procedure may move to a larger maximum. Thus, the contribution of the projection index term may at some time be greater than one. This is an unexpected benefit of the interpretable projection pursuit approach: both structure and interpretability have been increased.

In the examples tried, the algorithm is not very sensitive to the \lambda sequence choice as long as the values are not too far apart. For example, the sequence (0.0, 0.1, ..., 1.0) produces the same solutions as (0.0, 0.05, 0.1, ..., 1.0), but the sequence (0.0, 0.25, 0.5, 1.0) does not. Throughout the loop in Steps 5 and 6, the previous solution at \lambda_{i-1} is used as the starting point for the application of the algorithm with \lambda_i. This approach is in the spirit of rocking the solution away from the original solution. Examples have shown that the objective function is fairly smooth and that initially a large gain in simplicity is made in turn for a small loss in structure.
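Steps 4 through 6 amount to a continuation loop in \lambda with warm starts. A sketch follows, with SciPy's SLSQP standing in for NPSOL (an assumption of convenience; NPSOL itself is the package actually used) and with `objective(alpha, lam)` assumed to return the negative of [1.28] so that minimization corresponds to maximization:

    import numpy as np
    from scipy.optimize import minimize

    def lambda_path(objective, constraints, alpha0,
                    lambdas=np.arange(0.0, 1.01, 0.1)):
        """Solve [1.28] over an increasing sequence of interpretability
        weights, seeding each solve with the previous solution plane."""
        solutions, alpha = [], np.asarray(alpha0, dtype=float)
        for lam in lambdas:
            res = minimize(objective, alpha, args=(lam,),
                           method="SLSQP", constraints=constraints)
            alpha = res.x                  # P_{i-1} warm-starts lambda_i
            solutions.append((lam, alpha))
        return solutions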
As remarked in Section 1.1.4 when the example was considered, the data Y is usually standardized before analysis. The reported combinations are [1.26]; thus the coefficients represent the relative importance of the variables in each combination. The next section consists of the analysis of an easy example followed by a return to the automobile example.

1.5 Examples

In this section, two examples of interpretable exploratory projection pursuit are examined. The first is an example of the one dimensional algorithm, while the second is the two dimensional algorithm applied to the automobile data analyzed in Section 1.1.3. Several implementation issues are discussed at the end of the section.

1.5.1 An Easy Example

The simulated data in this example consists of 200 (n) points in two (p) dimensions. The horizontal and vertical coordinates are independent and normally distributed with means of zero and variances of one and nine respectively. The data is spun about the origin through an angle of thirty degrees. Three is subtracted from each coordinate of the first fifty points and three is added to each coordinate of the remaining points. The data appear in Fig. 1.5. Since the data is only in two dimensions and can be viewed completely in a scatterplot, using interpretable exploratory projection pursuit in this instance is quite contrived. However, this exercise is useful in helping understand the way the procedure works and its outcome.

Fig. 1.5  Simulated data with n = 200 and p = 2.

The algorithm is run on the data using the varimax index S_1 for a single vector and the Legendre index G_L. The rotational invariance and axes restriction modifications are irrelevant in this situation since a one dimensional solution is sought. The values of the simplicity parameter \lambda for which the solutions are found are (0, 0.1, 0.2, ..., 1.0).

The most structured line through the data should be at about thirty degrees when the data is projected onto it, or the first principal component of the data. From Fig. 1.5, when projected onto this line, the observations are split into two groups. The most interpretable lines in the entire space, R^2 in this case, are the horizontal and vertical axes. The algorithm should move the solution line toward the more structured of these two axes. If the data is projected onto the vertical axis, the two clusters merge into one. In the horizontal projection, the two groups only overlap slightly. Thus the horizontal axis exhibits the most structure.

Fig. 1.6  Projection and interpretability indices versus \lambda for the simulated data. The projection index values are normalized by the \lambda = 0 value and are joined by the solid line. The simplicity index values are joined by the dashed line.

Fig. 1.6 shows the values of the projection and interpretability indices versus the values of \lambda. The projection index begins at 1.0, as it is normed, and moves downward to about 0.3 as the weight on simplicity increases. The simplicity index begins at about 0.2 and increases to 1.0. In addition to this graph, the statistician should view the various projections. Fig. 1.7 shows the projection histograms, values of the indices, and the linear combinations for four chosen values of \lambda. The histogram binwidths are calculated as twice the interquartile range divided by n^{1/3}, and the histograms are normed to have area one.
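The simulated data and the binwidth rule can be reproduced directly; a sketch (the random seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    pts = np.column_stack([rng.normal(0.0, 1.0, n),    # variance one
                           rng.normal(0.0, 3.0, n)])   # variance nine

    phi = np.deg2rad(30.0)                             # spin by thirty degrees
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    pts = pts @ R.T

    pts[:50] -= 3.0     # three subtracted from each coordinate, first fifty
    pts[50:] += 3.0     # three added to each coordinate, the rest

    def binwidth(z):
        """Twice the interquartile range divided by the cube root of n."""
        q25, q75 = np.percentile(z, [25, 75])
        return 2.0 * (q75 - q25) / len(z)**(1.0 / 3.0)

Norming a histogram to have area one corresponds to the density option of standard histogram routines.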
Fig. 1.7  Projected simulated data histograms for various values of \lambda. The values of the indices and combinations are

    \lambda = 0.0 :  G_L = 1.00 ,  S_1 = 0.21 ,  \beta = (0.85, 0.52)^T
    \lambda = 0.3 :  G_L = 0.93 ,  S_1 = 0.47 ,  \beta = (0.92, 0.47)^T
    \lambda = 0.5 :  G_L = 0.78 ,  S_1 = 0.71 ,  \beta = (0.97, 0.24)^T
    \lambda = 1.0 :  G_L = 0.27 ,  S_1 = 1.00 ,  \beta = (1.00, 0.00)^T

As predicted, the algorithm moves from an angle of about thirty degrees to the horizontal axis. As the weight on simplicity is increased, the loss of structure is evident in the merging of the two groups. In the final histogram, the two groups are overlapping. Also important to note is the comparison between the first and second histograms: a loss of 7% in the value of the projection index is traded for a gain in simplicity of 0.26. However, the two histograms are virtually indistinguishable to the eye.

1.5.2 The Automobile Example

The automobile data discussed in Section 1.1.3 is re-examined using two dimensional interpretable exploratory projection pursuit. Recall that this data consists of 392 (n) observations of ten (p) variables. The interpretable exploratory projection pursuit analysis uses the Fourier G_F index instead of the Legendre G_L index due to the desire for rotational invariance, as discussed in Section 1.4.1. Naturally, this choice results in a different starting point for the algorithm, the \lambda = 0 or P_0 plane in the notation of Section 1.4.5.

Fig. 1.8 shows the P_0 plane using the Fourier index. The corresponding P_0 solution for the Legendre index is shown in Fig. 1.9. This plot is the same as Fig. 1.1 but with the same limits as Fig. 1.8 for comparison purposes. Comparison of these two solutions demonstrates the lack of rotational invariance of the Legendre projection. In Fig. 1.8, the dashed axes are (\beta_1, \beta_2), which are orthogonal in the original variable space and are the simplest representation of the plane. In Fig. 1.9, the dashed axes are (\beta_1, \beta_2) which maximize G_L and are orthogonal in the covariance metric. Rigidly rotating the plane maintains the correlation constraint [1.6]. However, spinning these axes through different angles changes the Legendre index, as shown in Table 1.1: the new marginals after a rotation are less clustered, and therefore G_L is lower. The value of the Fourier index G_F for the projection in Fig. 1.8 is 0.21, while the value for the projection in Fig. 1.9 is 0.19. The Legendre index G_L value for the second is 0.35, as previously reported. For the projection in Fig. 1.8, the Legendre index varies from 0.19 to 0.21, depending on the orientation of the axes.

Fig. 1.8  Most structured projection scatterplot of the automobile data according to the Fourier index. The dashed axes are the solution combinations which define the plane.

Both indices find projections which exhibit clustering into two groups.
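Table 1.1 amounts to re-evaluating an index while the axes of a fixed plane are rigidly spun. A sketch follows; `index_fn` is a placeholder for an index such as G_L (whose implementation is not reproduced here), and the toy index at the end is only to make the example run:

    import numpy as np

    def spin_scan(x, y, index_fn, angles=np.linspace(0.0, np.pi/2, 11)):
        """Evaluate index_fn(x', y') as the axes are spun through a grid of
        angles; a rotationally invariant index returns a constant scan."""
        values = []
        for t in angles:
            xr = np.cos(t) * x + np.sin(t) * y     # rigidly rotated axes
            yr = -np.sin(t) * x + np.cos(t) * y
            values.append(index_fn(xr, yr))
        return np.array(values)

    # Toy axis-dependent stand-in, for illustration only:
    toy_index = lambda x, y: abs(np.mean(x**3)) + abs(np.mean(y**3))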
If the Legendre index combinations are translated to an orthogonal pair via [1.25], they become

    \beta_1 = (-0.21, -0.03, -0.91, 0.16, 0.30, -0.05, -0.01, 0.03, 0.00, -0.02)^T
    \beta_2 = (-0.80, -0.14, 0.27, 0.49, -0.02, 0.03, -0.16, -0.02, 0.02, -0.01)^T

and have interpretability measure S_v(\beta_1, \beta_2) = 0.50. Of course, this may not be the simplest representation of the plane, though the orthogonalizing transformation may help. The Fourier index combinations originally are

    \beta_1 = (-0.06, -0.22, -0.79, 0.34, -0.05, 0.02, -0.14, -0.05, 0.16, 0.04)^T
    \beta_2 = ( 0.00, -0.44, 0.11, -0.53, 0.82, -0.03, 0.05, -0.01, 0.04, -0.06)^T

and have interpretability measure 0.17. These axes are spun to the simplest representation as discussed in Section 1.4.5 and orthogonalized. The resulting axes are shown in Fig. 1.8 and the combinations are

    \beta_1 = (-0.03, -0.12, -0.95, -0.01, 0.29, -0.02, 0.00, -0.03, -0.02, 0.00)^T
    \beta_2 = (-0.02, 0.16, -0.09, 0.94, -0.19, 0.09, -0.01, 0.18, 0.07, -0.06)^T

with interpretability measure 0.80.

Fig. 1.9  Most structured projection scatterplot of the automobile data according to the Legendre index. The dashed axes are the solution combinations which define the plane; they maximize G_L and are orthogonal in the covariance metric.

Fig. 1.10 shows the values of the Fourier and interpretability indices for simplicity parameter values \lambda = (0.0, 0.1, ..., 1.0), analogous to Fig. 1.6 for the simulated data. This example demonstrates that the projection index may increase as the interpretability index does, possibly because the algorithm gets bumped out of a projection local maximum as it moves toward a simpler solution.

Fig. 1.10  Projection and interpretability indices versus \lambda for the automobile data. The projection index values are normalized by the \lambda = 0 value and are joined by the solid line. The simplicity index values are joined by the dashed line.

Given this plot, the statistician may then choose to view several of the solution planes for specific \lambda values. Six solutions are shown in Fig. 1.11; the actual values of the combinations are given in Table 1.2. If coefficients less than 0.05 in absolute value are replaced by a dash, Table 1.3 results. In some sense, this action of discarding 'small' coefficients is contrary to the continuous nature of the interpretability measure. However, the second table shows the rate at which coefficients decrease. Due to the squaring in the index S_v, the coefficients move quickly to one but slowly to zero. This convergence problem suggests investigating the use of a power less than two in the index in the future, as discussed in Section 3.4.3. The tables also demonstrate the global simplicity of the two combinations: if one variable is in a combination, it usually is absent in the other.
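The abbreviation used in Table 1.3 is a one-line thresholding rule; for example, applied to the \lambda = 0.8 first combination of Table 1.2:

    def abbreviate(coefs, cutoff=0.05):
        """Replace coefficients smaller than the cutoff in absolute value
        by a dash, as in Table 1.3."""
        return ["-" if abs(c) < cutoff else f"{c:5.2f}" for c in coefs]

    print(abbreviate([0.02, -0.04, -0.99, 0.01, 0.09,
                      0.04, 0.02, 0.02, -0.01, 0.00]))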
Fig. 1.11  Projected automobile data scatterplots for various values of \lambda. The values of the indices can be seen in Fig. 1.10 and the combinations are given in Table 1.2.

    0.4    0.00 -0.09 -0.97  0.06  0.22 -0.01  0.02 -0.02 -0.02  0.00
          -0.06  0.15  0.00  0.95 -0.15  0.08 -0.03  0.19  0.07 -0.06
    0.5    0.00 -0.08 -0.98  0.04  0.16 -0.02  0.03 -0.02 -0.02 -0.01
          -0.03  0.12  0.01  0.97 -0.11  0.12 -0.03  0.14  0.06 -0.04
    0.6    0.01 -0.05 -0.99  0.01  0.12 -0.01  0.04  0.00 -0.01 -0.01
          -0.03  0.06  0.00  0.98 -0.07  0.11 -0.03  0.11  0.05 -0.03
    0.7    0.01 -0.05 -0.99  0.02  0.09 -0.01  0.05 -0.01 -0.02 -0.01
          -0.02  0.06  0.01  0.98 -0.04  0.10 -0.04  0.11  0.05 -0.03
    0.8    0.02 -0.04 -0.99  0.01  0.09  0.04  0.02  0.02 -0.01  0.00
          -0.02 -0.02  0.01  0.99 -0.02 -0.02  0.03  0.01  0.04 -0.09
    0.9    0.01 -0.03 -1.00  0.00  0.03  0.03  0.02  0.01  0.00 -0.01
           0.00 -0.01  0.00  1.00  0.00  0.01  0.03  0.02  0.03 -0.07
    1.0    0.00  0.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
           0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00

Table 1.2  Linear combinations for the automobile data. The linear combinations are given for the range of \lambda values. The first row for each \lambda value is \beta_1^T and the second is \beta_2^T.

    0.4     -   -0.09 -0.97  0.06  0.22   -     -     -     -     -
          -0.06  0.15   -    0.95 -0.15  0.08   -    0.19  0.07 -0.06
    0.5     -   -0.08 -0.98   -    0.16   -     -     -     -     -
            -    0.12   -    0.97 -0.11  0.12   -    0.14  0.06   -
    0.6     -   -0.05 -0.99   -    0.12   -     -     -     -     -
            -    0.06   -    0.98 -0.07  0.11   -    0.11  0.05   -
    0.7     -   -0.05 -0.99   -    0.09   -    0.05   -     -     -
            -    0.06   -    0.98   -    0.10   -    0.11  0.05   -
    0.8     -     -   -0.99   -    0.09   -     -     -     -     -
            -     -     -    0.99   -     -     -     -     -   -0.09
    0.9     -     -   -1.00   -     -     -     -     -     -     -
            -     -     -    1.00   -     -     -     -     -   -0.07
    1.0     -     -   -1.00   -     -     -     -     -     -     -
            -     -     -    1.00   -     -     -     -     -     -

Table 1.3  Abbreviated linear combinations for the automobile data. The linear combinations are given for the range of \lambda values as in Table 1.2. A dash replaces any coefficient less than 0.05 in absolute value.

Another useful diagnostic tool, fashioned after ridge regression trace graphs, is shown in Fig. 1.12. For variables whose coefficients are at least 0.10 in absolute value for some value of \lambda, the values of the coefficients in each combination are plotted versus \lambda.

Fig. 1.12  Parameter trace plots for the automobile data. The values of the parameters are given for those variables whose coefficients are large enough. The parameter values (\beta_{1i} solid, \beta_{2i} dashed) are plotted versus \lambda.
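Fig. 1.12 can be produced with a few lines; the sketch below (matplotlib, with names of my choosing) assumes the combinations for the \lambda grid are stacked row-wise in B1 and B2:

    import numpy as np
    import matplotlib.pyplot as plt

    def trace_plot(lambdas, B1, B2, cutoff=0.10):
        """Ridge-style trace plot: beta_1i solid, beta_2i dashed, for every
        variable whose coefficient reaches the cutoff at some lambda."""
        for i in range(B1.shape[1]):
            if max(np.abs(B1[:, i]).max(), np.abs(B2[:, i]).max()) >= cutoff:
                plt.plot(lambdas, B1[:, i], "-", label=f"beta1, Y{i+1}")
                plt.plot(lambdas, B2[:, i], "--", label=f"beta2, Y{i+1}")
        plt.xlabel("lambda"); plt.ylabel("coefficient"); plt.legend()
        plt.show()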
One of the most interesting solution planes is the one for \lambda = 0.8. The combinations involve four variables which are split into two pairs, one pair in each combination. The first combination involves the negative of the third variable, engine size, and the fifth variable, automobile weight. The second combination involves engine power and the negative of the gaussianized Japanese flag. Fig. 1.13 shows the solution plane with the type of car (Japanese or Non-Japanese) delineated by the plotting symbol.

Fig. 1.13  Country of origin projection scatterplot of the automobile data for \lambda = 0.8. American and European cars are plotted as points, Japanese cars as asterisks.

The statistician must decide when the tradeoff between simplicity and accuracy should be halted. In this example, the \lambda = 0.8 model is a good stopping place, especially since the projection index actually increased from the previous \lambda value. However, the decision may be difficult. No external measurement such as prediction error can be used to judge a plane's usefulness. Rather, the decision rests with the statistician.

Given that this is an exploratory technique, the object is to find all interesting views, and no doubt this data has a wealth of planes which exhibit structure; two are evident in Figs. 1.8 and 1.9. Friedman (1987) presents a method for structure removal. He recursively applies a transformation to the structured plane which normalizes the structure in it yet leaves other orthogonal planes undisturbed. After a structured plane is found and simplified, its structure should be removed without disturbing other interesting planes. Interpretable exploratory projection pursuit can employ the same procedure to find several interesting planes: once the \lambda value and hence the particular plane have been chosen, the structure is removed and the next plane is found and simplified as desired.

The interpretability of a collection of solution planes could also be considered. For example, two planes might be simple in relation to each other if they are orthogonal. However, since the structure removal procedure does not affect structure in orthogonal planes, in practice solution sets of planes tend to be orthogonal anyway.

Originally, exploratory projection pursuit was interactive, as mentioned in Section 1.1.1. After choosing three variables, the statistician used a joystick to rotate the point cloud in real time. A fourth dimension could be shown in color. The statistician could then pick out structure by eye. However, this process was time-consuming and only allowed the statistician to view combinations of four variables at best. The solution was to automate the structure identification by defining an index which measures structure mathematically and to use a computer optimizer. The statistician loses some of her individual choice in structure identification, but she can define her own index if desired.

Interpretable exploratory projection pursuit is an analogous automation of variable selection. The interpretability index is a mathematical measure which, coupled with a numerical optimization routine, takes the place of interactively choosing which three or four variables to view.

Chapter 2
Interpretable Projection Pursuit Regression

This chapter describes interpretable projection pursuit regression. It is similar in organization to the previous chapter, though condensed, as several issues common to both modifications have been addressed previously. Section 2.1 deals with the original algorithm, reviewing the notation and strategy. In the second section, the modification is detailed. Due to the differing goals of exploratory projection pursuit and projection pursuit regression, the interpretability index must be changed slightly from that of Chapter 1. The final section consists of an example.

2.1 The Original Projection Pursuit Regression Technique

The original projection pursuit regression technique is presented in Friedman and Stuetzle (1981).
Friedman (1984a, 1985) improves several algorithmic features and extends the approach to include classification and multiple response regression in addition to single response regression. In this chapter, only single response regression is considered.

2.1.1 Introduction

The easiest way to understand projection pursuit regression is to consider it as a generalization of ordinary linear regression. Many authors motivate projection pursuit regression in this manner, among them Efron (1988) and McDonald (1982). For ease of notation, suppose the means are removed from each of the predictors X = (X_1, X_2, ..., X_p)^T. The goal is to model the response Y as a linear function of the centered X. With the usual assumptions and slightly unfamiliar notation, the single response linear regression model may be written

    Y - E[Y] = w^T X + \epsilon ,    [2.1]

where \epsilon is the random error term with zero mean. The vector w = (w_1, w_2, ..., w_p)^T consists of the regression coefficients. In general, this linear function of Y given particular values of the predictors x = (x_1, x_2, ..., x_p) is estimated by the conditional expectation. The fitted value of Y is

    \hat{y}(x) = E[Y] + w^T x .    [2.2]

The expected value of Y is estimated by the sample mean. The expected L_2 distance between the true and fitted random variables is

    L_2(w, X, Y) \equiv E[Y - \hat{Y}]^2 .

The parameters w of the model are estimated by

    \min_w L_2(w, X, Y) .    [2.3]

In practice, the sample mean over the n data points replaces the population mean.

The model [2.1] may be written

    Y - E[Y] = \beta (\alpha^T X) + \epsilon ,    [2.4]

where \beta = \sqrt{w^T w} and \alpha = (\alpha_1, \alpha_2, ..., \alpha_p)^T is w normed. The parameters \beta and \alpha may be estimated via equation [2.3], subject to the constraint that \alpha^T \alpha = 1, in a manner analogous to [2.2]. The rewritten model [2.4] shows that Y depends only on the projection of the predictor variables X onto the direction \alpha. The relationship between the fitted values and the projection \alpha^T X is a straight line. A natural generalization is to allow this relationship to vary, allowing the fitted value to be a sum of univariate functions of projections which are smooth but otherwise unrestricted. Projection pursuit regression does just that. The projection pursuit regression model with m terms is

    Y - E[Y] = \sum_{j=1}^m \beta_j f_j(\alpha_j^T X) + \epsilon .    [2.5]

The combinations \alpha_j, which define the directions, are restricted to have length one. In addition, the functions f_j are smooth and have zero mean and unit variance. The parameters \beta_j capture the variation between the terms. In the usual way, the conditional mean is used to estimate the sum as

    \hat{y}(x) = E[Y] + \sum_{j=1}^m \beta_j f_j(\alpha_j^T x)

with the parameters estimated as in [2.3].

Analogous to exploratory projection pursuit, projection pursuit regression may be more successful than other nonlinear methods by working in a lower dimensional space (Huber 1985). The model [2.5] is useful when the underlying relationship between the response and the predictors is nonlinear, as opposed to ordinary linear regression [2.4], and when the relationship is smooth, as opposed to other nonlinear methods such as recursive partitioning. Diaconis and Shahshahani (1984) show that any function can be approximated by the model [2.5] for a large enough number of terms m. Substantial theoretical work remains to be done with respect to the properties of the method. In addition, the numerical aspects of the algorithm are difficult, as discussed in Section 2.2.3.

2.1.2 The Algorithm

The parameters \beta_j, \alpha_j and the functions f_j are estimated by minimizing

    \min_{\beta_j, \alpha_j, f_j : j = 1, ..., m} L_2(\beta, \alpha, f, X, Y)
    subject to \ \alpha_j^T \alpha_j = 1 , \ E[f_j] = 0 \ \mathrm{and} \ Var[f_j] = 1 , \ j = 1, ..., m .    [2.6]

The criterion [2.6] cannot be minimized simultaneously for all the parameters. However, if certain ones are fixed, the optimal values of the others are easily solved for. Friedman (1985) employs such an 'alternating' optimization strategy. His results are discussed in Section 2.2.3, as they are pertinent when the modified problem is considered. First, consider a specific term k; [2.6] may be written

    \min_{\beta_k, \alpha_k, f_k} E[ R_k - \beta_k f_k(\alpha_k^T X) ]^2 ,    [2.7]

where

    R_k \equiv Y - E[Y] - \sum_{j \ne k} \beta_j f_j(\alpha_j^T X) .

For the kth term, the three sets of parameters are estimated in turn while all others are held constant. After \beta_k, \alpha_k, and f_k have been found, the next term is considered. The algorithm cycles through the terms in the model until the objective in [2.6] does not decrease sufficiently. The alternating strategy is discussed in more detail in Section 2.2.3.

The minimizing \beta_k is the least squares coefficient

    \beta_k = \frac{ E[ R_k f_k(\alpha_k^T X) ] }{ E[ f_k^2(\alpha_k^T X) ] } .    [2.8]

The minimizing f_k for any particular direction \alpha_k is found using the nonparametric smoother discussed in Friedman (1984b) [2.9]. The resulting curve is standardized to satisfy the mean and variance constraints. It is not expressed as a mathematical function but rather as an estimated value for each observation. The minimizing \alpha_k cannot be determined directly, as seen in [2.7]. Minimizing the criterion [2.6] as a function of \alpha_k is a least squares problem and requires an iterative search procedure. In applying such methods, the goal is to use as much information about the function to be minimized as possible. Thus, rather than use a method which only employs first derivatives, a method which also uses the actual or approximate Hessian is preferable if the difficulty of applying it does not outweigh the additional accuracy. Usually, if the function to be minimized is of least squares form, the Hessian simplifies and is easily approximated (Gill et al. 1981). Friedman (1985) capitalizes on this fact and uses the Gauss-Newton iterative procedure to find the optimal \alpha_k.

2.1.3 Model Selection Strategy

Model selection in the original projection pursuit regression consists of choosing the number of terms m in the model. Interpretable projection pursuit regression considers not only the number of terms but also the interpretability of those terms.

Friedman (1985) suggests starting the algorithm with a large number of terms M. The procedure for finding this model is discussed in Section 2.2.3. A model with M - 1 terms is then determined using the M - 1 most important terms in the previous model as a starting point. The importance of a term k in a model of m terms is defined as

    I_k \equiv \frac{ |\beta_k| }{ \max_j |\beta_j| } .    [2.10]

Since the functions f_j are constrained to have variance one, |\beta_j| measures the contribution of the jth term to the model.

The statistician may then plot the value of the objective in [2.6], which is the model's residual sum of squares, versus the number of terms for each model. In most cases, the plot has an elbow shape. The usual advice is to choose the model closest to the tip of the elbow, where the increase in accuracy of an additional term is not worth the increased complexity.

2.2 The Interpretable Projection Pursuit Regression Approach

A projection pursuit regression analysis produces a series of models with number of terms m = 1, ..., M. The models' nonlinear components are the functions f_j. Since these functions are smooths and not functionally expressed, they are visually assessed by the statistician. Each is considered along with its associated combination \alpha_j in order to understand what aspect of the data it represents. The parameters \beta_j measure the relative importance of the terms.

Each direction \alpha_j must be considered in the context of the original variables. The collection of combinations is not subject to any global restriction such as the correlation constraint in exploratory projection pursuit. As with any regression technique, a variable selection method which causes the same subgroup of variables to appear in all the combinations is desirable for parsimony. Arguments similar to those in Section 1.2 show that a combinatorial approach to variable selection in projection pursuit regression is not feasible. As before, a weighted criterion balancing goodness-of-fit and interpretability is employed. The minimization problem becomes

    \min_{\beta_j, \alpha_j, f_j : j = 1, ..., m} (1 - \lambda) \frac{ L_2(\beta, \alpha, f, X, Y) }{ \min L_2 } - \lambda S(\alpha_1, \alpha_2, ..., \alpha_m)    [2.11]

for \lambda \in [0, 1]. The denominator in the first term causes both contributions to lie in [0, 1], and the interpretability index S is subtracted since the objective is minimized. The simplicity index S measures the interpretability of the collection of combinations. Index choices are discussed in the next two subsections.

2.2.1 The Interpretability Index

Consider for the moment that the number of terms m is fixed. This assumption is discussed in Section 2.2.2. The goal is to define an interpretability index for a group of m vectors (\alpha_1, \alpha_2, ..., \alpha_m). At first glance, the index S is a generalization of the situation in interpretable exploratory projection pursuit, where the simplicity of two vectors (\beta_1, \beta_2) is measured. However, in the latter case the vectors define a plane, and the simplest pairs of combinations are ones which, when normed and squared, are orthogonal. This type of orthogonality, named squared orthogonality, is discussed in Section 2.3.5. Measures of this squared orthogonality and of each vector's individual interpretability enter the index S_v [1.18]. Within each combination, variables are selected via the single vector varimax index S_1 [1.12] by encouraging the vector coefficients to vary. Different variables are selected for the two vectors by forcing the combinations to exclude the same variables.

Projection pursuit regression has a different goal. First, each vector should be simple within itself; that is, homogeneity of the coefficients within a vector is penalized, and the varimax index S_1 for one vector achieves this. However, in the interest of decreasing the total number of variables in the model, the same small subset of variables should be nonzero in each vector. Summing the simplicities as in [1.17] does not necessarily force the vectors to include the same variables. On the other hand, if the exploratory squared orthogonality measure [1.16] for q = m vectors is used, the vectors would be forced to be squared orthogonal and would contain different variables. Though this old index may be used for projection pursuit regression, if variable selection is the object, a new index must be developed to achieve the desired outcome.

Consider the m x p matrix \Omega of combinations. Though the directions \alpha_j are constrained to have length one in [2.6], the index is defined for all sets of combinations in general. The goal is that all the combinations are nonzero in, or load into, the same columns, so that the same variables appear in all combinations. An index which forces this outcome is based on a summary vector \gamma of the matrix \Omega whose components are

    \gamma_i \equiv \sum_{j=1}^m \frac{ \alpha_{ji}^2 }{ \alpha_j^T \alpha_j } ,  \qquad  i = 1, ..., p .    [2.12]

Each component \gamma_i is positive and contains the sum of each term's relative contribution to variable i, or the total column weight. The components of the vector sum to m. The object is to force these components to be as varied as possible. For a single combination, this is achieved via the varimax interpretability index S_1 [1.12]. For \gamma appropriately normed, this measure is the varimax index applied to \gamma [2.13]. The index [2.13] forces the dispersion of the total column weights of \Omega, each accumulated over the rows of the matrix, or terms of the model, to be large, so that the columns load unevenly and the overall weights of the combinations concentrate on the same variables. As a result, this index is called S_3. This is in contrast to S_v for interpretable exploratory projection pursuit [1.18], in which the cross-product term over the combinations is subtracted in order to force the squared orthogonality, that is, to force the columns to be dissimilar, while S_1 measures the individual interpretability of each combination. An example using this index is discussed in Section 2.3.

2.2.2 Attempts to Include the Number of Terms

In the discussion so far, the number of terms m in the model is fixed. The argument that a model with fewer terms is more interpretable than one with more is persuasive. However, on further reflection, this conclusion does not appear so certain. Each term involves its combination \alpha_j and the smooth function f_j. The interpretability of the combination can be measured, but the function must be visually assessed.

Consider the following example with two variables X_1 and X_2 (p = 2). Without knowing the data context, ranking the two fitted models

    (1)  m = 2 :  f_1(X_1), f_2(X_2)
    (2)  m = 1 :  f((X_1 + X_2)/2)

in order of interpretability is impossible. The first model involves two functions of one variable each, the simplest possible combinations; the second involves only one combination, the average. Situations of widely varying interpretability can easily be imagined in which clearly one or the other model is more interpretable. For example, if the two variables are measurements of distinct traits (apples vs. oranges), then a combination of the two would be difficult to understand and Model (1) would be preferable. However, if the two were aggregate measures of similar characteristics (reading and spelling scores) which could be combined easily into a single variable (verbal ability), the second model might be easier to interpret.

As with interpretable exploratory projection pursuit, interpretable projection pursuit regression can be envisioned as an interactive process in which the statistician is equipped with a dial. As the dial is turned, the weight (\lambda) in [2.11] is increased and the model becomes more simple. The number of terms to begin with is m, and if this parameter is included in the measure of simplicity, the algorithm should drop terms as it simplifies the model. In the following discussion, three methods for including m in the interpretability index [2.13] are considered. Factors required to put the index values in [0, 1] are ignored. Unfortunately, none of the resulting indices works in practice for the reasons given below. However, considering ways to include the number of terms in an interpretability measure of a model is enlightening even if the present attempts are unsuccessful.

Since the interpretability of a model decreases with the number of terms, the first attempt is to multiply the interpretability index [2.13] by a factor decreasing in m, obtaining S_a(\alpha_1, \alpha_2, ..., \alpha_m). The resulting index S_a decreases as the number of terms increases. However, this measure does not work in practice, as each term's contribution is reweighted when the number of terms changes. Instead, the index should be such that submodels of the current model contribute the same to the index, regardless of the size of the complete model.

Each combination can be thought of as a point. As terms are added, the number of points increases, and the total distance from a set of points to a particular point increases as the number of points does. Since the object is that all the terms contain the same variables, a plausible interpretability index is

    S_b(\alpha_1, \alpha_2, ..., \alpha_m) \equiv - \min_{v_l \in V} \sum_{j=1}^m \| v_{\alpha_j} - v_l \|^2 .

As in Chapter 1, the set V is composed of simple vectors [1.14] and v_{\alpha_j} is the squared and normed version of the \alpha_j term [1.13]. This index is similar to the one dimensional index [1.15]. The minimum is not taken for each individual term j but of the sum of distances, in order to ensure that all the terms simplify to the same interpretable vector v_l. If the set V consists of the e_l (l = 1, ..., p), S_b reduces to

    - \left( \sum_{j=1}^m \| v_{\alpha_j} \|^2 - 2 \gamma_k + m \right) ,

where \gamma_k is the largest \gamma component as defined in [2.12]. Unfortunately, even with the removal of the minimization, the derivatives of the given index are discontinuous.

Averaging all the possible distances, each raised to a power r, produces a third index S_c which is continuous. The power r must be chosen so that the distance toward the closest v_l overpowers the others and pulls each term toward it. Unfortunately, though continuous and though it weights each term, as opposed to S_a, the index S_c does not force all the terms to collapse toward the same simple vector v_l.

In conclusion, none of the three attempts at incorporating the number of terms into the index works, and a method for simultaneously and smoothly measuring both the interpretability of the terms themselves and the number of terms has not been found yet. The outlook for such a method is poor. As a result, model selection cannot be reduced to a strict one parameter process; instead, the procedure is to explore the model space, in the spirit of linear regression variable selection, with the interpretability parameter \lambda controlling the tradeoff between interpretability and accuracy for each fixed number of terms. How the statistician should compare these models is discussed in Section 2.3. The optimization procedure is described in the next subsection.

2.2.3 The Optimization Procedure

The strategy is to begin with a large number of terms M. For each number of terms m = 1, ..., M, the following algorithm is employed to find a sequence of interpretable projection pursuit regression models for different values of the interpretability parameter \lambda. Steps 1 and 2 find the original projection pursuit regression model with m terms; Steps 3 and 4 find the sequence of models which minimize [2.11] for various \lambda values. Throughout the description, updating a parameter means noting the new value and basing subsequent calculations on it. Moving a model means permanently changing its parameter values.

1. The objective function is [2.6]. Use the stagewise modeling procedure outlined in Friedman and Stuetzle (1981) to build the M term model.

2. Use a backwards stepwise approach to fit down to the m term model. For i = 1, ..., M - m:
   Rank the terms from most (term 1) to least important (term M - i + 1) as measured by [2.10]. Discard the least important term. Use an alternating procedure to minimize [2.6] over the remaining M - i terms.
   a. For k = 1, ..., M - i: Update the term's parameters \beta_k, \alpha_k and curve f_k assuming the other terms are fixed. Choosing from among several steplengths, do a single Gauss-Newton step to find the new direction \alpha_k. Complete the iteration by updating \beta_k and f_k using [2.8] and [2.9] respectively. Only one Gauss-Newton step is taken for each iteration due to the expense of a step. Continue iterating until the objective stops decreasing sufficiently.
   b. If the objective decreased on the last complete loop through (a), move the model. If the objective decreased sufficiently, perform another pass (GOTO a). Otherwise, the optimization of the M - i term model is complete.

3. Let \lambda_0 = 0 and call the m term model resulting from Steps 1 and 2 the \lambda_0 model. Choose a sequence of interpretability parameters (\lambda_1, \lambda_2, ..., \lambda_I). Let i = 1.

4. The objective function is [2.11]. Set \lambda = \lambda_i and solve for the m term model using a forecasting alternating procedure and the \lambda_{i-1} model as the starting point. Reorder the terms in reverse order of importance [2.10], from least important (term 1) to most important (term m). Make a move in the best direction possible.
   a. For k = 1, ..., m: Choosing from among several steplengths, update the \alpha_k resulting from the best step in the steepest descent direction. Complete the iteration by updating \beta_k and f_k using [2.8] and [2.9] respectively. Only one steepest descent step is taken due to the expense of a step. Always perform at least one iteration and then continue iterating until the objective stops decreasing sufficiently.
   b. If only one loop through the terms (a) has been completed, move the model regardless of whether the objective decreases or not and perform another pass (GOTO a). If more than one loop has been completed and the objective decreased, move the model. If more than one loop has been completed and the objective decreased sufficiently, perform another pass (GOTO a). Otherwise, the optimization of the m term model with interpretability parameter \lambda_i is complete.
   c. If i = I, EXIT. Otherwise, let i = i + 1 and GOTO 4.

The forecasting alternating procedure in Step 4 differs from the alternating procedure in Step 2. In the latter case, determining the optimal direction for a specific term reduces to a least squares problem, as noted in Section 2.1.2. While this problem could not be solved analytically and had to be iterated, the least squares form lent itself to a special procedure (Gauss-Newton) which utilizes the estimated Hessian. However, with the addition of the interpretability term, the objective [2.11] no longer has this form. Thus, a cruder optimization method must be used. Steepest descent is employed, and the step direction is the negative of the gradient of the objective. The gradients are given in Appendix A. Unfortunately, this method is not as accurate as Gauss-Newton.

The Step 4 approach is also different in that an attempt is made to forecast the effect a simplification of one term will have on the others. In the original algorithm as described in Step 2, each term is considered separately. A term is not updated unless the objective decreases as a result. While this approximation ignores the interaction of one term with the others, it is made for ease in calculation. However, for [2.11] this approach does not work because of the strong interaction between the terms due to the simplicity index S. For example, suppose a change in term one does not produce a decrease in the objective; scenarios exist in which the resulting shift in term two produces a decrease. Such a sequence of moves is not considered in the original algorithm. Using this 'look one step ahead' approach in interpretable projection pursuit regression is essential: the contribution to the simplicity index for an individual term must be very large before the term changes on its own; however, once it changes, the other terms quickly follow suit, like a row of dominoes. The result is that if the Step 2 approach is used, the algorithm produces either very complex or very simple models but none in between.

Thus, a compromise is reached based on empirical evidence. The first change is due to the fact that all terms contribute equally to the interpretability measure, irrespective of their relative importance [2.10]. Thus, when considering a simplification of the model, the term which least affects the model fit should be considered first, as it does as well as any other at increasing interpretability. As a result, the terms are looked at in reverse order of importance (Step 4). The second change is that the algorithm moves the entire set of m terms at once as opposed to one term at a time. A move is not evaluated until it is formed from a sequence of submoves, each of which is a shift of an individual term (Step 4a). These submoves are made in reverse order of goodness-of-fit importance and in the best direction possible, the steepest descent one. The third change is that the algorithm always moves the model, even if the move appears to be a poor one. In some respects this jiggling is a form of annealing (Lundy 1985). The minimum steplength in Step 4a is positive yet quite small, so large unwelcome increases in the objective are impossible. Both the second and third changes are attempts at forecasting the effect of the simplification of one term on subsequent terms. The given algorithm works well in practice, as is demonstrated in the next section. Occasionally, the requirement that the algorithm always move results in a clearly worse model. However, the algorithm quickly adjusts for the next value of \lambda. This behavior is seen in the following example.

2.3 The Air Pollution Example

The example in this section concerns air pollution. The data is analyzed using additive models in Hastie and Tibshirani (1984) and Buja et al. (1989), and using alternating conditional expectations (ACE) in Breiman and Friedman (1985). It consists of 330 (n) observations of one response variable (Y) and nine (p) independent variables each. The daily observations were recorded in Los Angeles in 1976. The variables are

    Y   : ozone concentration
    X_1 : Vandenburg 500 millibar height
    X_2 : windspeed
    X_3 : humidity
    X_4 : Sandburg Air Force Base temperature
    X_5 : inversion base height
    X_6 : Daggott pressure gradient
    X_7 : inversion base temperature
    X_8 : visibility
    X_9 : day of the year

As in exploratory projection pursuit, all of the variables are standardized to have mean zero and variance one before projection pursuit regression is applied. As suggested by Friedman (1984a), the original algorithm (Steps 1 and 2 in Section 2.2.3) is run for a large value of M initially. For this example, M is chosen to be nine (p). The algorithm produces all the submodels with number of terms m = 1, ..., M by backstepping from the largest. The inaccuracy of each model is measured as the fraction of variance it cannot explain. As noted in the introduction, inaccuracy in this thesis denotes lack of fit as measured with respect to the data rather than to the population in general. From [2.6], this fraction is defined as

    U \equiv \frac{ L_2(\beta, \alpha, f, X, Y) }{ Var[Y] } .

The plot of the number of terms and fraction of unexplained variance of each model is shown in Fig. 2.1.

Fig. 2.1  Fraction of unexplained variance U versus number of terms m for the air pollution data. Slight 'elbows' are seen at m = 2, 5, 7.

Using the original approach, the statistician chooses the model from this plot by weighing the increase in accuracy against the additional complexity of adding a term. She generally chooses a model at an 'elbow' in the plot, where the marginal increase in accuracy due to the next term levels off. In some situations, such an elbow may not exist or it may not be a good model choice in actuality. Only one model for each number of terms m is found; the model space is one dimensional.

Interpretable projection pursuit regression expands the model space for a particular number of terms m to two dimensions by adding an interpretability measure. The starting point for the algorithm search for a given m is the model shown in Fig. 2.1. Then, for a sequence of \lambda values which signify an increasing weight on simplicity, the algorithm cuts a path through the model plane [U x S_3].

Lubinsky and Pregibon (1988) discuss searching model space in general. Their premise is that a formalization and description of this action provides the structure by which such a search can be automated. Their work is based on Mallows' (1983) work, in which the space characterization is more comprehensive than the two dimensional U and S_3 summary description given above. They agree that two important dimensions are accuracy and parsimony. In their work, the latter concept is an extension of the usual number of parameters measure and is in the spirit of interpretability as defined in this thesis. It includes both 'the conciseness of the description and its usefulness in conveying information.'

Initially, for this and other examples, the interpretability parameter \lambda sequence is (0.0, 0.1, ..., 1.0). However, the usual result is a path through the model space which consists of a few clumps of models separated by large breaks in the path. Even the forecasting nature of the algorithm described in Section 2.2.3 cannot completely eliminate these large hops between model groups. In order to produce a smoother curve, the statistician is advised to run the algorithm with additional values of \lambda specifically chosen to produce a more continuous path. For example, if on the first pass the path has a large hole between the \lambda = 0.3 and \lambda = 0.4 models, the algorithm should be run with additional values of (0.33, 0.36, 0.39). Using this strategy, twenty models are produced for each value of m = 1, ..., 9. The actual \lambda values used are not shown in the following figures as their values are not important. The interpretability parameter is solely a guide for the algorithm through the model space.

Various diagnostic plots can be made of the collection of models, which are distinguished by their m, U and S_3 values. Chambers et al. (1983) provide several possibilities. Given that the number of terms variable m is discrete, a partitioning approach is used. Partitioning plots are shown in Figs. 2.2 and 2.3. Each point in a plot represents a model with interpretability S_3 and inaccuracy U. Ideally, for the best comparison, these plots should be lined up side by side.

Fig. 2.2  Model paths for the air pollution data for models with number of terms m = 1, ..., 6. Each point indicates the interpretability S_3 and fraction of unexplained variance U for a model with the given number of terms.

Fig. 2.3  Model paths for the air pollution data for models with number of terms m = 7, 8, 9. Each point indicates the interpretability S_3 and fraction of unexplained variance U for a model with the given number of terms.

However, note that though the S_3 scales superficially appear to be the same for all the graphs, they are not, as implicit in each plot is the number of terms m. A symbolic scatterplot, in which all models are graphed in one plot of fraction of unexplained variance U versus interpretability S_3 with a particular symbol for the number of terms, also obscures the fact that the simplicity scales are dependent on the number of terms in the model.

As interpretability increases, so does the inaccuracy of a model. For most values of m, the path through the model space is 'elbow-shaped', indicating that initially a large gain in interpretability is made for a small increase in inaccuracy. The curves shift to the left as m increases, as the additional terms decrease the overall inaccuracy of all possible models. For all values of m, the \lambda = 1 models have the same inaccuracy. These models have all directions parallel and equal to an e_j, so in effect they only have one term of one variable.

Due to the forecasting algorithm employed, the path through the model space is not always monotonic. Occasionally, a clearly poor move, resulting in smaller interpretability and larger inaccuracy, may be needed to force the algorithm out of a local minimum; usually on the next step, the algorithm readjusts. Such non-monotonic behavior is evident, for example, for m = 3. The algorithm does not find the global minimum for a particular value of \lambda, so intermediate values of interpretability and inaccuracy are possible for a given number of terms. In contrast, linear regression variable selection methods usually find the minimum inaccuracy for a fixed interpretability.

The draftsman's display in Fig. 2.4 shows how the number of terms m, inaccuracy U and simplicity S_3 vary. Again, note that plotting all the models against the same interpretability scale is misleading; more, a model with fewer terms is usually considered simpler than one with more, which must be kept in mind in determining the model choice. This set of plots is useful because it helps describe the models from which to choose if certain requirements must be met, such as U <= 0.20, S_3 >= 0.50, or some combination thereof. More formally, the statistician can define equivalence classes of models from the draftsman's plots, consisting of sets of models with different numbers of terms which satisfy certain inaccuracy and interpretability criteria.

Fig. 2.4  Draftsman's display for the air pollution data. All possible pairwise scatterplots of number of terms m, fraction of unexplained variance U and interpretability S_3 are shown.

The statistician must now choose a model. The first term explains the bulk of the variance, approximately 75% (U = 0.25). For models with m <= 6, even moderate interpretability (S_3 >= 0.40) cannot be achieved without crossing the 0.20 inaccuracy threshold (U >= 0.20), as seen in Fig. 2.2. If a model which explains more than 80% of the variance is required, a seven term model with S_3 = 0.60 is possible. However, its advantages over the original term model (Fig.
as opposed to other and Shahshahani by the model remains versus ordinary nonlinear (1984) show [2.5] for a large enough to be done In addition, re- with the numerical respect to the aspects of the as discussed in Section 2.2.3. The Algorithm The parameters pj, oj and the functions min fj are estimated by minimizing La@, a, f, x, Y) @j,CYj,fj:j=l,...VZ 3 QTaj = 1 WI JVjl = 0 and Var[fj] The criterion However, for. algorithm k= if certain Friedman results l,.. [2.6] cannot be minimized simultaneously ones are fixed, the optimal (1985) employs are discussed in Section . , m. The problem for all the parameters. optimization as they are pertinent 2.2.3. . values of others are easily solved such an ‘alternating’ in this section is considered j = 1,. . . ,m = 1 First, strategy. His when the modified he considers a specific term k, [2.6] may be written min -Wk - Bkfi;(~TX)12 pk#k,fk WI where Rk s j#k For the kth term, in turn while the three sets of parameters all others are held constant. have been found, the next term is considered. After /?k, ok, and fk are estimated all elements The algorithm of the kth term cycles through the 2.1 The Original Projection in the model until terms alternating strategy The minimizing Pursuit Regression Technique the objective Page 61 in [2.6] d oes not decrease sufficiently. The is discussed in more detail in Section 2.2.3. ,Bk is P-81 The minimizing This function estimate is found fk for any particular using the nonparametric man (198413). Th e resulting constraints. direction procedure. squares problem as possible. Minimizing but rather as an the criterion and requires an [2.6] as a function of ok is a least- In applying than use a method of applying these methods, an iterative search procedure if the function to be minimized simplifies and is easily approximated (Gill determining optimization is of least-squares fact and uses the Gauss-Newton first deriva- Hessian is preferable. specifically et al. 1981). to about the function which only employs the Hessian, outweighs their additional However, on’this directly which also uses the actual or approximate the difficulty rately estimating ak. function the goal is to use as much information Thus, rather tives, a method italizes to satisfy mean and variance ok cannot be determined as seen in [2.7]. a function, Usually curve is standardized discussed in Fried- value for each observation. The minimizing minimize smoother It is not expressed as a mathematical estimated iterative point azx is or accu- properties. form, the Hessian Friedman (1985) cap- procedure to find the optimal 2. Interpretable 2.1.3 Projection Model Selection Strategy Model selection in the original ing the number Page 62 Pursuit Regression projection pursuit of terms m in the model. gression considers not only the number regression consists of choos- Interpretable projection pursuit re- of terms but also the interpretability of those terms. Friedman M. (1985) suggests starting The procedure for finding the algorithm this model is discussed in Section 2.2.3. A model with M - 1 terms is then determined the previous with a large number of terms model as a starting using the M - 1 most important point. The importance terms in of a term k in a model of m terms is defined as Ik E fjj k [2.10] 1 where ]/3l] is the maximum absolute parameter. Since the functions to have variance one, [@J-lmeasures the contribution strained fj are con- of the term to the model. 
The statistician may then plot the value of the objective in [2.6], which is the model’s residual sum of squares, versus the number of terms for each model. In most cases, the plot has an elbow shape. The usual advice is to choose that model closest to the tip of the elbow, where the increase in accuracy additional 2.2 term is not worth The Interpretable A projection number of terms functions fi. m = the increased complexity. Projection pursuit regression 1, . . . , M. Since these functions they are visually Pursuit analysis combination represents. The parameters Regression produces Approach a series of models The models’ nonlinear components are smooths and not functionally assessed by the statistician. associated due to the Each is considered aj in order to understand ,f3j measure the relative what with are the expressed, along with its aspect of the data it importance of the terms. 2.2 The Interpretable Each direction Projection oj must be considered in the context of the original variables p. The collection such as the correlation of combinations constraint regression technique, of variables similar to those in Section selection before, a weighted penalty criterion in projection balancing The simplicity 1.2 show that pursuit As with any a combinatorial criterion The minimization _ problem index measures the interpretability since the objective 1nterpretabilit.y that the number of terms m is fixed. This assump- tion is discussed in Section 2.2.2. The goal is define an interpretability am). At first glance, the situation foragroupofmvectors(ai,cy2,..., exploratory two vectors (PI, ,&) is measured. a plane. which, In relation when normed in the latter the simplest index S is a gen- where the simplicity of case, the vectors define pairs of combinations are ones This type of orthogonality is in Section 2.3.5. Measures of this squared orthog- and each vector’s individual Within each combination, variables index pursuit and squared, are orthogonal. onality interpretability projection However, to each other, named squared orthogonality of Index Consider for the moment of interpretable to be in the next two subsections. The Interpretability eralization [2X] of the collection is minimized. an becomes , Q2 T.--T a m ) xqcrl As [2.6] with in the first term causes both contributions and is subtracted ap- regression is not feasible. min La index choices are.discussed 2.2.1 pursuit. is desirable for parsimony. L2(P)crJJm for X E [0, l]. Th e d enominator combinations projection the goodness-of-fit is employed. (1 -A) min /3j,Crj,fj:j=l,..., TTZ E [O,l]. in exploratory a variable selection method which causes the same subgroup proach to variable interpretability number of is not subject to any global restriction to appear in all the combinations Arguments Page 63 Pursuif Regression Approach interpretability enter the index are selected via the single vector Sr [1.12] by encouraging the vector coefficients S, [1.18]. varimax to vary. 2. Interpretable Different variables Page 64 Pursuit Regression Projection are selected for either vector by forcing the combinations to exclude the same variables. Projection pursuit be simple within is penalized. interest regression has a different itself. That is, homogeneity The varimax each vector of the coefficients within should a vector index Sr for one vector achieves this. However, in the of decreasing the total number of variables in the model, the same small subset of variables of each direction variables should be nonzero in each vector. 
Summing as in [ 1.171 d oes not force the vectors necessarily. On the other hand, if the exploratory the simplicities to include to squared orthogonality and would contain may be used for projection if variable pursuit selection is the object, Consider the m X p matrix the directions different the same projection measure [l ,161 for q = m vectors is used, the vectors would varimax Though goal. First, variables. pursuit be forced This old index regression to achieve this outcome. However, a new index must be developed. of combinations aj are constrained defined for all sets of combinations to have length one in [2.6], the index is in general. The goal is that all the combina- tions are nonzero or load into the same columns so that the same variables in all combinations. are An index’ which forces this outcome is based on a summary vector y of the matrix 52 whose components are 2 3;’ - 2 j=l Each component tribution yi is positive to variable 5; i=l,...,p * [2.12] “T~j and contains the sum of each term’s i, or the total column weight. The components relative con- of the new 2.2 The Interpretable Projection Pursuit Regression Approach Page 65 vector sum to m. The object is to force these components sible. For a single combination, index S, [1.12]. to be as varied as pos- this is achieved via the varimax For y appropriately normed, this measure is = $-&py WY) interpretability . 12.131 84 The index function [2.13] f orces the weight of all the combinations, f2 Sd O!y1,(Y2,...,~m)=- The function [F of the columns 1)5 5 g-&l waJ+2(pP j#k j=l is in contrast to force the squared orthogonality interpretability of each combination. exploratory projection over the terms is subtracted of each column to be dissimilar. of the each total column weight over the rows of the matrix, of the model, in order of the combinations. The index forces the overall weights dispersion +m;;:l) * i=l to S, for interpretable [1.18], in which the cross-product pursuit As a this index is called Ss and may also be written Sr measures the individual This re-expression of R to vary widely. depends on the goodness-of-fit criterion, An example The or terms using this index is discussed in Section 2.3. 2.2.2 Attempts to Include In the discussion argument that so certain. with However, assessed. of Terms of terms m in the model is fixed. fewer terms is more interpretable on further Each term involves The interpretability be visually so far, the number a model -more persuasive. the Number reflection, its combination of the combination this conclusion The than one with does not appear oj and the smooth function can be measured but the function fj. must 2. Interpretable Projection Pursuit Consider the following out knowing Regression example with two variables Xi and X2 (p = 2). With- the data context, ranking (1) n-l = in order of interpretability function Situations of widely is impossible. varying would be difficult f2(X2) The first model involves two functions combination. of the most complex easily can be imagined more interpretable. The second involves only one combination in which clearly For example, if the two variables traits (apples vs. to understand models f(X1ix2) each, the simplest consisting the two fitted 2, fl(&), (2)m=l, of one variable Page 66 oranges), of variables, the average. one or the other are distinct model measurements then a combination of the two and Model (1) would be preferable. 
However, if the two were aggregate measures of similar characteristics (reading and spelling scores) which could be combined easily into a single variable (verbal ability), second model m ight be easier to interpret. the number As with tion pursuit statistician weight number interpretable is equipped exploratory with projection pursuit, the algorithm interpretable as an interactive dial. process in which the and the model becomes more simple. is included should drop terms as it simplifies three methods considered. Factors required fortunately, none of the resulting projec- As the dial is turned, of terms to begin with is m and this parameter discussion, ways to include measure of a model is enlightening an interpretability (X) in [2.11] is increased the following the are unsuccessful. regression can be envisioned sure of simplicity, below. However, considering of terms in the interpretability even if present attempts is for including the If the in the meathe model. In m in the index [2.13] are to put the index values E [0, l] are ignored. indices works in practice Un- for the reasons given 2.2 The Interpretable Projection Pursuit Since the interpretability %rst attempt Regression Page 67 Approach of a model decreases with the number of terms, the is to multiply the interpretability index [2.13] by A, obtaining Sa(CY1,Cr2,. . .) CXm)E -;[$-&-;)2] . . t=l The resulting index Sa decreases as the number of terms increases. However, this measure does not work in practice as each term’s contribution is reweighted when the number of terms changes. Instead, the index should be such that submodels of the current model measured contribute the same to the index, regardless of the size of the complete model. Each model can be thought number of points Section 2.3.4. increases. The total increases as the number contain of as a point Consider E BP, As terms are added, the a distance interpretability of distance from a set of points of points the sarne variables, Since the object does. a plausible interpretability Sb(al,Q2,-..,~m)-- ri$ el[Vc~~ index as in to a particular is that point all the terms index is -Vll12 . j=l As in Chapter 1, the set V is composed of simple vectors squared and normed one dimensional [1.14] and vaj is the version of the oj term [1.13]. This index is similar index [1.15]. The m inimum j but of the sum of distances the same interpretable to the is not taken of each individual in order to ensure that term all the terms simplify to vector ~1. If the set V consists of the cl’s (1 = 1,. . . ,p), Sb reduces to Pm 2 "ji - T cc i=l . where yk is the largest with the removal discontinuous. j=l ~j component of the m inimization, - 2-/k + m tyj as defined in [2.12]. the derivatives Unfortunately, of the given index even are Page 68 2. Interpretable Projection Pursuit Regression Averaging which all the possible is continuous. toward distances vectors the closest ~1 overpowers Unfortunately, though continuous all the terms to collapse toward In conclusion, the number sion model cannot selection the interpretability linear regression procedure is poor. to explore 2.2.3 The strategy term model it. The outlook Instead, should procedure projection than a strict themselves situation with model regresin which between model pursuit of measur- pursuit the tradeoff for an analogy interpretable the number projection controls inter- selection in regression selection compare these models is discussed is described in the next subsection. 
is a pro- in Section Procedure a large number of models of terms m = 1,. . . , M, the following models for different X. Steps 1 and 2 find the original which toward and smoothly to a one parameter X completely to find a sequence of interpretable ity parameter be less than of the terms such a method, is to begin with model with number at incorporating and the interpretability be reduced The Optimization T must simplifies and pulls the term for simultaneously the model space, rather cess. How the statistician 2.3. The optimization As a result, attempts A method parameter and accuracy. each term as opposed to Sa, the index S, does not force Without yet. pretability index the same simple vector. of terms has not been found ul. the others none of the three terms into the index works. ing both a third The power r must be chosen so that one of the interpretable one so that produces minimizes [2.6]. M. For each sub- algorithm is employed values of the interpretabil- projection pursuit regression m Steps 3 and 4 find the sequence of models which minimize [2.11] for various X values. Throughout the description, updating a parameter means noting the new value and basing subsequent dependent Page 69 2.2 The Interpretable Projection Pursuit Regression Approach calculations Moving a model means permanently on it. changing its parameter values. 1. The objective function lined in Friedman 2. Use a backwards is [2.6]. Use the stagewise modeling and Stuetzle procedure out- (1981) to build the M term model. stepwise approach to fit down to the m term model. Fori=l,...,M-m begin Rank the terms from most (term 1) to least important as measured by [2.10]. D iscard the least important Use an alternating procedure to minimize (term M - i + 1) term. the remaining M - i terms. a. Fork=l,...,M-i begin Update the term’s parameters other terms are fixed. respectively. step to find the new direction by updating pk and fi; using Only one Gauss-Newton ation due to the expense of a step. objective the Choosing from among several steplengths, do a single Gauss-Newton plete the iteration pk, Ok and curve fi; assuming ok. Com- [2.8] and [2.9] step is taken for each iterContinue iterating until the stops decreasing sufficiently. end b. If the objective decreased on the last complete loop through (a), move the model. If the objective (GOT0 a.). Otherwise, another pass term model is complete. end decreased sufficiently, the optimization the terms perform of the M - i 2. Interpretable Projection Pursuit Regression Page 70 3. Let X0 = 0 and call the m term model resulting model. Choose a sequence of interpretability from Steps 1 and 2 the X0 parameters (Xl, X2, . . . , XI). Let i = 1. 4. The objective is [2.11]. function model using a forecasting starting Set X = X; and solve for the m term alternating procedure and the Xi-1 model as the point. Reorder the terms in reverse order of importance 1) to most important [2.10] from least (term (term m). Make a move in the best direction possible. a. For k = 1,. . . , m begin Choosing from among several steplengths, update the crk resulting from the best step in the steepest descent direction. iteration by updating the ,& and fk using [2.8] and [2.9] respectively. Only one steepest descent direction of a step. Always perform iterating Complete step is taken due to the expense at least one iteration until the objective and then continue stops decreasing sufficiently. end 6. 
If only one loop through the terms the model regardless of whether perform another completed pass (GOT0 and the objective one loop has been completed perform another pass (GOT0 Otherwise, the objective move decreases or not and a). If more than one loop has been decreased, move the model. and the objective a). Otherwise, m term model with interpretability c. If i = I, EXIT. (a) has been completed, parameter If more than decreased sufficiently, the optimization X; is complete. let i = i + 1 and GOT0 4. of the 2.2 The Interpretable Projection The forecasting Pursuit alternating Regression procedure in Step 4 differs from the alternating procedure in Step 2. In the latter case, determining specific term reduced to a least-squares problem this problem could not be solved analytically Hessian. direction (Gauss-Newton) However, with the addition of the objective. as Gauss-Newton. The gradients utilized of the interpretability term the effect a simplification algorithm as described not updated between for ease in calculation. However, sacrificing accuracy a decrease in the objective. shift in term two produces index. S, and the equality The increase in interpretability in the solution among term for an individual a decrease. Such a algorithm. does not work because of the strong interaction index This approach However, possibly this change of moves is not considered in the original in the simplicity A term is this approximation / For example, suppose a change in term Using this ‘look one step ahead’ approach in interpretable regression In the original scenarios exist in which the global m inimum. in term one and the resulting combination decreases as a result. the terms, A. is made to forecast in Step 2, each term is considered separately. ignores the interaction one does not produce is the negative are given in Appendix in that an attempt function method is not as accurate of one term will have on the others. unless the objective does not produce this method of the objective The Step 4 approach is also different the least- Thus, a cruder optimization Unfortunately, While which must be used. Steepest descent is employed and the step direction of the gradient al; for a and had to be iterated, [2.11] no longer has this form. the objective the optimal as noted in Section 21.2. squares form lent itself to a special procedure the estimated Page 71 Approach projection between pursuit the terms contributions to the term must be very large before it changes on its own. However, once it changes, the other terms quickly follow suit, like a row of dominoes. the algorithm between. produces The result is that if Step 2 approach is used, either very complex or very simple models but none in 2. Inierpretable Projection Pursuit Thus, a compromise is due to the fact that sure, irrespective simplification Regression Page 72 is reached based on empirical all terms contribute of their relative evidence. The first change equally to the interpretability importance mea- 12.101. Thus, when considering a of the model, that term which least affects the model fit should be considered first as it does as well as any other at increasing interpretability. As a result, the terms are looked at in reverse order of import ante (Step 4). The second change is that the algorithm once as opposed to one term at a time. moves the entire set of m terms at A move is not evaluated until it is formed from a sequence of submoves, each of which is a shift of an individual 4~). 
These submoves are made in reverse order of goodness-of-fit are made in the best direction The third move appears to be a poor one. 4a is positive (Lundy importance and possible, the steepest descent one. change is that the algorithm a form of annealing term (Step always moves the model, even if the In some respects this jiggling 1985). Th e m inimum yet quite small, so large unwelcome steplength of the model is considered in Step increases in the objective are impossible. Both the second and third of the simplification of one term works well in practice requirement that changes are attempts on subsequent as is demonstrated the algorithm results. However, the algorithm behavior is seen in the following at forecasting terms. quickly example. The given algorithm in the next section. always move means that the effect Occasionally, the a clearly worse model adjusts for the next value of X. This 2.3 The Air Pollution 2.3 The Air Pollution The example using additive Page 73 Example Example in this section concerns air pollution. models in Hastie and Tibshirani and using alternating conditional expectations (1985). It consists of 330 (n) o b servations (p) independent variables each. The data (1984) and Buja et al. (1989), (ACE) in Breiman and Friedman of one response variable The daily is analyzed observations (Y) and nine were recorded in Los Angeles in 1976. The variables are Y : ozone concentration x1 : Vandenburg 500 m illibar height x2 : windspeed x3 : humidity x4 : Sandburg Air Force Base temperature xg : inversion base height x6 : Daggott pressure gradient x7 : inversion base temperature X8 : visibility xg : day of the year As in exploratory projection pursuit, all of the variables are standardized mean zero and variance one before projection As suggested by Friedman chosen to be nine (p). The algorithm M by backstepping model is measured the introduction, as the fraction inaccuracy respect to the data rather fraction regression is applied. (1984a), the original Section 2.2.3) is run for a large value of M terms m = l,..., pursuit to have algorithm initially. (Steps 1 and 2 in For this example, M is produces all the submodels with number of from the largest. of variance it cannot The inaccuracy explain. of each As noted in in this thesis denotes lack of fit as measured with than to the population in general. From [2.6], this is defined as u - J52MdJJ) k(Y) * The plot of the number of terms and fraction of unexplained variance of each model is shown in Fig. 2.1. 2. Interpretable Projection Page 74 Pursuit Regression m Fig. 2.1 Fraction of unexplained variance U versus number of terms m for the air pollution data. Slight ‘elbows’ are seen at m = 2,5,7. Using the original by weighing approach, the statistician chooses the model from this plot the increase in accuracy against the additional chooses a model complexity of adding at an ‘elbow’ in the plot, where the a term. She generally marginal increase in accuracy due to the next term levels off. In some situations, such an elbow may not exist or it may not be a good model choice in actuality. Only one model for each number model space is one dimensional Interpretable particular. . measure. projection number of terms m is found. pursuit regression expands point for each simplicity the model the algorithm space for a by adding an interpretability search for a given m is the model shown in Fig. 2.1. Then for a sequence of X values which signify weight on simplicity, m, the in U. 
Lubinsky and Pregibon (1988) discuss searching the model space in general. Their premise is that a formalization of this action provides the structure by which such a search can be automated. Their descriptive space characterization, which is based on Mallows' (1983) work, is more comprehensive than the two dimensional U and S_g summary given above. They agree that two important dimensions are accuracy and parsimony. In their work the latter concept is an extension of the usual number of parameters measure, and is in the spirit of interpretability as defined in this thesis: it includes both 'the conciseness of the description and its usefulness in conveying information.'

Initially, for this and other examples, the interpretability parameter sequence is λ = (0.0, 0.1, ..., 1.0). However, the usual result is a path through the model space which consists of a few clumps of models separated by large breaks. Even the forecasting nature of the algorithm described in Section 2.2.3 cannot completely eliminate these large hops between model groups. In order to produce a smoother path, the statistician is advised to run the algorithm with additional values of λ specifically chosen to produce a more continuous curve. For example, if on the first pass the path has a large hole between the λ = 0.3 and λ = 0.4 models, the algorithm should be run with the additional values (0.33, 0.36, 0.39). Using this strategy, twenty models are produced for each value of m = 1, ..., 9. The actual λ values used are not shown in the following figures, as their values are not important: the interpretability parameter is solely a guide for the algorithm through the model space.

Various diagnostic plots can be made of the collection of models, which are distinguished by their m, U and S_g values. Chambers et al. (1983) provide several possibilities. Given that the number of terms m is a discrete variable, a partitioning approach is used. Partitioning plots are shown in Figs. 2.2 and 2.3. Each point in a plot represents a model with interpretability S_g and inaccuracy U. Ideally, for the best comparison, these plots should be lined up side by side.

Fig. 2.2 Model paths for the air pollution data for models with number of terms m = 1, ..., 6. Each point indicates the interpretability S_g and fraction of unexplained variance U for a model with the given number of terms.
Fig. 2.3 Model paths for the air pollution data for models with number of terms m = 7, 8, 9. Each point indicates the interpretability S_g and fraction of unexplained variance U for a model with the given number of terms.

However, note that though the S_g scales superficially appear to be the same for all the graphs, they are not, as implicit in each plot is the number of terms m. A symbolic scatterplot, in which all models are graphed in one plot of unexplained variance U versus interpretability S_g with a particular symbol for the number of terms, also obscures the fact that the simplicity scales are dependent on the number of terms in the model.

As interpretability increases, so does the inaccuracy of a model. For most values of m, the path through the model plane is 'elbow-shaped', indicating that initially a large gain in interpretability is made for a small increase in inaccuracy. The curves shift to the left as m increases, as the additional terms decrease the overall inaccuracy of all possible models. For all values of m, the λ = 1 models have the same inaccuracy: these models have all directions parallel and equal to an e_j, so in effect they only have one term of one variable.

Due to the forecasting algorithm employed, the path through the model space is not always monotonic. Occasionally the algorithm moves to a clearly worse model, as evidenced by a non-monotonic move resulting in smaller interpretability and larger inaccuracy; usually the algorithm readjusts on the next step. This type of behavior is evident for m = 3 at intermediate interpretability values S_g ∈ [0.2, 0.4]. The algorithm does not find the global minimum for a particular value of λ, and a clearly poor move may be needed to force the algorithm out of a local minimum. In contrast, linear regression variable selection methods find the model which minimizes inaccuracy for a fixed interpretability, the number of variables.

The draftsman's display in Fig. 2.4 shows how the number of terms m, inaccuracy U and simplicity S_g vary. Again, note that if a model with fewer terms is considered simpler than one with the same interpretability value but more terms, plotting all the models against a common interpretability scale is misleading. This set of plots is useful in determining the models from which to choose if certain requirements must be met, such as U ≤ 0.20, S_g ≥ 0.50, or some combination thereof. More formally, the statistician can define equivalence classes of models from the draftsman plots, consisting of sets of models with different numbers of terms which satisfy certain inaccuracy and interpretability criteria; a sketch of such screening follows below.

Fig. 2.4 Draftsman's display for the air pollution data. All possible pairwise scatterplots of number of terms m, fraction of unexplained variance U and interpretability S_g are shown.

The statistician must now choose a model. The first term explains the bulk of the variance, approximately 75% (U = 0.25). For models with m ≤ 6, even moderate interpretability (S_g ≥ 0.40) cannot be achieved without crossing the 0.20 threshold (U ≥ 0.20), as seen in Fig. 2.2. If a model which explains more than 80% of the variance is required, a seven term model with S_g = 0.60 is possible; however, its advantage over the original seven term model of Fig. 2.1 is debatable, as it explains slightly less of the variance.
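The requirement-driven screening just described is easy to mechanize once the swept models are collected. A minimal sketch, assuming each model is summarized by its (m, U, S_g) triple; the thresholds and the example values below are illustrative, not prescriptions from the thesis.

```python
from typing import NamedTuple

class ModelSummary(NamedTuple):
    m: int      # number of terms
    U: float    # fraction of unexplained variance (inaccuracy)
    S: float    # interpretability index S_g

def equivalence_class(models, max_U=0.20, min_S=0.50):
    """Return the models meeting both screening criteria,
    grouped by number of terms, smallest models first."""
    keep = [md for md in models if md.U <= max_U and md.S >= min_S]
    return sorted(keep, key=lambda md: (md.m, md.U))

# Example: screen a small collection of swept models.
models = [ModelSummary(2, 0.23, 0.85), ModelSummary(3, 0.21, 0.40),
          ModelSummary(7, 0.19, 0.60)]
print(equivalence_class(models))   # only the seven term model passes
```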
If inaccuracy U close to 0.25 is acceptable, simple two or three term models are possible. For example, a three term model exists with S_g = 0.40 and U = 0.21. Alternatively, a two term model exists with S_g = 0.85, U = 0.23 and combinations

β_1 = 0.21,  α_1 = (0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)^T,
β_2 = 1.10,  α_2 = (0.0, 0.0, 0.3, 0.9, 0.0, −0.2, 0.0, 0.0, 0.1)^T.

The fourth variable, Sandburg Air Force Base temperature, is the most influential, as is cited in other analyses of this data (Hastie and Tibshirani 1984; Breiman and Friedman 1985). The last variable, day of the year, also has an effect. As discussed in Sections 1.5.2 and 3.4.3, the convergence of the coefficients is slow, and an interpretability index with a power less than two may be warranted.

The projection pursuit regression model is different in form from the additive and alternating conditional expectation models, as it includes smooths of several variables rather than of one variable. Thus the three models cannot be compared functionally; however, the three inaccuracy measures are comparable.

As remarked in Section 2.2.2, the inclusion of the number of terms m in the gauging of a model's interpretability is subjective and depends on the context of the data. In addition, one model may be more understandable than another because of the functions in the models. Though each of the curves is a smooth and therefore does not have a functional description, a curve may have a clear quadratic form, for example, and the statistician views the model as explainable. Similar to the weighing of the number of terms, this qualitative assessment is a further subjective notion of interpretability which is not automated.

In contrast to exploratory projection pursuit, projection pursuit regression is a modeling procedure. As such, an objective measure of predictive error may be applied to choose between models. Methods such as cross-validation may be used to produce an unbiased estimate of the prediction error of a model. A good strategy when choosing a model is to select a small number of models based on the above procedure and then distinguish between them using a resampling method. Unfortunately, due to a lack of identifiability which results from the fact that the number of terms cannot be included in the interpretability index, the cross-validation procedure is not a one parameter (λ) minimization problem: a subjective measure of how the model's complexity increases with an additional term must be made. Thus, rather than a strict modeling technique, interpretable projection pursuit regression is an exploratory modeling technique.
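As an illustration of the resampling strategy just described, the following sketch compares a shortlist of fitted models by K-fold cross-validation. The `fit` callables and their `.predict` method are hypothetical stand-ins for the projection pursuit fitting routine; only the comparison logic is the point.

```python
import numpy as np

def cv_error(fit, X, y, n_folds=10, seed=0):
    """Estimate prediction error of a model-fitting routine by K-fold CV.
    `fit(X, y)` must return an object with a `.predict(X)` method."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sse = 0.0
    for hold in folds:
        train = np.setdiff1d(np.arange(len(y)), hold)
        model = fit(X[train], y[train])
        resid = y[hold] - model.predict(X[hold])
        sse += float(resid @ resid)
    return sse / len(y)

# Compare the shortlisted candidates, e.g. the two and three term models:
# errors = {name: cv_error(fit, X, y) for name, fit in candidates.items()}
```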
Chapter 3
Connections and Conclusions

A comparison of the accuracy and interpretability tradeoff method described in the previous two chapters with established variable selection techniques is warranted and interesting. In this chapter, connections with other model selection ideas are considered, and topics of future work, including other data analysis methods to which this method could be extended, are identified. The first comparison is between the proposed modification and other model space search methods for linear regression. Since other model selection procedures have not yet been proposed for projection pursuit, the interpretable modification as applied to linear regression is considered in this section. This discussion is preliminary, but the setting provides various other model search procedures whose properties are known for comparison. The last section includes an example of trading accuracy for interpretability which demonstrates the generality of the approach.

3.1 Interpretable Linear Regression

First, the notation is stated and the problem of fitting a linear regression is described as a minimization problem. Rather than random variables and expected value minimization, observed values and squared distance minimization are used. As in Chapter 2, matrix notation is employed. The observed vector Y = (y_1, y_2, ..., y_n) consists of the response values for the n observations, and the n × p matrix X has entries x_ij, the value of the jth predictor for the ith observation. If an intercept term is required, a column of ones is included. The error vector is ε and the model may be written as

Y = Xβ + ε.

The parameters β = (β_1, β_2, ..., β_p) are estimated by minimizing the squared distance between the n fitted and actual observations. The problem in matrix form is

min_β (Y − Xβ)^T (Y − Xβ).   [3.1]

The least squares estimates solve the normal equations,

β_LS ≡ (X^T X)^{−1} X^T Y.

The modification, for values of the interpretability parameter λ ∈ [0, 1], is

min_β (1 − λ) [(Y − Xβ)^T (Y − Xβ)] / [(Y − Xβ_LS)^T (Y − Xβ_LS)] − λ S(β),   [3.2]

where the interpretability index S is the single vector varimax index S_1 defined in [1.12], for example. The denominator of the first term is the minimum squared distance possible, attained at the ordinary linear regression solution [3.1]. Recall that the combination of the predictors Xβ which has maximum correlation with the response is the ordinary linear regression solution; thus, if correlation is used instead of squared distance as a measure of the model's fit, the problem becomes a maximization and the simplicity term should be added rather than subtracted.

As the interpretability parameter λ increases, the fitted vector Ŷ = Xβ moves away from the ordinary least squares fit Ŷ_LS in the space spanned by the p predictors, as shown in Fig. 3.1.

Fig. 3.1 Interpretable linear regression. As the interpretability parameter λ increases, the fit moves away from the least squares fit Ŷ_LS to the interpretable fit Ŷ_ILR in the space spanned by the p predictors.

The interpretable fits Ŷ_ILR are not necessarily the same length as the least squares one. As a variable's coefficient decreases toward zero, the fit moves into the space spanned by a subset of p − 1 predictors. The most interpretable p − 1 variables may not be the best p − 1 in a least squares sense; even if the variables are the same, the interpretable fitting search may not guide the statistician to the best least squares model. The interpretability index S attempts to pull the β_i apart, since a diverse group is considered more simple, whereas the least squares method considers only squared distance when choosing a model.

The definition of the interpretability index S as a smooth function means that it can be used as a 'cognostic' (Tukey 1983) to guide the automatic search for a more interpretable model. However, these smooth and differentiable properties are the root of the reason the equation [3.2] cannot be solved explicitly: the interpretability index contains the squared and normed coefficients, which make the objective a nonquadratic function. A linear solution to [3.2] is not possible.
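Since [3.2] has no closed form, it must be attacked numerically. Below is a minimal sketch using a general-purpose quasi-Newton routine. The varimax-style index here is an illustrative stand-in for S_1 of [1.12] (it equals 1 for a coordinate axis and 0 for a fully diffuse coefficient vector); the thesis's exact normalization may differ.

```python
import numpy as np
from scipy.optimize import minimize

def simplicity(beta):
    """Varimax-style index: 1 on a coordinate axis, 0 for equal coefficients.
    Illustrative stand-in for S1 of [1.12]."""
    b2 = beta**2 / np.sum(beta**2)
    return (np.sum(b2**2) * len(beta) - 1.0) / (len(beta) - 1.0)

def interpretable_lr(X, Y, lam):
    """Minimize (1 - lam) * RSS(beta)/RSS(beta_LS) - lam * S(beta), as in [3.2]."""
    beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss_ls = float(np.sum((Y - X @ beta_ls) ** 2))
    def objective(beta):
        rss = float(np.sum((Y - X @ beta) ** 2))
        return (1.0 - lam) * rss / rss_ls - lam * simplicity(beta)
    # Warm start at the least squares fit, the lambda = 0 solution.
    return minimize(objective, beta_ls, method="BFGS").x

# Usage: sweep lam over [0, 1) and watch the coefficients concentrate.
```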
In comparison, Mallows' (1973) C_p and Akaike's (1974) AIC selection criteria involve a count function as an interpretability index, as defined in [1.11]. As [3.2] does not admit a linear solution, the more general Mallows' C_L, which may be used for any linear estimator, cannot be determined for the interpretable estimate. For both the C_p and AIC criteria, the complexity of a model is equal to the number of variables in the model, say k. With appropriate norming, the associated interpretability index is

S_M(β) ≡ 1 − k/p.   [3.3]

The Mallows variable selection technique uses an unbiased estimate of the model prediction error to choose the best model, and the resulting search through the model space includes only one model for a given number of variables. Alternatively, the interpretability search procedure is a model estimation technique which guides the statistician in a nonlinear manner through the model space. One result of this search is variable selection, as variables are discarded for interpretability. Using this terminology, the discreteness of the criterion [3.3] means that the C_p and AIC techniques are solely variable, rather than model, selection procedures.

As is discussed in Section 1.3.4, an alternate interpretability index S_d can be defined as the negative distance from a set of simple points. The natural question is whether the modification [3.2] can then be thought of as a type of shrinkage.

3.2 Comparison With Ridge Regression

In Section 1.3.4, a distance interpretability index S_d is defined in [1.15] which involves the distances to a set of simple points V = {v_1, ..., v_J}. Before restricting its values to be in [0, 1], this simplicity measure of the coefficient vector β is the negative of the minimum distance to V. The coefficient vector is squared and normed, with components β_i²/(β^T β) compared to the simple point components v_ji, so that the relative size, not the absolute size or sign, of the elements matters; though these actions lead to a nonlinear solution, comparisons based on properties such as Schur-convexity remain possible.

Suppose for the moment that the response Y and the predictors are standardized to have mean zero and variance one. This standardization ensures that one variable does not overwhelm the others in the coefficient vector, and it partially removes the reason for normalizing the coefficient vector in the interpretability index, so that vectors of any length may be compared equally. In order to remove the need for squaring, all possible sign combinations must be considered: the simple set vectors v_j are no longer the squares of vectors on the unit R^p sphere but just vectors on it. For example, they could be ±e_j, j = 1, ..., p. The distance to each simple point is written in matrix form as (β − v_j)^T (β − v_j). If the optimization problem is reparameterized with a new interpretability parameter κ, [3.2] becomes

min_β (Y − Xβ)^T (Y − Xβ) + κ min_{j=1,...,J} (β − v_j)^T (β − v_j),   κ > 0.   [3.4]

The second minimum may be placed outside of the first, since the first term does not involve v_j, resulting in

min_{j=1,...,J} [ min_β (Y − Xβ)^T (Y − Xβ) + κ (β − v_j)^T (β − v_j) ].

The solution vector which minimizes the bracketed portion of [3.4] is

β̂ ≡ (X^T X + κI)^{−1} (X^T Y + κ v_j),   [3.5]

where I is the p × p identity matrix. This estimated vector is similar to that of ridge regression (Thisted 1976, Draper and Smith 1981) with ridge parameter κ. The ridge estimate is

β̂_R ≡ (X^T X + κI)^{−1} X^T Y.   [3.6]

Ridge regression may be examined from either a frequentist or a Bayesian viewpoint. The method is advised in situations where the matrix X^T X is unstable, which can occur when the variables are collinear. In addition, it does better than linear regression in terms of mean square error, as it allows biased estimates. If a Bayesian analysis is used, the distributional assumption is made that the error term ε has independent components, each with mean zero and variance σ². Given this assumption, a normal prior belief centered at the origin produces the Bayes rule [3.6].

The ridge estimate β̂_R [3.6] shrinks away from the least squares solution β̂_LS toward the origin as the ridge parameter κ increases. The interpretability estimate β̂ [3.5] shrinks away from the least squares solution toward the simple point v_j as the interpretability parameter κ increases; the index j may change during the procedure, however. If the varimax index [1.12] is used instead of the distance index [1.15], the result is a solution similar to [3.5] except that the shrinkage is toward the set of 2p points (±e_1, ..., ±e_p). Shrinkage toward points is the usual ridge regression terminology, so the distance index is used in the explanation above.

As Draper and Smith (1981) point out, ridge regression places a restriction on the size of the coefficients β, whether or not that restriction is called a prior. The interpretability approach [3.2] does not require any restriction on the coefficients, as it norms and squares the coefficients before examining them. As noted, however, this produces certain mathematical problems. Clearly, under certain conditions the approach can be viewed as placing a prior, and this Bayesian viewpoint is discussed in the next section.
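Before turning to the prior interpretation, here is a minimal sketch of the estimator defined by [3.4] and [3.5]: for each candidate simple point the ridge-like system is solved in closed form, and the point giving the smallest penalized criterion wins. The ±e_j simple set is the example given above; everything else is generic linear algebra.

```python
import numpy as np

def simple_point_shrinkage(X, Y, kappa):
    """Solve [3.4]: minimize over simple points v_j using the closed form [3.5],
    beta_j = (X'X + kappa I)^{-1} (X'Y + kappa v_j), v_j in {+-e_1, ..., +-e_p}."""
    p = X.shape[1]
    A = X.T @ X + kappa * np.eye(p)
    simple_points = [s * e for e in np.eye(p) for s in (+1.0, -1.0)]
    best, best_val = None, np.inf
    for v in simple_points:
        beta = np.linalg.solve(A, X.T @ Y + kappa * v)
        val = float(np.sum((Y - X @ beta) ** 2) + kappa * np.sum((beta - v) ** 2))
        if val < best_val:
            best, best_val = beta, val
    return best

# As kappa grows the estimate shrinks toward the nearest simple point,
# rather than toward the origin as in ridge regression.
```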
P4 + K (P - Uj)'(fi (Y - X/?)‘(Y where I is the p X p identity bracketed Page 87 With Ridge Regression K. The ridge estimate + KI)-lXTY 1976, is . WI from either a frequentist is advised in situations (Thisted or Bayesian where the matrix which can occur when the variables are collinear. XTX In addition, view- is unstable, it does better than linear regression in terms of mean square error as it allows biased estimates. If a Bayesian analysis is used, the distributional error term E has independent Given this assumption, . components assumption is made that the each with mean zero and variance 02. the prior belief produces the Bayes rule [3.6]. The ridge estimate [3.1] toward the origin @ R [3.6] sh rm’ k s away from the least squares solution as the ridge parameter K increases. ~,QS The interpretability 3. Connections Page 88 p^ [3.5] sh rm ’ k s away from the least squares solution estimate point and Conclusions ul as the interpretability during the procedure If the varimax parameter K increases. toward the simple The index I may change however. index [1.12] is used instead of the distance index [1.15], the result is a solution similar to [3.5] except that the shrinkage is away from the set of 2p points 4~3,. (&*, regression terminology, . . f &). Shrinkage toward points is the usual ridge so the distance index is used in the explanation above. As Draper and Smith (1981) point out, ridge regression places a restriction the size of the coefficients ,0, whether or not that restriction interpretability [3.2] d oes not require any restriction approach squares the coefficients before examining mathematical problems. on the coefficients Clearly, the approach and identically -logL(P;Y) . terized a prior is discussed in the next section. is also the maximum likelihood the posterior distribution Thus, as Good and Gaskins m inimizes solution givenr N(0, a2). , Then the squared or = (Y - XP)‘(Y estimate likelihood are that the elements of the error distributed distance is the negative of the log likelihood prior. can be viewed as placing The necessary assumptions term E are independent tribution, as it norms and as a Prior The least squares solution The maximum The As noted, however, this produces and this Bayesian viewpoint 3.3 Interpretability certain conditions. them. is called a prior. on - XP) * this expression. is equal to the likelihood Given a prior dismultiplied (1971) point out, m inimizing [3.2] (Y - XP)'(Y - XP) - fqP) by the the reparame- 3.3 Interpretability as a Prior Page 89 1.0 angle of /3 in radians Fig. 3.2 is equivalent Interpretability prior density for p = 2. The prior fn( ,0) is plotted versus the angle of the coefficient vector in radians for various values of K. to minimizing the negative do so is to put a prior density interpretability for a given interpretability K>O index, the prior parameter To to . distribution of the coefficients value K > 0 is where CK is the normalizing constant. density minus the log prior. on p where the prior is proportional exp(KS(P)) For the varimax log likelihood belongs to the general exponential The coefficients are dependent. family defined by Watson (1983). The Page 90 3. Connections and Conclusions The prior calculated function, for the p = 2 case is plotted using numerical The constant Since the exponential integration. the basic shape of the curve resembles the index As K increases, the prior increases. 3.4 Future are pushed toward Similar C, is is a monotonic S1 as in Fig. 1.2. 
3.4 Future Work

The previous sections may serve as a foundation for the comparison of the interpretable method with other variable selection techniques; further work connecting the method with others is proposed below. In addition, ideas for both extending the approach to other data analysis methods and improving the algorithm are given.

3.4.1 Further Connections

Schwarz (1978) and C. J. Stone (1981, 1982) suggest other model selection techniques based on Bayesian assumptions. M. Stone (1979) asymptotically compares the Schwarz and Akaike (1974) criteria, noting that the comparison is affected by the type of analysis used. In the Bayesian framework, the interpretable technique could be compared with these others utilizing asymptotic analysis. Another technique which may provide an interesting comparison is that of Rissanen (1987), who applies the minimum description length principle from coding theory to measure the complexity of a model. In addition, the work by Copas (1983) on shrinkage in stepwise regression may provide ways to extend the ridge regression discussion of Section 3.2. Interestingly enough, all the model selection rules noted involve the number of parameters, k in [3.3], rather than a smooth measure of complexity.

As noted in Chapter 2, the varimax and entropy indices have similar intuitive and computational appeal, and both have historical motivation. Interpretability as measured by these indices could be compared with simplicity as defined using philosophical terminology in Good (1968), Sober (1975) and Rosenkrantz (1977). Though the framing of the interpretable approach as the placing of a prior on the coefficients attempts to do this, a less mathematical and more philosophical discussion may prove beneficial.

3.4.2 Extensions

The interpretable approach changes the usual combinatorial search into a numerical one. It can be applied to any data analysis method or model whose resulting description involves linear combinations. If the index is extended to measure the simplicity of other description types, such as functions, the approach provides the tradeoff of interpretability and accuracy for methods that produce more complicated descriptions.

At present, variable selection for linear regression is feasible for two reasons. First, the least squares solution for any subset of predictors is known analytically. Second, branch and bound search methods can be employed which smartly search the model space, eliminating areas so that all possible subsets need not be considered. Though separate variable selection procedures exist for linear regression, interpretable linear regression may help in collinear situations, where the least squares solution is unstable and ridge regression is usually suggested. At present, examples seem to indicate that interpretable linear regression clearly chooses the variables to include from a group of collinear ones and produces a stable solution.

A general class of models for which the interpretable method may prove useful is generalized linear models (McCullagh and Nelder 1983), of which logistic regression is an example. At present, when a generalized linear model is considered, stepwise methods are used to choose models; these methods could be compared with the interpretable approach on the basis of prediction error.
3.4.3 Algorithmic Improvements

As described in Chapter 1, the interpretable exploratory projection pursuit algorithm would benefit from further improvements. First, the rotationally invariant Fourier projection index G_F needs further testing and comparison with others; other rotationally invariant projection indices are possible, as suggested in Section 1.4.2. Second, present work involves designing a procedure to solve the constrained optimization instead of using the general routine as outlined in Section 1.4.5. The improvement should decrease the computational time involved. Analogously, the projection pursuit regression forecasting procedure described in Section 2.2.3 needs further investigation.

As mentioned in Chapters 1 and 2, the convergence of the coefficients as the weight on interpretability increases is slow. This property results from the squaring used in the varimax index and the fact that the index's slope goes to zero as the combinations move to an e_i (Figs. 1.2 and 1.3). Solutions might be to use a lower power, or a piecewise function which is the varimax index except for a range close to the e_i's where it is linear; the sections of the latter could be matched to have the same derivatives at their joined points.

3.5 A General Framework

The main result of this thesis is the computer implementation of a modification of projection pursuit which makes the results more understandable. Beyond this specific application, the approach provides a general framework for tackling similar problems. The sacrifice of an acceptable amount of accuracy in return for parsimony is made often in statistics, sometimes implicitly. The identification and formalization of this action is useful, since choices which previously were subjective become objective. To apply the framework generally, the definition of interpretability must be broadened to deal with a much more elusive set of outcomes than linear combinations, and measuring the interpretability increase is difficult. In the next subsection, a common simplifying action, rounding the binwidth for a histogram, is examined. This example shows that the consequences of a simplifying action in terms of accuracy loss can be approximated.

3.5.1 The Histogram Example

Consider the problem of drawing a histogram of n observations x_1, x_2, ..., x_n. Two quantities must be determined. The first is the left endpoint of the first bin, x_0, which usually is chosen so that no observations fall on the bin boundaries. The second is the binwidth h, which generally is calculated according to a rule of thumb and then rounded so that the resulting intervals are simple, usually to multiples of powers of ten. Based on the mathematical approach used to determine the binwidth rules widely employed, the loss in accuracy incurred by rounding can be calculated.

The purpose of a histogram is to estimate the shape of the true underlying density. Scott (1979) determines a binwidth rule which asymptotically minimizes the integrated mean square error of the histogram from the true density. Diaconis and Freedman (1981) use the same criterion to derive a slightly different rule and further theoretical results. A certain approximation step which Scott employs is useful in approximating the accuracy lost if the binwidth is rounded; the discussion below follows his approach.
The integrated mean square error of the estimated histogram density f̂ from the true density f is

IMSE ≡ ∫_{−∞}^{∞} E[f̂(x) − f(x)]² dx = 1/(nh) + (h²/12) ∫_{−∞}^{∞} f′(x)² dx + O(1/n + h³),   [3.7]

where h is the histogram binwidth. Minimizing the first two terms of [3.7] produces the estimate

ĥ = [6 / (n ∫_{−∞}^{∞} f′(x)² dx)]^{1/3}.   [3.8]

Scott also shows that if the binwidth is multiplied by a factor c > 0, the increase in IMSE is

IMSE(cĥ) = [(2 + c³)/(3c)] IMSE(ĥ).   [3.9]

Via Monte Carlo studies, Scott shows [3.8] and [3.9] to be good approximations for normal data. In reality, [3.8] is useless since the underlying density f and therefore its derivative f′ are unknown. Scott, and Diaconis and Freedman, suggest its use with the normal density as a reference distribution. Scott suggests the approximation

ĥ_S = 3.49 s n^{−1/3},

where s is the estimated standard deviation of the data. Diaconis and Freedman suggest the similar approximation

ĥ_D = 2(IQ) n^{−1/3},

where IQ is the interquartile range of the data. Monte Carlo studies have shown these approximations to be robust.

Given either approximation ĥ_S or ĥ_D, the statistician may elect to further approximate the binwidth estimate by rounding; the benefits of such simplification are discussed in a moment. At present, consider rounding the estimate ĥ_S. The new estimate is ĥ* ∈ (ĥ_S − u, ĥ_S + u), where u is some rounding unit. For example, u would be 1/2 if the binwidth is rounded to an integer. The new binwidth may also be written as ĥ* ≡ (1 + e)ĥ_S, where e is a positive or negative fraction depending on whether the old estimate is rounded up or down. The multiplying factor (1 + e) must be positive, as a negative binwidth is impossible.

The estimation procedure may be drawn schematically:

binwidth h → [minimize estimated first two terms of IMSE] → binwidth ĥ → [use normal density as reference] → binwidth ĥ_S → [round] → binwidth ĥ*

If [3.9] is used as an approximation for the loss in IMSE due to rounding, the relationship between the IMSE of ĥ_S and ĥ* is

IMSE(ĥ*) = [(3 + 3e + 3e² + e³)/(3(1 + e))] IMSE(ĥ_S),

and the percent change in IMSE can be plotted as a function of the multiplying fraction e, as shown in Fig. 3.3.

Fig. 3.3 Percent change in IMSE versus multiplying fraction e in the binwidth rounding example. The rounded binwidth ĥ* ≡ (1 + e)ĥ_S, where ĥ_S is the estimated binwidth due to Scott (1979).

This exercise demonstrates that the repercussions of a simplifying action in terms of accuracy can be approximated; note that rounding up and rounding down by the same amount result in different repercussions. The increase in interpretability is more difficult to measure explicitly, but its sources are clear. A rounded binwidth is easier to draw and describe, since the class boundaries are simpler, and the class delineations are easier to remember; in fact, Ehrenberg (1981) shows that two digits other than zero are all that can be retained in short-term memory. In addition, the rounding removes confusion in explaining the histogram or comparing it to another. Finally, rounding is in keeping with the precision of the actual observations x_i: extra digits beyond the number of significant ones in the data lead to a false sense of accuracy.
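The whole chain above (rule-of-thumb binwidth, rounding, approximate accuracy loss) fits in a few lines. A minimal sketch: the 3.49 s n^(−1/3) rule and the IMSE ratio come from Scott (1979) as quoted above, while the choice of rounding unit u is the analyst's.

```python
import numpy as np

def scott_binwidth(x):
    """Scott's reference rule h_S = 3.49 * s * n**(-1/3)."""
    return 3.49 * np.std(x, ddof=1) * len(x) ** (-1.0 / 3.0)

def rounded_binwidth(h, u=0.05):
    """Round h to the nearest multiple of the rounding unit u."""
    return u * round(h / u)

def imse_inflation(e):
    """Relative IMSE of (1 + e) * h_S versus h_S, from [3.9] with c = 1 + e."""
    c = 1.0 + e
    return (2.0 + c ** 3) / (3.0 * c)

rng = np.random.default_rng(1)
x = rng.normal(size=330)             # same n as the air pollution data
h_s = scott_binwidth(x)
h_star = rounded_binwidth(h_s)
e = h_star / h_s - 1.0
print(f"h_S = {h_s:.3f}, rounded to {h_star:.2f}: "
      f"{100 * (imse_inflation(e) - 1):.2f}% IMSE increase")
```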
How to measure the interpretability gain of a histogram directly is difficult to determine. Diaconis (1987) suggests conducting experiments to quantify interpretability. A typical experiment might be to divide a statistics class into two matched groups and to present to the two groups respectively an unrounded histogram and its simplified version. Measurements of interpretability could be made on the basis of the correctness of answers to questions such as 'How would adding the following observations change the shape of the histogram?' or 'Where is the lower quartile?'. As interpretability is increased, some information may be lost; for example, the question 'Where is the mode?' may become unanswerable. This loss of information, a more general measure of inaccuracy than IMSE, could be measured along with interpretability. In addition, expert opinion could be included by asking data analysts their reaction to rounding.

3.5.2 Conclusion

Computers have changed statistics substantially and irreversibly. On the one hand, computer-intensive techniques provide the statistician with a previously undreamed of amount of flexibility and the power with which to follow the basic tenets of exploratory data analysis (Tukey 1977): to let the data drive the analysis, rather than subjecting it in the initial stages to preconceived assumptions which may not be true. In later stages, confirmation of the models which have been fit can be readily addressed using the bootstrap (Efron 1982) or other resampling procedures; as Tukey noted, the statistician no longer asks 'What can be confirmed?' but rather 'What can be done?'. On the other hand, the sheer number and complexity of the models or descriptions which can now be fit and evaluated have expanded enormously and can be both bewildering and unmanageable. With these computer tools has come freedom; the aim of this thesis is to retain that freedom without losing the statistician's basic goal of clearly understanding the results of an analysis and communicating those results to others.
Computing power is extending the statistician's imagination and eliminating computational and confirmational boundaries; even unconscious restrictions are alleviated (McDonald and Pedersen 1985). The results of such an analysis can be complex, hard to understand and, even more important, hard to explain. In a sense, a Pandora's box has been opened. Though this progress is exciting, just as grappling with the theoretical demons of new methods such as projection pursuit (Huber 1985) is a necessary and difficult task, so too is considering the parsimonious aspects. In order to understand and communicate the results of a statistical analysis effectively, a controlled use of these new methods is helpful. Fortunately, the very computing power which has produced these novel techniques provides a means to balance the search for an accurate and truthful description of the data with an equally important desire for simplicity. Interpretable projection pursuit strikes such a balance.

Appendix A
Gradients

A.1 Interpretable Exploratory Projection Pursuit Gradients

In this section, the gradients for the interpretable exploratory projection pursuit objective function are calculated. Since the search procedure is conducted in the sphered space, the desired gradients are

dF/dα_j = (∂F/∂α_j1, ∂F/∂α_j2, ..., ∂F/∂α_jp)^T,   j = 1, 2.   [A.1]

From [A.1], the gradients may be written in vector notation as

dF/dα_j = [(1 − λ)/max G_F] ∂G_F/∂α_j + λ ∂S/∂α_j,   j = 1, 2.

The Fourier projection index gradients ∂G_F/∂α_j are calculated from [1.23]. From the definition of the Laguerre functions [1.20],

ℓ_i(R) = L_i(R) e^{−R/2},   i = 0, 1, ...,

the derivatives are

∂ℓ_i/∂R = (∂L_i/∂R − (1/2) L_i) e^{−R/2},

with recursive equations derived from the definition of the Laguerre polynomials [1.21]:

∂L_0/∂R = 0,   ∂L_1/∂R = −1,
∂L_i/∂R = [(2i − 1 − R)/i] ∂L_{i−1}/∂R − (1/i) L_{i−1} − [(i − 1)/i] ∂L_{i−2}/∂R,   i = 2, 3, ....

The gradients of the radius squared R and the angle θ are calculated using the definition [1.20]. If X_1 ≡ α_1^T Z and X_2 ≡ α_2^T Z, the gradients are

∂R/∂α_j = 2 X_j Z,   j = 1, 2,   [A.3]

and, since θ = arctan(X_2/X_1),

∂θ/∂α_1 = −(X_2/R) Z,   ∂θ/∂α_2 = (X_1/R) Z.

In [A.3], note that Z is a vector in R^p, while X_1 and X_2 are scalars. As with the calculation of the value of the index, the expected values are approximated by sample means over the data.

The gradients of the simplicity index are calculated using the index definition [1.27], the orthogonally translated component definition [1.25], and the mapping from the unsphered to the sphered space [1.5]. By the chain rule, the gradients may be written in matrix notation as

∂S/∂α_1 = (∂β_1/∂α_1) ∂S/∂β_1,   ∂S/∂α_2 = (∂β_2/∂α_2)(∂β̃_2/∂β_2) ∂S/∂β̃_2.   [A.4]

In [A.4], note that the partial derivative of one vector with respect to another is a p × p matrix; for example,

∂β_1/∂α_1 ≡ (∂β_1r/∂α_1s),   r, s = 1, ..., p.

Using [1.5], the entries ∂β_jr/∂α_js are given in terms of U_rs and d_s, where U and D are the eigenvector and eigenvalue matrices defined in [1.3]. Using [1.25], the entries of ∂β̃_2/∂β_2 follow from the orthogonal translation of β_2 against β_1, with the diagonal elements (r = s) and the off-diagonal elements (r ≠ s) treated separately.

The two vector varimax simplicity index S [1.12] may be written as a combination of the individual simplicities S_1(β_1) and S_1(β_2) and a cross-term C(β_1, β_2). Taking partial derivatives with respect to the components of β_1 yields ∂S/∂β_1 in terms of (p − 1) S_1(β_1) and the cross-term; the partial with respect to β̃_2 is identical, with β̃_2 components replacing β_1 components.

A.2 Interpretable Projection Pursuit Regression Gradients

In this section, the gradients for the interpretable projection pursuit regression objective function

F(β, α, f, λ; Y) ≡ (1 − λ) L_2(β, α, f; Y) − λ S_g(α_1, α_2, ..., α_m),

with the L_2 distance normalized as in Chapter 2, are calculated. The gradients used in the steepest descent search for the directions α_j are

dF/dα_j = (∂F/∂α_j1, ∂F/∂α_j2, ..., ∂F/∂α_jp)^T,   j = 1, ..., m.   [A.5]

From [A.5], the gradients may be written in vector notation as

dF/dα_j = (1 − λ) ∂L_2/∂α_j − λ ∂S_g/∂α_j,   j = 1, ..., m.

Friedman (1985) calculates the L_2 distance gradients,

∂L_2/∂α_j = −2 E{ [R_j − β_j f_j(α_j^T X)] β_j f_j′(α_j^T X) X },   j = 1, ..., m,

where R_j is the residual of Y on all terms but the jth. The partials of the curves f_j are estimated from [2.7] using interpolation.

The gradients of the simplicity index are calculated using the index definition [2.13]. Taking partial derivatives yields ∂S_g/∂α_ji in terms of the deviations (γ_k − 1/p) and the partials ∂γ_k/∂α_ji, for j = 1, ..., m and i = 1, ..., p, where the partials of the overall vector γ [2.12] are

∂γ_k/∂α_ji = −2 α_ji α_jk² / (α_j^T α_j)²,   i ≠ k,
∂γ_k/∂α_jk = 2 α_jk (α_j^T α_j − α_jk²) / (α_j^T α_j)²,

for j = 1, ..., m and i, k = 1, ..., p.
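Analytic gradients like those above are easy to get subtly wrong; a standard safeguard is to compare them against central finite differences. A minimal, generic sketch (the toy objective is a placeholder, not the thesis's F):

```python
import numpy as np

def finite_diff_grad(f, x, h=1e-6):
    """Central-difference approximation to the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return g

def check_gradient(f, grad, x, tol=1e-4):
    """Relative agreement between analytic and numerical gradients."""
    ga, gn = grad(x), finite_diff_grad(f, x)
    err = np.linalg.norm(ga - gn) / max(np.linalg.norm(gn), 1e-12)
    return err < tol, err

# Example with a toy objective in place of F:
f = lambda a: float(np.sum(a ** 4))   # placeholder objective
grad = lambda a: 4.0 * a ** 3         # its analytic gradient
print(check_gradient(f, grad, np.array([0.3, -0.7, 1.1])))
```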
References

Akaike, H. (1974). "A new look at the statistical model identification," IEEE Transactions on Automatic Control AC-19, 716-723.

Asimov, D. (1985). "The Grand Tour: a tool for viewing multidimensional data," SIAM Journal of Scientific and Statistical Computing 6, 128-143.

Bellman, R. E. (1961). Adaptive Control Processes, Princeton University Press, Princeton.

Breiman, L. and Friedman, J. H. (1985). "Estimating optimal transformations for multiple regression and correlation (with discussion)," Journal of the American Statistical Association 80, 580-619.

Buja, A., Hastie, T. and Tibshirani, R. (1989). "Linear smoothers and additive models (with discussion)," Annals of Statistics 17, 453-555.

Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis, Wadsworth, Boston.

Copas, J. B. (1983). "Regression, prediction and shrinkage (with discussion)," Journal of the Royal Statistical Society, Series B 45, 311-354.

Dawes, R. M. (1979). "The robust beauty of improper linear models in decision making," American Psychologist 34, 571-582.

Diaconis, P. (1987). Personal communication.

Diaconis, P. and Freedman, D. (1981). "On the histogram as a density estimator: L2 theory," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 57, 453-476.

Diaconis, P. and Shahshahani, M. (1984). "On nonlinear functions of linear combinations," SIAM Journal of Scientific and Statistical Computing 5, 175-191.

Donoho, D. L. and Johnstone, I. M. (1989). "Projection-based approximation and a duality with kernel methods," Annals of Statistics 17, 58-106.

Draper, N. and Smith, H. (1981). Applied Regression Analysis, Wiley, New York.

Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans, CBMS 38, SIAM-NSF, Philadelphia.

Efron, B. (1988). "Computer-intensive methods in statistical regression," SIAM Review 30, 421-449.

Ehrenberg, A. S. C. (1981). "The problem of numeracy," American Statistician 35, 67-71.

Friedman, J. H. (1984a). "SMART user's guide," Technical Report LCS001, Department of Statistics, Stanford University.

Friedman, J. H. (1984b). "A variable span smoother," Technical Report LCS005, Department of Statistics, Stanford University.

Friedman, J. H. (1985). "Classification and multiple regression through projection pursuit," Technical Report LCS012, Department of Statistics, Stanford University.

Friedman, J. H. (1987). "Exploratory projection pursuit," Journal of the American Statistical Association 82, 249-266.

Friedman, J. H. and Stuetzle, W. (1981). "Projection pursuit regression," Journal of the American Statistical Association 76, 817-823.

Friedman, J. H. and Tukey, J. W. (1974). "A projection pursuit algorithm for exploratory data analysis," IEEE Transactions on Computers C-23, 881-889.

Gill, P., Murray, W. and Wright, M. H. (1981). Practical Optimization, Academic Press, London.

Gill, P. E., Murray, W., Saunders, M. A. and Wright, M. A. (1986). "User's guide for NPSOL," Technical Report SOL 86-2, Department of Operations Research, Stanford University.

Good, I. J. (1968). "Corroboration, explanation, evolving probability, simplicity and a sharpened razor," British Journal for the Philosophy of Science 19, 123-143.

Good, I. J. and Gaskins, R. A. (1971). "Nonparametric roughness penalties for probability densities," Biometrika 58, 255-277.

Gorsuch, R. L. (1983). Factor Analysis, Lawrence Erlbaum Associates, New Jersey.

Hall, P. (1987). "On polynomial-based projection indices for exploratory projection pursuit," Annals of Statistics 17, 589-605.
Harman, H. H. (1976). Modern Factor Analysis, The University of Chicago Press, Chicago.

Hastie, T. and Tibshirani, R. (1984). "Generalized additive models," Technical Report LCS002, Department of Statistics, Stanford University.

Huber, P. (1985). "Projection pursuit (with discussion)," Annals of Statistics 13, 435-525.

Jones, M. C. (1983). "The projection pursuit algorithm for exploratory data analysis," Ph.D. Dissertation, University of Bath.

Jones, M. C. and Sibson, R. (1987). "What is projection pursuit? (with discussion)," Journal of the Royal Statistical Society, Series A 150, 1-36.

Krzanowski, W. J. (1987). "Selection of variables to preserve multivariate data structure, using principal components," Applied Statistics 36, 22-33.

Lubinsky, D. and Pregibon, D. (1988). "Data analysis as search," Journal of Econometrics 38, 247-268.

Lundy, M. (1985). "Applications of the annealing algorithm to combinatorial problems in statistics," Biometrika 72, 191-198.

Mallows, C. L. (1973). "Some comments on Cp," Technometrics 15, 661-676.

Mallows, C. L. (1983). "Data description," in Scientific Inference, Data Analysis, and Robustness, eds. G. E. P. Box, T. Leonard and C.-F. Wu, Academic Press, New York, 135-151.

Marshall, A. W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications, Academic Press, New York.

McCabe, G. P. (1984). "Principal variables," Technometrics 26, 137-144.

McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models, Chapman and Hall, New York.

McDonald, J. A. (1982). "Interactive graphics for data analysis," Ph.D. Dissertation, Department of Statistics, Stanford University.

McDonald, J. A. and Pedersen, J. (1985). "Computing environments for data analysis part I: introduction," SIAM Journal of Scientific and Statistical Computing 6, 1004-1012.

Reinsch, C. H. (1967). "Smoothing by spline functions," Numerische Mathematik 10, 177-183.

Rényi, A. (1961). "On measures of entropy and information," in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, ed. J. Neyman, University of California Press, Berkeley, 547-561.

Rissanen, J. (1987). "Stochastic complexity (with discussion)," Journal of the Royal Statistical Society, Series B 49, 223-239, 252-265.

Rosenkrantz, R. D. (1977). Inference, Method and Decision, Reidel, Boston.

Schwarz, G. (1978). "Estimating the dimension of a model," Annals of Statistics 6, 461-464.

Scott, D. (1979). "On optimal and data-based histograms," Biometrika 66, 605-610.

Silverman, B. W. (1984). "Penalized maximum likelihood estimation," in Encyclopedia of Statistical Sciences, eds. S. Kotz and N. L. Johnson, Wiley, New York, 664-667.

Sober, E. (1975). Simplicity, Clarendon Press, Oxford.

Stone, C. J. (1981). "Admissible selection of an accurate and parsimonious normal linear regression model," Annals of Statistics 9, 475-485.

Stone, C. J. (1982). "Local asymptotic admissibility of a generalization of Akaike's model selection rule," Annals of the Institute of Statistical Mathematics 34, 123-133.

Stone, M. (1979). "Comments on model selection criteria of Akaike and Schwarz," Journal of the Royal Statistical Society, Series B 41, 276-278.

Sun, J. (1989). "P-values in projection pursuit," Ph.D. Dissertation, Department of Statistics, Stanford University.

Thisted, R. A. (1976). "Ridge regression, minimax estimation and empirical Bayes methods," Ph.D. Dissertation, Department of Statistics, Stanford University.
Thurstone, L. L. (1935). The Vectors of the Mind, University of Chicago Press, Chicago.

Tukey, J. W. (1961). "Discussion, emphasizing the connection between analysis of variance and spectrum analysis," Technometrics 3, 201-202.

Tukey, J. W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading, MA.

Tukey, J. W. (1983). "Another look at the future," in Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface, eds. K. Heiner, R. Sacher and J. Wilkinson, Springer-Verlag, New York, 2-8.

Tukey, P. A. and Tukey, J. W. (1981). "Preparation; prechosen sequences of views," in Interpreting Multivariate Data, ed. V. Barnett, Wiley, New York, 189-213.

Watson, G. S. (1983). Statistics on Spheres, Wiley, New York.