CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

RIDGE REGRESSION: AN EXAMINATION OF THE BIASING PARAMETER

A thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Health Science, Biostatistics and Epidemiology

by

John R. C. Odencrantz

January, 1979

The Thesis of John Odencrantz is approved:

Madison
Bernard Hanes, Committee Chairman

California State University, Northridge

ACKNOWLEDGMENTS

I would like to thank the members of my committee for their comments and suggestions. In particular, I wish to thank Dr. Bernard Hanes for his support and encouragement, without which this thesis would not have been possible.

TABLE OF CONTENTS

APPROVAL
ACKNOWLEDGMENTS
ABSTRACT

Chapter 1  INTRODUCTION
    BACKGROUND
    PURPOSE
Chapter 2  REVIEW OF THE LITERATURE
Chapter 3  OPTIMIZATION AND GENERALIZED RIDGE REGRESSION
    THE CRITERIA FOR OPTIMIZATION
    DERIVING AN OPTIMUM FOR GENERALIZED RIDGE REGRESSION
    ESTIMATING THE RIDGE OPTIMUM
    AN ALTERNATIVE OPTIMUM FOR GENERALIZED RIDGE REGRESSION
Chapter 4  OPTIMIZING THE ORDINARY RIDGE ESTIMATOR
    RIDGE SOLUTIONS FOR THE E(L_1^2) CRITERION
    THE E(L_1^2) SOLUTION OF HOCKING, SPEED, AND LYNN
    MALLOWS' E(L_2^2) RIDGE SOLUTION
    THE LAWLESS-WANG ESTIMATOR
    THE McDONALD-GALARNEAU ESTIMATOR
    OBENCHAIN'S ESTIMATOR FOR ORDINARY RIDGE REGRESSION
    VINOD'S ESTIMATOR FOR ORDINARY RIDGE REGRESSION
    A REVIEW AND EVALUATION OF THE ORDINARY RIDGE SOLUTIONS
Chapter 5  CONCLUSIONS
BIBLIOGRAPHY
APPENDIX I
APPENDIX II
APPENDIX III
APPENDIX IV
APPENDIX V
APPENDIX VI
APPENDIX VII

ABSTRACT

RIDGE REGRESSION: AN EXAMINATION OF THE BIASING PARAMETER

by John R. C. Odencrantz

Master of Science in Health Science, Biostatistics and Epidemiology

Ridge regression is an alternative to least squares for highly collinear systems of predictor variables. It differs from least squares in having a biasing parameter, k, added to the main diagonal of the X'X matrix. A number of rules for choosing k have been proposed, all of which give different solutions. Fundamental in applying ridge regression is deciding which rule to use.

This thesis examines some of the proposed methods of choosing the biasing parameter. The two types of ridge regression, generalized ridge and ordinary ridge, are considered separately. Derivations are given for the different solutions, with stress on the intent and underlying assumptions of each. To permit an evaluation of relative performance, the results of Wichern and Churchill's (1978) simulation study are included.

Two generalized ridge solutions are derived: those of Hoerl and Kennard (1970) and Hemmerle and Brantle (1978). In its original form, the solution of Hoerl and Kennard was iterative. Hemmerle's (1975) reduction of the Hoerl-Kennard iteration to a single step is included, as is a simpler and more intuitive way of achieving the same result.

Several proposed solutions to the ordinary form of ridge regression are given. One of these (Mallows, 1973) is shown to be incorrect in its final algebraic form, and a numerical approach is suggested instead. An appendix of numerical examples is added. Finally, some background results relating to ridge regression are included.
Among these are a derivation of ordinary ridge regression from a theorem in quadratic response surfaces and a presentation of Marquardt's (1970) "fractional rank" estimator, a technique closely related to ridge regression.

Chapter 1

INTRODUCTION

Background

The standard model for multiple linear regression is

    y = Xβ + e,    (1.1)

where X is an (n×p) matrix of predictor variables, y is an (n×1) vector of responses, e is an (n×1) error vector such that E(e) = 0 and E(ee') = σ^2 I_n, where σ^2 is an unknown constant, and β is a (p×1) vector of unknown regression coefficients. The usual solution for (1.1) is the Gaussian least squares estimator

    β̂ = (X'X)^{-1} X'y,    (1.2)

where β̂ is (p×1) and E(β̂) = β.

Multiple regression is among the most popular tools for the analysis of health data. Typically such data are extensive and involve many survey variables, some of them highly correlated. Thus regression models which attempt to make full use of the available information will often be multicollinear. This leads to difficulties: β̂ is so unstable for multicollinear data that even minor perturbations of the data can change the solution drastically (Hoerl, 1962). The mean square error is likely to be unreasonably large, and the β̂ vector tends to have a much greater norm than the β vector it is estimating (Hoerl and Kennard, 1970A).

Under conditions of multicollinearity the investigator often chooses to drop variables from the model. Popular statistical methods of doing this include stepwise techniques (Efroymson, 1960), calculation of all possible subsets (Garside, 1971), and regression on principal components (Massy, 1965). Of these, stepwise methods are the easiest in terms of computation and interpretation, and are included in most statistical packages. However, stepwise methods are of little use for multicollinear data: their intended function is to eliminate variables with no predicting power from orthogonal systems of predictor variables. Principal components regression and selection of a best predictor subset out of all possible subsets may both yield satisfactory results subject to the selection criterion. Principal components regression (Appendix I) requires interpretational effort, but its structural simplicity has much to recommend it. This is especially true in very large systems, where the computation of all possible subsets becomes impractical.

If the purpose of the regression model is to predict one variable from a set of other variables, dropping predictors makes sense for multicollinear data. A subset of predictor variables will specify the response variable almost as precisely as will the total set. The problem is that regression, especially in areas such as epidemiology, is likely to have as its true intent the explaining of some effect in terms of other observables. Since the relationships between the various predictors are seldom completely understood (otherwise multicollinearity could be avoided), some loss of explanatory information is bound to accompany reductions in the model.

The ridge estimator of Hoerl and Kennard (1970A&B) is another method of handling multicollinearity. Variables may still be dropped (Hoerl and Kennard, 1970B; McDonald and Schwing, 1973), but the emphasis is on transforming the estimators to achieve greater stability and smaller mean square error. The ridge estimator is given by

    β* = (X'X + kI)^{-1} X'y,

where I is a (p×p) identity matrix, k is a constant, β* is (p×1), and X and y are the same as in the least squares estimator. The relationship between β̂, the least squares solution, and β*, the ridge solution, is

    β* = [I + k(X'X)^{-1}]^{-1} β̂    (1.3)

(Hoerl and Kennard, 1970A).
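A minimal numerical sketch of the two estimators, assuming Python with NumPy and wholly hypothetical data (none of the names below come from the thesis), shows the contrast on a nearly collinear system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors plus noise (hypothetical data).
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)      # true coefficients are (1, 1)

# Least squares (1.2): beta_hat = (X'X)^{-1} X'y
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: beta_star = (X'X + kI)^{-1} X'y for a small positive k
k = 0.1
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

print("least squares:", beta_ls)      # typically wild, large offsetting values
print("ridge (k=0.1):", beta_ridge)   # shrunk toward stable values near (1, 1)
```

The least squares coefficients on such data tend to be large and of opposite sign, while the ridge solution is shrunk and far less sensitive to small perturbations of X.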
Since β̂ is unbiased, it follows that the ridge solution is biased for k ≠ 0. Usually k, the biasing parameter, is chosen to minimize the mean square error of β*. In practice, much of ridge regression centers around estimating the best biasing parameter. Hoerl and Kennard considered this problem in their 1970 papers, and several authors have proposed solutions since then.

Purpose

The purpose of this thesis is to review the rules currently proposed for determining the biasing parameter. The theoretical basis for each rule will be given, and the results of a simulation study comparing some of the estimators (Wichern and Churchill, 1978) will be presented. Solutions considered are those of Hoerl and Kennard (1970A), Mallows (1973), Hemmerle (1975), Hoerl, et al. (1975), McDonald and Galarneau (1975), Hoerl and Kennard (1976), Lawless and Wang (1976), Hocking, et al. (1977), Hemmerle and Brantle (1978), and Obenchain (1978).

Chapter 2

REVIEW OF THE LITERATURE

The effects of multicollinearity are well known, and have been detailed by Farrar and Glauber (1967), Hoerl and Kennard (1970A), Snee (1973), and Mason, et al. (1975). Several authors have also pointed out the prevalence of multicollinearity in real data, and examples may be found in McDonald and Schwing (1973) and in Gorman and Toman (1966).

Computationally, the problem posed by redundant variables is a fundamental one: it is impossible to invert a singular matrix. It is true that in most cases collinearity does not imply true singularity, but for practical purposes nearly singular matrices may produce useless answers.

Hoerl (1962) suggested the application of response surface methodology (Box and Wilson, 1951; Hoerl, 1959) to ill-conditioned regression problems. His ridge estimator was a solution to the Lagrangian problem of minimizing the residual sum of squares for a given estimator norm (Appendix II). It differed from Gaussian regression in having a biasing parameter, k, added to the main diagonal of the correlation matrix. The only restriction on k was that it be positive, a result of theoretical considerations proposed earlier by Hoerl (1959) and later proven by Draper (1963). This was the basis of ridge regression, but, as Hoerl remarked in the same (1962) paper, the theory was incomplete. In particular, no proof had been offered that the ridge estimator was a good one in terms of mean square error. Hoerl's (1964) review of ridge analysis did not include ridge regression except in a comment that more work was needed.

A systematic development of the method appeared later in two papers (Hoerl and Kennard, 1970A&B), and included the following: (1) a rederivation of the ridge estimator, showing it to be of minimum length for a given residual sum of squares (Appendix II); (2) proof that there exists some ridge estimator having a smaller mean square error than the corresponding least squares estimator; (3) a description of a "canonical" form of ridge regression involving transformed variables; (4) an algorithm for finding a best (in the mean square error sense) estimator for the canonical (generalized) form of ridge regression; and (5) the graphical ridge trace. Thus the 1970 papers of Hoerl and Kennard presented both the theoretical basis for ridge regression and considerable extensions of the technique.
The methodology used by McDonald and Schwing (1973) in studying air pollution was precisely that given in the second of the Hoerl-Kennard papers.

Since the appearance of ridge regression, there has been interest in its relationship to other biased estimators. Marquardt (1970) showed that a number of properties are shared by ridge regression and his "fractional rank" (Appendix I) generalization of the principal components estimator. Goldstein and Smith (1974), extending the work of Lindley and Smith (1972), found that the ridge solution actually approximates the fractional rank solution.

Assessments of the relative power of ridge and other estimators have been made by Mayer and Wilke (1973) and by Hocking, et al. (1976). Mayer and Wilke derived ordinary ridge estimators and shrunken estimators (Stein, 1960; Sclove, 1968) as minimum norm (for a given residual sum of squares) estimators in the class of linear transforms of least squares estimators. They found that shrunken estimators had minimum variance among those studied. Hocking, et al. carried the generalization still further, finding a class of estimators that included principal components estimators and generalized ridge regression as well as shrunken estimators and ordinary ridge regression. They concluded that generalized ridge was most effective at minimizing the mean square error.

The determination of an optimal biasing parameter was considered by Hoerl and Kennard in their fundamental work. Their solution was iterative and involved only the generalized ridge estimator. Hemmerle (1975) found that iteration was not necessary, and Hemmerle and Brantle (1978) developed an alternative solution, again restricted to generalized ridge regression. Biasing parameters for the ordinary ridge estimator have been considered by Mallows (1973), Farebrother (1975), Hoerl, et al. (1975), Hoerl and Kennard (1976), Lawless and Wang (1976), McDonald and Galarneau (1975), Hocking, et al. (1976), Obenchain (1978), and Wichern and Churchill (1978). In the following sections, some of the solutions introduced above will be examined in detail.

Chapter 3

OPTIMIZATION AND GENERALIZED RIDGE REGRESSION

The Criteria for Optimization

The simplest form of the ridge estimator is

    β* = (X'X + kI)^{-1} X'y,    (3.1)

where X is an (n×p) matrix of n observations on p predictor variables, y is an (n×1) vector of observations on the response variable, k is a constant, I is a (p×p) identity matrix, and β* is a (p×1) vector of estimators.

The k in (3.1) can in theory take on any positive value; therefore (3.1) has infinitely many possible solutions. Since these will not all be equally useful to a researcher using ridge regression, some means of choosing a value for k is needed. This in turn means that the criteria by which a solution is considered to be a good one must be established.

Hoerl (1962) and, later, Hoerl and Kennard (1970A&B) favored stability as a criterion. Stability, in the sense of Hoerl and Kennard, meant the extent to which changes in k affect β*; an estimator is stable or unstable in this sense depending on whether the absolute values of the individual terms of dβ*/dk are small or large. The values which β* takes on as k changes are referred to as the ridge trace.

Vinod (1976) objected to this concept of stability because a strict application of it to any problem would lead to the conclusion that the optimal k has an infinitely large value. He proposed a modified ridge trace with the k-axis replaced by an m-axis defined as

    m = p − Σ_{i=1}^{p} λ_i/(λ_i + k),    (3.2)

where p is as before the number of independent variables and λ_i is the ith eigenvalue of X'X.
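A short sketch, assuming Python with NumPy and hypothetical eigenvalues, shows how (3.2) maps the infinite k-axis onto the finite interval from 0 to p:

```python
import numpy as np

def m_of_k(eigenvalues, k):
    """Vinod's m-axis (3.2): m = p - sum(lambda_i / (lambda_i + k))."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam.size - np.sum(lam / (lam + k))

lam = np.array([1.9, 0.09, 0.01])   # hypothetical eigenvalues of X'X (p = 3)
for k in [0.0, 0.01, 0.1, 1.0, 10.0]:
    print(f"k = {k:6.2f}  ->  m = {m_of_k(lam, k):.4f}")
# m runs from 0 (k = 0, least squares) toward p as k grows,
# so every finite m corresponds to a finite k.
```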
The advantage of this modification is that the point of maximal stability for each term d(β*)_i/dm is at some m which corresponds to a finite k.

Since dβ*/dm is a vector, it is of no immediate use as a test statistic. Vinod "scalarized" it through a statistic he termed the Index of Stability of Relative Magnitudes (ISRM):

    ISRM = Σ_{i=1}^{p} [p λ_i/((λ_i + k)^2 s) − 1]^2,    (3.3)

where

    s = Σ_{i=1}^{p} λ_i/(λ_i + k)^2.

The ISRM is zero for orthogonal predictor systems, nonzero for nonorthogonal systems, and large in absolute value for seriously nonorthogonal systems. To some extent, it indicates how much a given model resembles an orthogonal system.

Various considerations of stability or sensitivity in the estimator have been closely associated with ridge regression from the inception of the technique. However, stability, whether as defined by Hoerl (1962), Hoerl and Kennard (1970A), or Vinod (1976), cannot be satisfactorily equated with any statistical concept outside ridge regression. For this reason, there is interest in finding other criteria by which a ridge solution can be considered optimal.

A widely accepted basis for evaluating estimators is their mean square error. In the case of ridge regression, investigators have considered both the ordinary mean square error, defined as

    L_1^2 = (β* − β)'(β* − β),    (3.4)

where β* is the ridge estimator and β is equal to the expected value of the least squares estimator, and the criterion of Stein (1960), defined as

    L_2^2 = (β* − β)'X'X(β* − β).    (3.5)

The best ridge estimators in the mean square error sense are those which minimize one of the expected values

    E(L_1^2) = E[(β* − β)'(β* − β)]    (3.6)

or

    E(L_2^2) = E[(β* − β)'X'X(β* − β)].    (3.7)

With some algebra (Appendix III), (3.6) can be expressed either as

    E(L_1^2) = σ^2 trace[(X'X + kI)^{-1}X'X(X'X + kI)^{-1}] + k^2 β'(X'X + kI)^{-2}β    (3.8)

or as

    E(L_1^2) = Σ_{i=1}^{p} (σ^2 λ_i + k^2 α_i^2)/(λ_i + k)^2,    (3.9)

where σ^2 is the residual mean square error and α_i is the ith term of α = Q'β, Q being the matrix of eigenvectors of X'X such that Q'X'XQ = Λ, the matrix of eigenvalues. Similarly, (3.7) can be written either as

    E(L_2^2) = σ^2 trace[X'X(X'X + kI)^{-1}X'X(X'X + kI)^{-1}] + k^2 β'X'X(X'X + kI)^{-2}β    (3.10)

or as

    E(L_2^2) = Σ_{i=1}^{p} (σ^2 λ_i^2 + λ_i k^2 α_i^2)/(λ_i + k)^2.    (3.11)

Deriving an Optimum for Generalized Ridge Regression

A comparison of (3.8) and (3.10) with (3.9) and (3.11) shows that (3.9) and (3.11) are algebraically simpler than the other two. For this reason, it is the practice among researchers investigating optima for ridge regression to use these simpler, transformed forms. The X matrix of predictor variables is transformed by postmultiplying it by Q (see Appendix III), and XQ is then substituted into (3.1) in place of X. The resulting estimator is

    α* = [(XQ)'(XQ) + kI]^{-1}(XQ)'y    (3.12)

or, equivalently,

    α* = (Λ + kI)^{-1}Q'X'y,    (3.12)

where Λ is the diagonal matrix of eigenvalues of X'X. Let α̂ be the value of α* which corresponds to k = 0, and let α = E(α̂). Then the relationships between α, α̂, α* and β, β̂, β* are α = Q'β, α̂ = Q'β̂, and α* = Q'β*. This is a more general case of principal components regression (Appendix I).

The optima for E(L_1^2) and E(L_2^2) are found by differentiating (3.9) and (3.11) with respect to k and setting the results equal to zero. Thus

    Σ_{i=1}^{p} λ_i(k α_i^2 − σ^2)/(λ_i + k)^3 = 0    (3.13)

and

    Σ_{i=1}^{p} λ_i^2(k α_i^2 − σ^2)/(λ_i + k)^3 = 0    (3.14)

determine the optima.
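Since (3.13) has no closed form in k, the expected loss (3.9) can always be examined numerically instead. A small sketch, assuming Python with NumPy and hypothetical parameter values (in practice σ^2 and the α_i must themselves be estimated), evaluates E(L_1^2) on a grid of k:

```python
import numpy as np

def e_l1_sq(k, lam, alpha, sigma2):
    """E(L1^2) from (3.9): sum (sigma^2 lambda_i + k^2 alpha_i^2)/(lambda_i + k)^2."""
    return np.sum((sigma2 * lam + k**2 * alpha**2) / (lam + k) ** 2)

lam = np.array([1.5, 0.4, 0.1])      # hypothetical eigenvalues of X'X
alpha = np.array([1.0, 0.5, 2.0])    # hypothetical transformed coefficients
sigma2 = 1.0

ks = np.linspace(0.0, 2.0, 2001)
vals = [e_l1_sq(k, lam, alpha, sigma2) for k in ks]
k_best = ks[int(np.argmin(vals))]
print("grid minimizer k ~", k_best)
print("E(L1^2) at k = 0 (least squares):", e_l1_sq(0.0, lam, alpha, sigma2))
print("E(L1^2) at grid minimizer:      ", e_l1_sq(k_best, lam, alpha, sigma2))
```

The minimizing k is strictly positive, in accord with the existence proof of Hoerl and Kennard (1970A).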
Although (3.13) and (3.14) optimize k on the basis of the L_1^2 and L_2^2 criteria, they are not analytic solutions for k, nor can they be solved analytically for the general case. The only way to find a general analytic solution would be to solve each term of (3.13) and (3.14) separately for k, which would mean that in general the k_i for the ith term would not be the same as the k_j for the jth term. Let K be a diagonal matrix whose ith nonzero entry, k_i, is positive but not necessarily the same as k_j, its jth nonzero entry. The generalized ridge estimator is

    α* = (Λ + K)^{-1}Q'X'y.    (3.15)

The generalized ridge estimator differs from the ordinary ridge estimator in two respects. First, the requirement that k_i = k_j satisfies the conditions for a Lagrangian system which minimizes the norm of a ridge estimator for a given residual sum of squares (Appendix II); the generalized ridge estimator, which drops this requirement, is thus not of minimum norm. Secondly, it is not generally the case that Q'KQ = K. Therefore, it is not true that

    Qα* = Q(Λ + K)^{-1}Q'X'y    (3.16)

is the same as

    Qα* = Q(Λ + Q'KQ)^{-1}Q'X'y.    (3.17)

This means that the generalized ridge estimator which optimizes E[(α* − α)'(α* − α)] will not in general optimize E[(Qα* − Qα)'(Qα* − Qα)] (see Appendix III). Consequently, the E(L_1^2) and E(L_2^2) criteria are defined, for the generalized ridge estimator, to be

    E(L_1^2) = E[(α* − α)'(α* − α)]    (3.18)

and

    E(L_2^2) = E[(α* − α)'Λ(α* − α)].    (3.19)

In practice, the same definitions are used in ordinary ridge regression as well. The reason for this is that the data are rescaled so that X'X is a correlation matrix before undergoing a principal components transformation. Transforming the rescaled variates is not a linear transformation of the original data.

To solve (3.18) and (3.19) we have, from (3.13) and (3.14),

    λ_i(k_i α_i^2 − σ^2) = 0    (3.20)

and

    λ_i^2(k_i α_i^2 − σ^2) = 0.    (3.21)

For both, the solution is

    k_i = σ^2/α_i^2.    (3.22)

Estimating the Ridge Optimum

Equation (3.22) is expressed in terms of unknown parameters and therefore has to be estimated. The obvious thing would be to replace σ^2 by σ̂^2 and α_i^2 by α̂_i^2, the least squares estimates, but ill-conditioning will tend to make α̂_i^2 larger than α_i^2. For a more satisfactory solution, Hoerl and Kennard (1970A) suggested the following iterative procedure:

(1) Estimate k_i using k_i = σ̂^2/α̂_i^2.
(2) Compute α_i* = (λ_i + k_i)^{-1}(Q'X'y)_i.
(3) Compute k_i = σ̂^2/(α_i*)^2.
(4) Repeat (2) and (3) until α_i* and k_i stabilize, i.e., until the iteration no longer changes them.

Note that the process does not attempt to reestimate σ̂^2. The reason for this is that, unconditionally, the maximum likelihood estimate of σ^2 is σ̂^2, and because the obvious re-estimation of σ^2 around α* will always exceed the least squares estimate. (A brief discussion of estimating σ^2 around the ridge estimator may be found in Obenchain (1978).)

As it happens, the convergence points of k_i and α_i* can be determined without actually iterating. Hemmerle (1975) first proved this (Appendix IV). Rather than his algebraic proof, a more intuitive approach is presented here. If the iteration converges somewhere, then

    α_i* = (λ_i + k_i)^{-1}(Q'X'y)_i    (3.23)

and

    k_i = σ̂^2/(α_i*)^2    (3.24)

must have the same values for k_i and α_i* at the convergence point. The simplest way to find this point is to determine where they do have the same values. Squaring (3.23) and eliminating (α_i*)^2 from both equations gives

    σ̂^2 k_i^2 + [2λ_i σ̂^2 − (Q'X'y)_i^2] k_i + λ_i^2 σ̂^2 = 0.    (3.25)

From the quadratic formula, the solution to this is

    k_i = { (Q'X'y)_i^2 − 2λ_i σ̂^2 ± √[(Q'X'y)_i^4 − 4λ_i σ̂^2 (Q'X'y)_i^2] } / (2σ̂^2),    (3.26)

which has two possible values.
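A minimal sketch of one component of this procedure, assuming Python with NumPy and hypothetical values of λ_i, (Q'X'y)_i, and σ̂^2, compares the Hoerl-Kennard iteration with the closed-form root (3.26) taken with the minus sign:

```python
import numpy as np

def hk_iterate(lam_i, z_i, sigma2, tol=1e-12, max_iter=500):
    """Hoerl-Kennard iteration for one component: k_i = sigma^2 / alpha_i*^2,
    alpha_i* = z_i / (lambda_i + k_i), started from the least squares value."""
    alpha = z_i / lam_i                      # least squares estimate
    for _ in range(max_iter):
        k = sigma2 / alpha**2
        alpha_new = z_i / (lam_i + k)
        if abs(alpha_new - alpha) < tol:
            break
        alpha = alpha_new
    return k, alpha

def hemmerle_k(lam_i, z_i, sigma2):
    """Closed-form convergence point (3.26), minus sign (the root k')."""
    disc = z_i**4 - 4.0 * lam_i * sigma2 * z_i**2
    if disc < 0:
        return None                          # no real root: iteration diverges
    return (z_i**2 - 2.0 * lam_i * sigma2 - np.sqrt(disc)) / (2.0 * sigma2)

lam_i, z_i, sigma2 = 0.5, 2.0, 0.2           # hypothetical; here
                                             # sigma^2/(lam_i*alpha_hat^2) < 1/4
print("iterated   :", hk_iterate(lam_i, z_i, sigma2))
print("closed form:", hemmerle_k(lam_i, z_i, sigma2))
```

Both computations land on the same k' (about 0.013 for these values), which anticipates the graphical argument that follows.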
To see which is correct, consider the curves defined by

    (α_i*)^2 = σ̂^2/k_i,  k_i > 0,    (3.27)

and

    (α_i*)^2 = (λ_i + k_i)^{-2}(Q'X'y)_i^2,  k_i > 0.    (3.28)

Call these curves c_1 and c_2, respectively. The following are true:

(1) As k_i approaches zero, c_1 > c_2, and
(2) as k_i goes to infinity, c_1 > c_2.

Figure A illustrates the case where σ̂^2/(λ_i α̂_i^2) < 1/4. There are two points of intersection, designated k' and k'', between c_1 and c_2. For this case the following is also true:

(3) The second derivative with respect to k of the difference between the inverses of c_1 and c_2 is a constant; along with relationships (1) and (2), this implies that c_1 and c_2 can intersect at most twice.

The Hoerl-Kennard iterative procedure can be expressed as two recursive formulas:

    k_i := σ̂^2/(α_i*)^2    (3.29)

and

    (α_i*)^2 := (λ_i + k_i)^{-2}(Q'X'y)_i^2,    (3.30)

where the ":=" indicates a computation rather than an identity. Then (3.29) represents a horizontal movement from c_2 to c_1, while (3.30) represents a vertical movement from c_1 to c_2. Note that this is true for all positive k_i.

The iterative process specified by (3.29) and (3.30) can be initiated at any positive k_i. If it is initiated precisely at k' or k'' there will of course be no change with iteration. If the starting point is at some k_i < k'' there will be convergence to k', either from the left or from the right. Initial values greater than k'' will cause k_i to increase indefinitely. Although (3.26) has two possible solutions, only one of the two is associated with convergence. This implies that the minus sign should always be chosen in (3.26).

Figure B shows the case where σ̂^2/(λ_i α̂_i^2) = 1/4. Convergence is from the left only, and c_1 and c_2 intersect in only one point. For Figure C, σ̂^2/(λ_i α̂_i^2) > 1/4, and there is no solution. If an iteration is attempted, k_i increases indefinitely and α_i* goes to zero. Appendix IV shows the equivalence of these results with those of Hemmerle.

Figure A: Hoerl-Kennard iteration on the ith term, α_i*, of a generalized ridge estimator; the case σ̂^2/(λ_i α̂_i^2) < 1/4.
Figure B: Hoerl-Kennard iteration on the ith term, α_i*, of a generalized ridge estimator; the case σ̂^2/(λ_i α̂_i^2) = 1/4.
Figure C: Hoerl-Kennard iteration on the ith term, α_i*, of a generalized ridge estimator; the case σ̂^2/(λ_i α̂_i^2) > 1/4.

An Alternative Optimum for Generalized Ridge Regression

The iteration of Hoerl and Kennard is based on the idea of finding a theoretical optimum (k_i = σ^2/α_i^2) and then estimating that optimum. An alternative method, given by Hemmerle and Brantle (1978), is to find an estimator for the optimization criterion and then optimize the estimator. The following derivation is based on the paper of Hemmerle and Brantle (1978).

From (3.9),

    E[(α* − α)'(α* − α)] = Σ (σ^2 λ_i + k_i^2 α_i^2)/(λ_i + k_i)^2.    (3.31)

Since α_i* = (λ_i/(λ_i + k_i)) α̂_i,

    E(α_i* − α̂_i)^2 = E[(1 − λ_i/(λ_i + k_i))^2 α̂_i^2] = (k_i/(λ_i + k_i))^2 (σ^2/λ_i + α_i^2);

therefore

    E[(α* − α̂)'(α* − α̂)] = Σ (k_i/(λ_i + k_i))^2 (σ^2/λ_i + α_i^2).    (3.32)

Combining (3.32) and (3.31),

    E[(α* − α)'(α* − α)] = E[(α* − α̂)'(α* − α̂)] + σ^2 Σ (λ_i − k_i)/(λ_i(λ_i + k_i)).    (3.33)

Recalling that the λ_i are the diagonal elements of Λ, E(L_1^2) can be estimated unbiasedly by

    L̂_1^2 = (α* − α̂)'(α* − α̂) + σ̂^2 trace[(Λ − K)(Λ(Λ + K))^{-1}].    (3.34)

Similarly,

    E[(α* − α)'Λ(α* − α)] = E[(α* − α̂)'Λ(α* − α̂)] + σ^2 trace[(Λ − K)(Λ + K)^{-1}],    (3.35)

so that

    L̂_2^2 = (α* − α̂)'Λ(α* − α̂) + σ̂^2 trace[(Λ − K)(Λ + K)^{-1}]

is an unbiased estimator of E(L_2^2).

Define

    v_i = λ_i/(λ_i + k_i),    (3.36)

so that α_i* = v_i α̂_i. The ith component, M_i, of (3.34) may be written as

    M_i = (1 − v_i)^2 α̂_i^2 + σ̂^2 (2v_i − 1)/λ_i,    (3.37)

where L̂_1^2 = Σ M_i and L̂_2^2 = Σ λ_i M_i.

To minimize L̂_1^2 and L̂_2^2, differentiate M_i with respect to v_i:

    dM_i/dv_i = −2(1 − v_i) α̂_i^2 + 2σ̂^2/λ_i = 0,    (3.38)

or, equivalently,

    v_i = 1 − σ̂^2/(λ_i α̂_i^2).    (3.39)

Since v_i as defined in (3.36) must lie between zero and one, it follows that (3.39) cannot be used for optimization if σ̂^2/(λ_i α̂_i^2) > 1. However, note that M_i is quadratic in v_i and so increases monotonically as v_i moves away from the minimum point. Since the object is to minimize M_i, it follows that v_i should be as close as possible to the optimum point. Hence the solution is

    v_i = 1 − σ̂^2/(λ_i α̂_i^2)  if σ̂^2/(λ_i α̂_i^2) ≤ 1;  v_i = 0 otherwise,    (3.40)

which corresponds to

    α_i* = α̂_i [1 − σ̂^2/(λ_i α̂_i^2)]  if σ̂^2/(λ_i α̂_i^2) ≤ 1;  α_i* = 0 otherwise.    (3.41)
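Because (3.41) is explicit, it is a one-line computation. A minimal sketch, assuming Python with NumPy and hypothetical estimates:

```python
import numpy as np

def hemmerle_brantle(alpha_hat, lam, sigma2_hat):
    """Explicit generalized ridge solution (3.41):
    alpha_i* = alpha_hat_i * (1 - sigma^2/(lambda_i alpha_hat_i^2)) when that
    ratio is at most one, and 0 otherwise (the boundary of 0 <= v_i <= 1)."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    lam = np.asarray(lam, dtype=float)
    ratio = sigma2_hat / (lam * alpha_hat**2)
    v = np.where(ratio <= 1.0, 1.0 - ratio, 0.0)
    return v * alpha_hat

# Hypothetical transformed least squares estimates and eigenvalues:
alpha_hat = np.array([2.0, 0.8, 0.05])
lam = np.array([1.5, 0.4, 0.02])
print(hemmerle_brantle(alpha_hat, lam, sigma2_hat=0.1))
# The last component has sigma^2/(lambda alpha^2) > 1 and is shrunk to zero.
```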
The case where σ̂^2/(λ_i α̂_i^2) > 1 is similar to the case where α_i* is constrained for other reasons, such as taking into account prior information about α_i. Assuming the constraint excludes the optimum from the permissible solution region, α_i* will lie on the boundary. For example, if α_i* is constrained so that α_i* ≥ A and if α̂_i[1 − σ̂^2/(λ_i α̂_i^2)] < A, then the solution will be α_i* = A.

In practice, it is unlikely that constraints will be applied directly to the α_i*, but it is quite possible that the β_i* will be constrained, since there could easily be prior information about the β_i (recall that the β_i are related to the nontransformed predictor variables). Optimization when the β_i* are constrained is difficult, however, because inequalities become complicated under linear transformations. A constraint on one β_i* will transform into constraints on several α_i*, with the possible solutions for any one variable partially dependent on what solutions are chosen for the other variables. Hemmerle and Brantle (1978) considered this problem and proposed a quadratic programming algorithm as a solution. The details are given in their paper.

Chapter 4

OPTIMIZING THE ORDINARY RIDGE ESTIMATOR

The preceding chapter motivated the generalized ridge estimator through the impossibility of obtaining algebraic solutions to (3.13) or (3.14). Nonetheless, the ordinary ridge estimator is considered useful for certain types of problems, so optimizing it is of some interest. A number of solutions have been proposed, and some of them will be considered in this chapter.

Ridge Solutions for the E(L_1^2) Criterion

The condition for minimizing E(L_1^2) is given by equation (3.13). An algebraic solution for (3.13) does not exist, but it is possible to solve it numerically. The obvious choice would be Newton-Raphson iteration. Recall that (3.13) is

    f(k) = Σ λ_i(k α_i^2 − σ^2)/(λ_i + k)^3 = 0.

Then

    f'(k) = Σ [3σ^2 λ_i + λ_i α_i^2(λ_i − 2k)]/(λ_i + k)^4,    (4.1)

and the solution is

    k_{j+1} = k_j − f(k_j)/f'(k_j),    (4.2)

where k_j is the value of k at the jth iteration and k_1 = 0. Iteration continues until convergence is achieved.

The difficulty with this solution is that it is time-consuming. Since the α_i must be estimated, and since α̂_i has too large an absolute value for ill-conditioned data, the Newton-Raphson iteration must be nested within the framework of a Hoerl-Kennard iteration, which naturally involves a great deal of computation. For this reason the solution has not been widely used. Dempster, et al. (1977) have raised other objections to its use, but not in sufficient detail to permit an evaluation of their merit.
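The inner Newton-Raphson step (4.1)-(4.2) by itself is straightforward. A minimal sketch, assuming Python with NumPy and treating the α_i as known (in practice they would be re-estimated in the outer Hoerl-Kennard loop):

```python
import numpy as np

def newton_k(lam, alpha, sigma2, k0=0.0, tol=1e-10, max_iter=100):
    """Newton-Raphson for the E(L1^2) condition (3.13):
    f(k)  = sum lambda_i (k alpha_i^2 - sigma^2) / (lambda_i + k)^3,
    f'(k) = sum (3 sigma^2 lambda_i + lambda_i alpha_i^2 (lambda_i - 2k))
            / (lambda_i + k)^4, as in (4.1)-(4.2), started at k_1 = 0."""
    k = k0
    for _ in range(max_iter):
        f = np.sum(lam * (k * alpha**2 - sigma2) / (lam + k) ** 3)
        fp = np.sum((3 * sigma2 * lam + lam * alpha**2 * (lam - 2 * k))
                    / (lam + k) ** 4)
        k_new = k - f / fp
        if abs(k_new - k) < tol:
            return k_new
        k = k_new
    return k

lam = np.array([1.5, 0.4, 0.1])     # hypothetical eigenvalues
alpha = np.array([1.0, 0.5, 2.0])   # hypothetical transformed coefficients,
                                    # here taken as known rather than estimated
print("E(L1^2)-optimal k:", newton_k(lam, alpha, sigma2=1.0))
```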
In contrast to this doubly iterative procedure, a solution proposed by Wichern and Churchill (1978) is extremely simple, but admittedly not optimal. As was pointed out in Hoerl and Kennard (1970A), (3.13) is negative for k < σ^2/(α_max)^2, where |α_max| is the largest of the |α_i|. Since this is so, and since (3.13) is negative from k = 0 up to the point of minimization, a solution based on k = σ^2/(α_max)^2 will have a smaller mean square error than the least squares solution. This suggests

    k = σ̂^2/(α̂_max)^2    (4.3)

as a possible biasing parameter. Since the mean square error will be minimized by some k with σ^2/(α_max)^2 < k < σ^2/(α_min)^2, (4.3) is somewhat conservative, meaning that it does not produce as much bias as would be theoretically optimal. Wichern and Churchill attribute (4.3) to Hoerl and Kennard, although Hoerl and Kennard did not actually suggest it as an estimator. To avoid confusion, it will be referred to here as the Hoerl-Kennard conservative estimator.

Hoerl, Kennard, and Baldwin (1975) found an algebraic solution to (3.13) by assuming, somewhat unrealistically, that X'X is an identity matrix. In that case λ_i = 1 for all i, so (3.13) gives

    k Σ α_i^2 − p σ^2 = 0,    (4.4)

and then

    k = p σ^2 / Σ α_i^2 = p σ^2 / β'β.    (4.5)

Another way to obtain this result, also given in Hoerl, et al. (1975), is to use the harmonic mean of the optimal k_i given by equation (3.22), i.e., k_i = σ^2/α_i^2. If k_h is the harmonic mean, then

    1/k_h = (1/p) Σ 1/k_i = (1/p) Σ α_i^2/σ^2 = Σ α_i^2/(p σ^2).    (4.6)

Therefore, k_h = p σ^2/α'α = p σ^2/β'β.

Hoerl, et al. proposed that the least squares estimator for β'β should be used in (4.5). Later, Hoerl and Kennard (1976) noted that, since β̂'β̂ is not a good estimate of β'β when the data are ill-conditioned, an iterative process similar to their earlier one (1970A) should be used.

The E(L_1^2) Solution of Hocking, Speed, and Lynn

Hocking, et al. (1977) considered a class of estimators expressible as

    α* = B α̂,    (4.7)

where α̂ is, as before, the transformed least squares linear estimator, and where B is a (p×p) diagonal matrix. For ridge regression, the ith element of B is

    b_i = (1 + k_i/λ_i)^{-1},    (4.8)

which is equivalent to

    k_i = λ_i(1 − b_i)/b_i.    (4.9)

For ordinary ridge regression, k_i = k for every i. In other words,

    λ_i(1 − b_i)/b_i = λ_p(1 − b_p)/b_p,  i = 1, ..., p−1.    (4.10)

There is thus a constant ratio between λ_i and b_i/(1 − b_i). From this, Hocking, et al. (1977) suggested that the k_i could be combined through a least squares formula, i.e.,

    k = [Σ λ_i b_i/(1 − b_i)] / [Σ b_i^2/(1 − b_i)^2].    (4.11)

This is derived from setting λ_i as the dependent variable and b_i/(1 − b_i) as the independent variable. It would be equally feasible to do the reverse, obtaining

    k = [Σ λ_i b_i/(1 − b_i)] / Σ λ_i^2.    (4.12)

It is necessary to determine values for the b_i before (4.11) can be evaluated. Hocking, et al. followed the procedure of Hoerl, et al. (1975) and used the individual optima of (3.22): k_i = σ^2/α_i^2. This can also be given as k_i = λ_i σ^2/(λ_i α_i^2), so, in accord with (4.9),

    (1 − b_i)/b_i = σ^2/(λ_i α_i^2).    (4.13)

Hence (4.11) and (4.12) become, respectively,

    k = [Σ λ_i^2 α_i^2/σ^2] / [Σ λ_i^2 α_i^4/σ^4] = σ^2 Σ λ_i^2 α_i^2 / Σ λ_i^2 α_i^4    (4.14)

and

    k = (1/σ^2) Σ λ_i^2 α_i^2 / Σ λ_i^2.    (4.15)

Least squares estimates could be used for the α_i; alternatively, an iterative process similar to those of Hoerl and Kennard could be used.
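Both (4.5) and (4.14) are simple plug-in formulas. A minimal sketch, assuming Python with NumPy and hypothetical least squares estimates:

```python
import numpy as np

def k_hkb(beta_hat, sigma2_hat):
    """Hoerl-Kennard-Baldwin (4.5): k = p * sigma^2 / (beta'beta)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    return beta_hat.size * sigma2_hat / (beta_hat @ beta_hat)

def k_hsl(lam, alpha_hat, sigma2_hat):
    """Hocking-Speed-Lynn (4.14):
    k = sigma^2 * sum(lambda_i^2 alpha_i^2) / sum(lambda_i^2 alpha_i^4)."""
    lam = np.asarray(lam, dtype=float)
    a2 = np.asarray(alpha_hat, dtype=float) ** 2
    return sigma2_hat * np.sum(lam**2 * a2) / np.sum(lam**2 * a2**2)

beta_hat = np.array([2.0, -1.5, 0.3])   # hypothetical least squares estimates
lam = np.array([1.5, 0.4, 0.1])         # hypothetical eigenvalues of X'X
alpha_hat = np.array([1.0, 0.5, 2.0])   # hypothetical transformed estimates
print("k (HKB)           :", k_hkb(beta_hat, sigma2_hat=0.25))
print("k (Hocking et al.):", k_hsl(lam, alpha_hat, sigma2_hat=0.25))
```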
Mallows' E(L_2^2) Ridge Solution

Some of the results for the E(L_1^2) criterion can be extended to E(L_2^2): the Newton-Raphson technique is applicable to (3.14), for example, and the results of Hocking, et al. can apply to either criterion. A more interesting approach is that of Mallows (1973).

Mallows' solution begins with the "scaled summed mean squared error," defined as

    J_k = (1/σ^2)(β* − β)'X'X(β* − β).    (4.16)

It should be noticed that J_k differs from L_2^2 only by the constant 1/σ^2. From Appendix III,

    E(J_k) = V_k + B_k,    (4.17)

where

    V_k = Σ λ_i^2/(λ_i + k)^2    (4.18)

and

    B_k = (k^2/σ^2) Σ λ_i α_i^2/(λ_i + k)^2.    (4.19)

The residual sum of squares is

    RSS_k = (y − Xβ*)'(y − Xβ*),    (4.20)

from which

    E(RSS_k) = σ^2 V*_k + σ^2 B_k,    (4.21)

where

    V*_k = n − 2 Σ λ_i/(λ_i + k) + Σ λ_i^2/(λ_i + k)^2.    (4.22)

The estimator of E(J_k) is thus

    Ĵ_k = RSS_k/σ̂^2 − V*_k + V_k,    (4.23)

which is to be minimized. Since the residual sum of squares about the least squares model is constant, it can be disregarded in the minimization. Let

    T_k = (RSS_k − RSS_0)/σ̂^2 + 2 Σ λ_i/(λ_i + k)    (4.24)

be the variable to be minimized. From (4.20), (4.23), and (4.24),

    T_k = (1/σ̂^2)(β* − β̂)'X'X(β* − β̂) + 2 Σ λ_i/(λ_i + k).    (4.26)

Recall that Q was defined earlier as the matrix of eigenvectors such that Q'X'XQ = Λ, the diagonal matrix of eigenvalues λ_i of X'X. Transforming (4.26) by Q and letting z = Q'X'y,

    T_k = (1/σ̂^2) Σ λ_i[z_i/(λ_i + k) − z_i/λ_i]^2 + 2 Σ λ_i/(λ_i + k).    (4.27)

To minimize this with respect to k,

    dT_k/dk = (2k/σ̂^2) Σ z_i^2/(λ_i + k)^3 − 2 Σ λ_i/(λ_i + k)^2 = 0,    (4.28)

for which Mallows proposed a closed-form solution, (4.29)-(4.30). This solution does not seem to be correct. However, (4.28) can be solved numerically (Appendix V).

Mallows' solution is somewhat similar to the generalized ridge solution of Hemmerle and Brantle (1978) in that it optimizes an estimator rather than estimating a theoretical optimum.

The Lawless-Wang Estimator

Lawless and Wang (1976) derive a solution for ordinary ridge regression as follows. Defining α as before to be the transformation of β, i.e., α = Q'β, suppose α is a multivariate normal random variable with a distribution defined by

    α ~ N(0_p, σ_α^2 I_p),    (4.31)

where p is the dimensionality of β. The Bayesian estimator for α is α*, where

    α_i* = [λ_i/(λ_i + σ^2/σ_α^2)] α̂_i,  i = 1, ..., p    (4.32)

(Goldstein and Smith, 1974). To estimate σ^2/σ_α^2, first notice that if α ~ N(0_p, σ_α^2 I_p), then

    E(Σ λ_i α̂_i^2) = p σ^2 + (Σ λ_i) σ_α^2.

Since X'X is in correlation form, Σ λ_i = p. Therefore,

    E[Σ λ_i α̂_i^2/(p σ^2)] − 1 = σ_α^2/σ^2.

Since σ_α^2 is expected to be much greater than σ^2, σ_α^2/σ^2 can be estimated by Σ λ_i α̂_i^2/(p σ̂^2). The Lawless-Wang optimum is thus

    k = σ̂^2/σ̂_α^2 = p σ̂^2 / Σ λ_i α̂_i^2.    (4.33)

The McDonald-Galarneau Estimator

In contrast to the estimators discussed above, the estimator of McDonald and Galarneau (1975) is not based on considerations of mean square error. Instead, it is intended to find the solution whose norm is as close to the norm of β as possible. To determine this solution, note that

    E(β̂'β̂) = β'β + σ^2 Σ 1/λ_i,

or, equivalently,

    β'β = E(β̂'β̂) − σ^2 Σ 1/λ_i.

Thus, an estimator of β'β is

    β̂'β̂ − σ̂^2 Σ 1/λ_i.    (4.34)

Equation (4.34) cannot be solved algebraically for the general case, and McDonald and Galarneau suggested a trial-and-error process involving 201 values in the interval (0,1). Iteration would be a more efficient means, however, and the Newton-Raphson method will work for this problem. Specifically, the problem is to find k such that

    β*'β* = β̂'β̂ − σ̂^2 Σ 1/λ_i,    (4.35)

or, equivalently,

    f(k) = Σ (λ_i + k)^{-2}(Q'X'y)_i^2 − [β̂'β̂ − σ̂^2 Σ 1/λ_i] = 0.    (4.36)

Iterate as follows:

    k_{j+1} = k_j − f(k_j)/f'(k_j),

where

    f'(k_j) = −2 Σ (λ_i + k_j)^{-3}(Q'X'y)_i^2,

k_j being the value of k at the jth iteration.
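A minimal sketch of the Lawless-Wang formula (4.33) and the McDonald-Galarneau Newton-Raphson iteration (4.36), assuming Python with NumPy and hypothetical values:

```python
import numpy as np

def k_lawless_wang(lam, alpha_hat, sigma2_hat):
    """Lawless-Wang (4.33): k = p * sigma^2 / sum(lambda_i * alpha_i^2)."""
    return lam.size * sigma2_hat / np.sum(lam * alpha_hat**2)

def k_mcdonald_galarneau(lam, z, beta_hat, sigma2_hat, k0=0.0, tol=1e-10):
    """Newton-Raphson for the McDonald-Galarneau condition (4.36):
    f(k) = sum z_i^2/(lambda_i+k)^2 - (beta'beta - sigma^2 sum 1/lambda_i),
    where z = Q'X'y. Requires the estimated norm (4.34) to be positive."""
    target = beta_hat @ beta_hat - sigma2_hat * np.sum(1.0 / lam)
    if target <= 0:
        raise ValueError("estimated beta'beta is not positive; no solution")
    k = k0
    for _ in range(200):
        f = np.sum(z**2 / (lam + k) ** 2) - target
        fp = -2.0 * np.sum(z**2 / (lam + k) ** 3)
        k_new = k - f / fp
        if abs(k_new - k) < tol:
            return k_new
        k = k_new
    return k

lam = np.array([1.5, 0.4, 0.1])          # hypothetical eigenvalues
alpha_hat = np.array([1.0, 0.5, 2.0])    # hypothetical transformed estimates
z = lam * alpha_hat                      # since alpha_hat = z / lambda
beta_hat = alpha_hat                     # norms agree: beta'beta = alpha'alpha
sigma2_hat = 0.05
print("k (Lawless-Wang)      :", k_lawless_wang(lam, alpha_hat, sigma2_hat))
print("k (McDonald-Galarneau):", k_mcdonald_galarneau(lam, z, beta_hat, sigma2_hat))
```

Because f(k) in (4.36) is decreasing and convex in k, the Newton steps starting from k = 0 approach the root from the left without overshooting.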
The method of Obenchain (1978) 38 permits a ~* to be determined fo~ any desired probability level, say, f, that ~* is a better estimate than least squares estimator. S, the Consider a Scheffe confidence ellipsoid (Scheffe, 1961) about the least squares estimator. Searle (1971) gives as an f-level confidence A region about _@. ( 4. 37) where p is the nQmber of independent variables, n is the number of observations, and F(p,n-p-l;f) is the value of the F distribution with p and (n-p-1) degrees of freedom, A having a probability of f. If 8 is the least-squares es- timator, this ellipsoid covers the· true unknown value of ~with proability 1-f. The ridge trace is a subset of points in the space. ~- It may be visualized as a path running from the least-squares estimator (where k (where k is infinite) . = ~ 0) to the point = Q_ Assuming it intersects the boun- dary of the confidence ellipsoid, it will do so at only one point. This point Obenchain chooses as his estimator. Expressed as a formula, his solution is: choose 8* such that (8*-S} 'X'X(8*-S) -·----- = 2 p& F(p,n-p-l;f). (4.38) Obenchain's solution must be interpreted with a certain amount of caution. ability that ~* The F value is not the prob- is a better estimate of 8 than is A ~' 39 Ll KE L\ HCC:D SPAC.E" .4-- \OO(\-~)<!Jo c.o~ F=\ DENC.E RE<SlON Figure D: Obenchain's method: ridge trace and accompanying 100(1-f}% confidence region for a two-variable example. (Adapted from Obenchain, 1978} 40 because points outside the confidence ellipsoid may very A well be closer to S than to ~*(see FigureD). Further, if it is assumed that ~ lies somewhere along the ridge trace, the F value is still not the probability that ~* is a better estimate than is S. The reason is that a distribution (in this case the F) which holds for a space will not generally hold for a onedimensional path through that space. In practice, it is easiest to evaluate (4.38) by choosing an arbitrary ated F value. ~* and then determining the associ- This fact, as well as the fact that a satisfactory probability would be difficult to set without some prior knowledge of the associated k, make it difficult to view Obenchain's estimator as a point solution. It appears instead to be an additional means of examining the ridge trace. Vinod's Estimator for Ordinary Ridge Regression Vinod's estimator for k was presented in the preceding chapter. To repeat it here, it is as follows: Find a k which minimizes ? E(p(~./(~.+k) 1 1 1 2 - ? )s~.-1)~~ where - s = 1' r~./(~.+k) 1 1 1 2 . I (4.39) 41 This cannot be solved algebraically, and so must Wichern and Churchill (1978) examined a be estimated. range of values and chose the most satisfactory solution. Assuming there was only one local minimum, Newton-Raphson iteration would work, since (4.39) can be differentiated with respect to k. The result is a bit complicated, but a convergence scheme is more efficient than examining a set of values and choosing the smallest one. A Review and Evaluation of the Ordinary Ridge Solutions Because of the number of solutions for ordinary ridge regression and the length of some of the derivations, a quick overview would be helpful. r1ost of the estimators presented above are intended to minimize the mean square error, either in a classical or Bayesian sense. Vinod's (1975) 1 The exceptions to this are which is based on stability, McDonald and Galarneau's (1975) 1 which is based on estimator norm, and Obenchain's (1978), which is based on confidence intervals. 
Vinod's Estimator for Ordinary Ridge Regression

Vinod's estimator for k was presented in the preceding chapter. To repeat it here: find a k which minimizes

    ISRM = Σ_{i=1}^{p} [p λ_i/((λ_i + k)^2 s) − 1]^2,    (4.39)

where

    s = Σ_{i=1}^{p} λ_i/(λ_i + k)^2.

This cannot be solved algebraically, and so must be estimated. Wichern and Churchill (1978) examined a range of values and chose the most satisfactory solution. Assuming there is only one local minimum, Newton-Raphson iteration would work, since (4.39) can be differentiated with respect to k. The result is a bit complicated, but a convergence scheme is more efficient than examining a set of values and choosing the smallest one.

A Review and Evaluation of the Ordinary Ridge Solutions

Because of the number of solutions for ordinary ridge regression and the length of some of the derivations, a quick overview would be helpful. Most of the estimators presented above are intended to minimize the mean square error, either in a classical or a Bayesian sense. The exceptions to this are Vinod's (1975), which is based on stability; McDonald and Galarneau's (1975), which is based on estimator norm; and Obenchain's (1978), which is based on confidence intervals.

Of the estimators based on the classical mean square error (E(L_1^2)), the Hoerl-Kennard theoretical optimum and the Hoerl-Kennard conservative estimate were derivable without requiring further assumptions. The Hoerl-Kennard conservative solution is intended to be better than the least squares solution but does not minimize the mean square error. The solutions of Hoerl, et al. (1975), Hoerl and Kennard (1976), and Hocking, et al. (1977) are intended to minimize the mean square error. However, they are all arbitrary to a certain degree, or else require additional assumptions, such as the assumption by Hoerl, et al. that the predictor variables are uncorrelated.

The solutions of Lawless and Wang (1976) and Mallows (1973) are optimal in a Bayesian sense. The Bayesian assumption for Lawless and Wang's solution was given in its derivation, while that of Mallows' solution is implicit in the E(L_2^2) criterion itself. Stein (1960) and Efron and Morris (1973) detail this Bayesian approach.

The question of how well these estimators perform in practice has been considered by McDonald and Galarneau (1975), Hemmerle and Brantle (1978), and Wichern and Churchill (1978). All three studies used the simulation method of McDonald and Galarneau (1975). Wichern and Churchill's study was the most thorough and comprehensive, comparing the least squares estimator with five ridge solutions. Of these five, the McDonald-Galarneau estimator and the Hoerl-Kennard conservative estimator were the most consistent in reducing the mean square error associated with the least squares solution. The other ridge estimators considered were those of Hoerl, et al. (1975), Lawless and Wang (1976), and Vinod (1975). All three were quite variable in terms of performance, especially Vinod's. Appendix VII presents the results of the Wichern-Churchill study in more detail.

Chapter 5

CONCLUSIONS

As was indicated in Chapter 1, the linear least squares estimator suffers from a number of drawbacks when the predictor variables X_i are collinear. These drawbacks include instability for minor changes in data, β̂_i whose absolute values are too large, and a large mean square error. Yamamura (1977) reviews and discusses these problems.

Among the techniques which can be used to deal with collinearity is ridge regression, a term which actually refers to two closely related estimators. These are the ordinary ridge estimator

    β* = (X'X + kI)^{-1} X'y

and the generalized ridge estimator

    α* = (Λ + K)^{-1} Q'X'y,

where Λ is the matrix of eigenvalues of X'X, Q is the matrix of associated eigenvectors, and K is a diagonal matrix whose nonzero terms k_i are positive but not necessarily equal.

An important problem in ridge regression is choosing the biasing parameter (k for ordinary ridge, k_i for generalized ridge). In the case of generalized ridge this comes down to a choice between two possible solutions: one based on estimating an optimum and the other based on optimizing an estimator. There is no reason to consider either of these superior in general, though the second is more conservative and may be better for extremely large variances.

Ordinary ridge regression has a known theoretical optimum, but it is difficult to solve. As a result, there is interest in finding other solutions, often based on combinations of the optimal k_i from the generalized ridge estimates. Due to the arbitrary nature of these combinations, it is hard to justify them or evaluate them theoretically; therefore comparisons based on simulation are of considerable importance.
Some work of this sort has already been done, but no estimator evaluated so far has been consistently superior to the others, and it is likely that a consistently superior ordinary ridge solution will not be found. The current feeling seems to be that more than one solution should be examined when ridge regression is used.

BIBLIOGRAPHY

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis. New York: John Wiley and Sons, 1958.
2. Box, G. E. P. and Wilson, K. B. "On the Experimental Attainment of Optimum Conditions." Journal of the Royal Statistical Society, Series B. 13: 1-45, 1951.
3. Brown, P. J. "Centering and Scaling in Ridge Regression." Technometrics. 19: 35-36, 1977.
4. Dempster, A. P., Schatzoff, M., and Wermuth, N. "A Simulation Study of Alternatives to Ordinary Least Squares." Journal of the American Statistical Association. 72: 77-90, 1977.
5. Draper, N. R. "Ridge Analysis of Response Surfaces." Technometrics. 5: 469-479, 1963.
6. Efron, B. and Morris, C. "Stein's Rule and Its Competitors--An Empirical Bayes Approach." Journal of the American Statistical Association. 68: 117-130, 1973.
7. Efroymson, M. A. "Multiple Regression Analysis." Mathematical Methods for Digital Computers, Vol. 1. A. Ralston (ed). New York: John Wiley and Sons, 1960, pp. 191-203.
8. Farrar, D. E. and Glauber, R. R. "Multicollinearity in Regression Analysis: The Problem Revisited." The Review of Economics and Statistics. 49: 92-107, 1967.
9. Garside, M. J. "The Best Subset in Multiple Regression Analysis." Applied Statistics. 14: 196-200, 1965.
10. Goldstein, M. and Smith, A. F. M. "Ridge-Type Estimators for Regression Analysis." Journal of the Royal Statistical Society, Series B. 36: 284-291, 1974.
11. Hemmerle, W. J. "An Explicit Solution for Generalized Ridge Regression." Technometrics. 17: 309-314, 1975.
12. Hemmerle, W. J. and Brantle, T. F. "Explicit and Constrained Generalized Ridge Estimators." Technometrics. 20: 109-120, 1978.
13. Hocking, R. R., Speed, F. M., and Lynn, M. J. "A Class of Biased Estimators in Linear Regression." Technometrics. 18: 425-438, 1976.
14. Hoerl, A. E. "Optimum Solution of Many Variables Equations." Chemical Engineering Progress. 55: 69-78, 1959.
15. Hoerl, A. E. "Applications of Ridge Analysis to Regression Problems." Chemical Engineering Progress. 58: 54-59, 1962.
16. Hoerl, A. E. and Kennard, R. W. "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics. 12: 55-67, 1970A.
17. Hoerl, A. E. and Kennard, R. W. "Ridge Regression: Application to Nonorthogonal Problems." Technometrics. 12: 69-82, 1970B.
18. Hoerl, A. E. and Kennard, R. W. "Ridge Regression: Iterative Estimation of the Biasing Parameter." Communications in Statistics. 5: 77-88, 1976.
19. Hoerl, A. E., Kennard, R. W., and Baldwin, K. F. "Ridge Regression: Some Simulations." Communications in Statistics. 4: 105-123, 1975.
20. Hotelling, H. "Analysis of a Complex of Statistical Variables Into Principal Components." Journal of Educational Psychology. 24: 417-441, 498-520, 1933.
21. Hotelling, H. "Simplified Calculation of Principal Components." Psychometrika. 1: 27-35, 1936.
22. Kempthorne, O. Discussion on a paper by D. V. Lindley and A. F. M. Smith. Journal of the Royal Statistical Society, Series B. 34: 33-36, 1972.
23. Lawless, J. F. and Wang, P. "A Simulation Study of Ridge and Other Estimators." Communications in Statistics. 5: 307-323, 1976.
24. Lindley, D. V. and Smith, A. F. M. "Bayes Estimators for the Linear Model" (with discussion). Journal of the Royal Statistical Society, Series B. 34: 1-41, 1972.
25. Mallows, C. L. "Some Comments on C_p." Technometrics. 15: 661-675, 1973.
26. Marquardt, D. W. "Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation." Technometrics. 12: 591-611, 1970.
27. Mason, R. L., Gunst, R. F., and Webster, J. T. "Regression Analysis and Problems with Multicollinearity." Communications in Statistics. 4: 277-292, 1975.
28. Massy, W. F. "Principal Components Regression in Exploratory Statistical Research." Journal of the American Statistical Association. 60: 234-256, 1965.
29. Mayer, L. S. and Wilke, T. A. "On Biased Estimation in Linear Models." Technometrics. 15: 497-508, 1973.
30. McDonald, G. C. and Galarneau, D. J. "A Monte Carlo Evaluation of Some Ridge-Type Estimators." Journal of the American Statistical Association. 70: 407-416, 1975.
31. McDonald, G. C. and Schwing, R. C. "Instabilities of Regression Estimates Relating Air Pollution to Mortality." Technometrics. 15: 463-481, 1973.
32. Morrison, D. F. Multivariate Statistical Methods. New York: McGraw-Hill, 1967.
33. Newhouse, J. P. and Oman, S. D. "An Evaluation of Ridge Estimators." Rand Report No. R-716-PR: 1-28, 1971.
34. Obenchain, R. L. "Classical F-Tests and Confidence Regions for Ridge Regression." Technometrics. 19: 429-439, 1977.
35. Pearson, K. "On Lines and Planes of Closest Fit to Systems of Points in Space." Philosophical Magazine, Series 6. 2: 559-572, 1901.
36. Rao, C. R. Linear Statistical Inference and Its Applications, 2nd ed. New York: John Wiley and Sons, 1970, pp. 294-305.
37. Scheffé, H. The Analysis of Variance. New York: John Wiley and Sons, 1959.
38. Sclove, S. L. "Improved Estimators for Coefficients in Linear Regression." Journal of the American Statistical Association. 63: 597-606, 1968.
39. Searle, S. R. Linear Models. New York: John Wiley and Sons, 1971, pp. 100-116.
40. Silvey, S. D. "Multicollinearity and Imprecise Estimation." Journal of the Royal Statistical Society, Series B. 31: 539-552, 1969.
41. Snee, R. "Some Aspects of Nonorthogonal Data Analysis." Journal of Quality Technology. 5: 67-79, 1973.
42. Stein, C. "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution." Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Berkeley: University of California Press, 1956, pp. 197-206.
43. Stein, C. "Multiple Regression." Contributions to Probability and Statistics. I. Olkin (ed). Stanford: Stanford University Press, 1960, pp. 424-443.
44. Wichern, D. W. and Churchill, G. A. "A Comparison of Ridge Estimators." Technometrics. 20: 301-311, 1978.
45. Yamamura, A. M. "Ridge Regression: An Answer to Multicollinearity." Unpublished Master's Thesis. California State University, Northridge, 1977.

APPENDIX I

PRINCIPAL COMPONENTS REGRESSION

Let X be an (n×p) matrix of n observations on p variables standardized so that X'X is a correlation matrix. Suppose one wishes to find an (n×1) vector which accounts for as much sample variance as possible. An algebraic statement of the problem is the following: find a linear compound

    Y_1 = Xq_1,    (1)

where Y_1 is (n×1), q_1 is a (p×1) vector, and X_i is the ith column of X, such that the sample variance

    s^2_{Y_1} = q_1'X'Xq_1 = Σ_i Σ_j q_{1i} q_{1j} s_{ij}    (2)

is maximized subject to q_1'q_1 = 1. Here s_{ij} is the (i,j)th element of X'X, i.e., s_{ij} is the sample covariance of X_i and X_j.

The solution of the problem is to use the Lagrange multiplier λ_1:

    F = q_1'X'Xq_1 − λ_1(q_1'q_1 − 1).    (3)
·To maximize, set (3) to zero, which gives p simultaneous equations (4) The system of equations given by (4) is solved by choosing A. 1 such that = IX'X-A II - - 1- It follows that A. 1 0 (5) is an eigenvalue of X'X. Premultiply- ing ( 4) by Q • , 1 (6) and, recalling t.hat g_ • Q 1 1 A1 = =1 0 'X'XQ - -1 = Therefore, Q 1 = 1. 2 (7) SYl . is the eigenvec~or associated with A. , and the magnitude of the sample variance sy 1 by A. 1 • 2 1 is given 2 Since the intention is to maximize sYl' A. 1 is the largest eigenvalue of X'X. The first principal component accounts for the maximum possible variance in the observations. Proceeding inductively, the second principal component accounts for as much of the remaining variance as possible. ally, the problem is: Algebraic- find the linear compound (8} 52 such that the sample variance = L:L:q. q. s .. 2 J 2 lJ ij l = Q 'X'XQ -2 - --2 (9} is maximized subject to the constraints o1 •o 2 = 0. o2 •o 2 = 1 and The first of these is as before a standardiza- tion of length, while the second constrains Q to be 2 orthogonal to 0 1 . where \ o1 • 2 and \ Again a Lagrangian system is used: are the multipliers. 3 Premultiplying by and setting to zero. 2Q 'X'XQ +\ -1 -- - 2 3 =0 Similar premultiplication of {4} by ( 11} o2 • implies that Q 'X'XQ --2 -1 - and hence A 3 = = 0 0. (12) The second component thus satisfied (X'X-\2_!_)Q2 = 0 and is solved by IX'X-\2.!.1 = 0. ( 13) 53 Further, premultiplying (10) by Q ' and recalling that 2 1.. 3 == 0, Q 'X'XQ -2--2 Since sy that sYl that 1.. 2 (14) 2 2 is to be maximized subject to the fact has already been accounted for by 1.. 1 , it follows 2 is the second largest eigenvalue of X'X, and that 0 2 is the associated eigenvector. The third principal com- ponent will similarly be determined by the eigenvector associated with the third largest eigenvalue and so forth. Principal components are orthog-onal to each other and, beginning with the first component, each accounts for as much as possible of the variance that has not been accounted for by the preceding components. Geometrically, they correspond with the principal axes of the data ellipsoid determined by X. Principal components regression is linear regression which uses the principal components as independent variables. Let y be an (nxl) vector of observa- tions on a response variable. Then principal components regression can be written as (15) where Q is the (pxp) matrix whose ith column is equal to gi' the ith eigenvector of X'X. Another way of writing 54 ( 15) is (16) where A is the (pxp) diagonal matrix whose ith diagonal A~, entry is the ith eigenvalue of X'X. Note that (16) may be transformed into the ordinary least squares solution by premultiplying it by Q, i.e., A §_ = QA since (X'X)-l -1 Q' X 1 Y1 = QA-lQ'. ( 17) It may happen that some of the eigenvalues are very nearly zero and should therefore be eliminated from the model, since they account for almost none of the predictor variance. eigenvalues Remembering that the are in order of decreasing size, A can be partitioned as follows: (18) where the last (p-r) eigenvalues are assumed to be zero. We also partition Q: (19) where the last (p-r) eigenvectors correspond with the zero-valued eigenvalues. 55 Since A is by assumption a zero matrix, a -p-r generalized inverse of (X 1 X) --r -1 = Q A -lQ (XIX) - - r -r-r (Marquardt, 1970). (X 1 X) -l --r -r is I (20) This ma.y also be written = r ~1/A.Q.Q, 1 1-1'--1 I ( 21) where -Q. is as before the ith eigenvector of Q. 1 To see that (21) is correct, recall that q .. 
is the jth term of 1] the ith eigenvector. (X' X) --r -1 = Then QA-lQI (22) tvhere Q qll q21 qrl q12 q22 qr2 qlp q2p qrp 0 0 l/A. 2 0 = and 1/A.l 0 -1 A = 0 0 1/A. r 56 2 l:q.l(l/1...) I"' f" 1 {K•X) 1 1 -1 r L:q.2q.l(l/A.) r 1 1 1 1 tq.lq.2(1/A.) 1 1 1 1 L:q. q. (1/A.) 1 1 1 1p 1 tq~ (1/A.) 1 1 2 1 L:q. q. (1/A.) 1 1 2 1P 1 r ( 2 3) r 1 = r L:q. q. (1/A..) 1 1p 1 2 1 q.1pq.l(l/A.) 1 1 tq~ (1/A.) 1 1p 1 2 qil qilqi2 qilqip qi2qil 2 qi2 qi2qip f1 {1/A.) 1 (24) q.1pq.l 1 = qipqi2 2 qip r L:l/A .Q.Q. I . 1 1-1-'-1 The estimator based on the first r eigenvalues would thus be A B = . Q A -r-r -1 ( 25) Q 'X'y. -r - - In practice, it would not be likely that the last (p-r) eigenvalues would be precisely zero. This means that some method of deciding whether or not a particular eigenvalue is zero is needed. Specifically, some constant c should be set so that X'X may be said to be of rank r 57 if the first r components account for all but some small fraction of the variance, i.e., if' The advantage to (25) as an estimator is chiefly its simplicity. Since the eigenvectors are all orthogonal to e'ach other, they can be eliminated from the model through a simple stepwise procedure. tates the decis~on Also (26) facili- of how many variables should be dropped. Marquardt (1970) considers the possibility of using "fractional ranks." His suggest.ion is to eliminate those eigenvalues which are definitely considered to be zero and to reduce the inflation caused by inverting the smaller nonzero eigenvalues by adding a small constant to the denominator of each term in (21). To detail how this works, suppose that the rank of X'X is considered to be greater than r but less than r+l. Specifically, suppose it is set to r+f, where f is some number between zero and one. Marquardt's·suggested estimator is ( 27) where, from (21) and (26), = r L(A.+k), 1 l ( 28) 58 or, equivalently, {29) k == f.Ar+l/r Marquardt's fractional rank estimator may be seen as a compromise between principal components regression and ordinary ridge regression, with part of the effects of multicollinearity being eliminated through one nique and part through the other. tech~ All the same, it is not clear, despite {28), that the concept of "fractional ranks" is in itself particularly meaningful. To illu- strate this, consider the case where f is approximately equal to one. The result could very well be two dissimi- lar estimators for models whose rank was in theory almost identical, using {27) for one estimate and {25) for the other. 59 APPENDIX II RIDGE ANALYSIS AND RIDGE REGRESSION A frequently occurring problem in multivariate analysis is to find the stationary values of a function f(x 1 , ... ,xp) of p variables x , ... ,xp subject to restric1 tions on the x. such as g.(x.l_ , .•• ,x) 1 J p = 0 (j = l, •.• ,n). The method by which the problem is solved is that of Lagrange multipliers. = F Letting • f-L~... g., 1 J J (1) w·here A. , ••• , A.n are unknowns, differentiate (1) with 1 respect to each x.1 and set the results equal to zero. This yields p equations ·ap;ax. 1 = 1 "1 = 1, .•• , n) • 3f/3x.-~A..ag./3x. J ) 1 =o (i = l, ••• ,p). (2) Additionally, g. == 0 J (j ( 3) is a solution of (2) and (3) after eliminating the A.j. Let 60 2 a = M(x) M(x , • • • , x ) p 1 F = (4) be the matrix of second order partial derivatives, and let M(~) = M(a 1 , .•• ,ap) be the resulting after the solu- tion {a , .•• ,a) has been substituted into (4). 
p 1 Then if M(a) is (a) positive definite, i.e., y'My>O, (b) where negative definite, i.e., y'My<O, y' = (yl, .•. ,yp) is any (lxp) real vector, the function f(x , ... ,xp) achieves 1 (a) a local minimum at x = a, (b) a local maximum at = a, respectively. X To see tnis, suppose F is expanded as a Taylor series of partial derivatives about a and that h represents a vector (h , ... ,hp)' of increments hi. 1 Then, recalling that the first partial derivatives are zero at x .., a, (5} 61 where o(h 3 ) represents the higher order terms of the series. It is a feature of Taylor series that the effects of the higher order terms can be made arbitrarily small if the increments h. are small enough. 1 Since the first order term is equal to zero, that means that the only term to be con.sidered is the second order term greater than zero if M(a) ~h 'M(a) h, which is is positive definite. plies that, for all small h, F(a+h)>F(a). This im- Therefore, if h is such that the restrictions gj .= 0 are fulfilledr (6) and f(a) is a loc~l for the case where minimum. M(~) Similar reasoning holds is negative definite. To apply this to quadratic surfaces, consider the second order response surface in p variables b , •.• ,bp 1 given by (7) where the point (0, .•. ,0) is the origin of measurement for b , ..• ,bp. 1 The problem is to find the stationary points of a sphere centered on the origin and having radius R, in other words, to find the stationary points of y subj ec·t to 62 (8) Using the method of Lagrange multipliers, set F = y-Ag, where A is the multiplier. Then, taking the first derivatives with respect to the b., rearranging, 1 and dividing by 2, this gives (see equation (2)) (9) • • ~s • • • • 8 .• • • 1 P b 1 +~s 2 p b 2 + ... +(s pp -A)b p = "o --· ks 2 p or, in matrix notation, (S-AI)B = (10) -~~ where s sll ~sl2 ~s ~sl2 s22 ~s lp 8 2p s2 11 ' s = = ... ~slp ~s 2p ,,' spp bl .. = sp r . B b2 (11) = bp One method of finding B would be to solve (8) and (9} for b , ••. ,b p, and A. 1 Sometimes, however, the value of R is not of great importance, in which case it is 63 possible to regard R as variable and A as fixed. Thus A can be inserted directly into (9), which can then be solved for the b.'s, after which R can be computed from l ( 8) • Suppose, that, for some R, there is a multiplier A* which gives as a solution to (9) the vector B*, and that. A*>d., all i, where the d.l are the eigenvalues of S. l Eigenvalues are the roots of the equation ls-drl = o, (12) and, if A*<d., all i, then, for an arbitrary (pxl) vector l = z'M(B*)z --~ -- where d.-A*>O. l y = f(B*) = z' (S-A*I)z -- -- z'Sz-A*Z'z -- -- = z'z(d.-A*), -- l (13) Therefore, M(B*) is positive definite and is a local minimum. To derive ridge regression from· this, let f X X 11 x12 xlp ryl bl x21 x22 x2p y2 b2 = I X nl xn2 xnp Y. = I Yn B = (14) bp where the x .. and y. are known, and the b. are to be lJ determined. l The problem is to minimize l (Y-~B) '(!-X~) 64 - = ~(y.-blx.l... -~ x. )2 1 1 1 p 1p 1'\ 2 2 2 E(y. -b 1 x. y.- ... -b x. y.+b x. + ... +b x. 1 1 11 1 p 1p 1 1 1 1 p 1p +b 1 b 2 x.1 1 x.1 2 + ... +b p- 1 b p x.1 ( p- l)x.1p ) = " y.· 2 -b n Ex. y.- •.. -b Ex. y.+b 2" Ex. 2 + ... 1 1 11· 1 _1 Pl 1 1 1 1 l 1 1 "' +b 21'\ Ex. 2 +b b Ex. x. + ... 1 21 l 1 1 2 p 1 lp " 1 ( p- l)x.1p +b p- 1 b P1Ex. ( 15) subject to 2 2 bl + ••• +bp -R n If LX .. y. 1 1] 1 = (16) 0 , = s.J and EX .. X. k = s]. k; then ( 15) is 1 1] ]_ the same as (7), except that there are twice as many cross-product terms sjkbjbk and s .. 
APPENDIX III

A DERIVATION OF THE MEAN SQUARE ERROR ESTIMATOR FOR RIDGE REGRESSION

Consider an (n×p) matrix X of n observations on p variables and a vector y of n observations on one variable. Rao (1970) gives the following result: using the model

     y = Xβ + e,                                                    (1)

where β is a (p×1) vector of unknown parameters and where E(ee') = σ²I_n, suppose L'y is an estimator of r'β, where r is (p×1). Then

     E(L'y - r'β)² = σ²L'L + ((X'L - r)'β)².                        (2)

Let r be the ith column of I_p, the (p×p) identity matrix, and let L be the ith column of the matrix X(X'X + kI)^{-1}. Then

     E(L'y - β_i)² = σ²(X(X'X + kI)^{-1})_i'(X(X'X + kI)^{-1})_i
        + ((X'X(X'X + kI)^{-1})_i - r)'β β'((X'X(X'X + kI)^{-1})_i - r).   (3)

Since ((X'X(X'X + kI)^{-1})_i - r)'β is a scalar, this is equivalent to

     E(β*_i - β_i)² = σ²(X(X'X + kI)^{-1})_i'(X(X'X + kI)^{-1})_i
        + β'((X'X(X'X + kI)^{-1})_i - r)((X'X(X'X + kI)^{-1})_i - r)'β,   (4)

where β*_i = L'y. Equation (4) is the ith term of the mean square error of β*. For the total mean square error,

     E(β* - β)'(β* - β)
        = σ² trace[(X'X + kI)^{-1}X'X(X'X + kI)^{-1}]
          + β'(X'X(X'X + kI)^{-1} - I)(X'X(X'X + kI)^{-1} - I)'β
        = σ² Σ_i λ_i/(λ_i + k)² + k²β'(X'X + kI)^{-2}β,             (5)

where λ_i is the ith eigenvalue of X'X. Similarly,

     E(β* - β)'X'X(β* - β) = E(Xβ* - Xβ)'(Xβ* - Xβ)
        = σ² trace[X'X(X'X + kI)^{-1}X'X(X'X + kI)^{-1}]
          + k²β'(X'X)(X'X + kI)^{-2}β
        = σ² Σ_i λ_i²/(λ_i + k)² + k²β'(X'X)(X'X + kI)^{-2}β.       (6)

Suppose Λ is the (p×p) diagonal matrix whose ith diagonal entry is λ_i and whose off-diagonal entries are zero, and let Q be the matrix of eigenvectors such that

     Q'X'XQ = Λ.                                                    (7)

Define α as α = Q'β, and let α̂ = Q'β̂. Then the expectation of α̂ is E(α̂) = Q'E(β̂) = α, and the variance-covariance matrix of α̂ is σ²Q'(X'X)^{-1}Q = σ²Λ^{-1}, which means that α̂_i has mean α_i and variance σ²/λ_i, where α̂_i, α_i, and λ_i are the ith elements of α̂, α(= E(α̂)), and Λ, respectively. Transforming (5) by Q gives

     E(L_1²) = Σ_i (σ²λ_i + k²α_i²)/(λ_i + k)².                     (8)

Similarly,

     E(L_2²) = Σ_i λ_i(σ²λ_i + k²α_i²)/(λ_i + k)².                  (9)
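Equation (8) is easily evaluated once the λ_i, α_i, σ², and k are supplied. The Fortran sketch below is an illustration only (in practice the α_i are unknown, so such a computation is possible only for artificial data); it traces E(L_1²) as k increases, using the two-component system of Appendix V as if its least squares coefficients were the true α.

C     ILLUSTRATIVE SKETCH OF (8): TOTAL MEAN SQUARE ERROR OF THE
C     RIDGE ESTIMATOR AS A FUNCTION OF K, FOR KNOWN ALPHA.
      PROGRAM MSEK
      REAL EV(2), ALPHA(2)
      DATA EV /1.98, 0.02/
      DATA ALPHA /2.5757576, 5.0/
      S2 = 12./33.
      ZK = 0.
      DO 20 J = 1, 11
      EL1 = 0.
      DO 10 I = 1, 2
      EL1 = EL1 + (S2*EV(I) + ZK*ZK*ALPHA(I)**2)/(EV(I) + ZK)**2
   10 CONTINUE
      WRITE(6,*) 'K =', ZK, '  E(L1**2) =', EL1
      ZK = ZK + .01
   20 CONTINUE
      STOP
      END

At k = 0 the sum reduces to σ²Σ_i(1/λ_i), which is large whenever some λ_i is near zero; even a small positive k reduces it sharply.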
APPENDIX IV

HEMMERLE'S DERIVATION OF THE EXPLICIT SOLUTION

Source: William J. Hemmerle, "An Explicit Solution for Generalized Ridge Regression," Technometrics 17 (1975), 309-314.

Consider the linear estimator defined by

     α* = (Λ + K)^{-1}Q'X'y,                                        (1)

where X is an (n×p) matrix of independent variables, y is an (n×1) vector of response observations, Λ is the diagonal matrix of eigenvalues of X'X, Q is the matrix of eigenvectors of X'X such that Q'X'XQ = Λ, K = diag(k_1,...,k_p) is the matrix of biasing parameters, and σ̂² is the estimated variance. For convenience, let B denote the (p×p) diagonal matrix whose ith diagonal entry is

     (Q'X'y)_i,                                                     (2)

and let A_j denote the diagonal matrix whose ith diagonal entry is

     α*_i(j),                                                       (3)

where α*_i(j) is the jth iteration on α*_i of the Hoerl-Kennard process described in Chapter 3. If the least-squares estimate can be considered to be the 0th iteration, then

     A_0 = Λ^{-1}B.                                                 (4)

Further, the (j+1)st step in the iterative procedure may be described as

     α*_i(j+1) = (Q'X'y)_i/(λ_i + σ̂²/α*_i(j)²),                    (5)

or

     A_{j+1} = (Λ + σ̂²A_j^{-2})^{-1}B.                              (6)

Now (6) can be written as

     A_{j+1}^{-1} = B^{-1}(Λ + σ̂²A_j^{-2}),                         (7)

and then

     A_{j+1}^{-1} = B^{-1}Λ(I + σ̂²Λ^{-1}A_j^{-2}).                  (8)

Letting

     D = σ̂^{-2}Λ                                                    (9)

gives

     A_{j+1}^{-1} = B^{-1}Λ(I + D^{-1}A_j^{-2}),                    (10)

and an expression for A_{j+1}^{-2} is given by

     A_{j+1}^{-2} = (I + D^{-1}A_j^{-2})ΛB^{-2}Λ(I + D^{-1}A_j^{-2}),   (11)

since all of the matrices involved are

     diagonal and commute.                                          (12)

Thus

     A_{j+1}^{-2} = Λ²B^{-2}(I + D^{-1}A_j^{-2})².                  (13)

Multiplying both sides of (13) by D^{-1} gives

     D^{-1}A_{j+1}^{-2} = D^{-1}Λ²B^{-2}(I + D^{-1}A_j^{-2})²,      (14)

so that if

     E_j = D^{-1}A_j^{-2},                                          (15)

and noting that E_0 = D^{-1}A_0^{-2} = D^{-1}Λ²B^{-2}, the iterative procedure is reduced to the simple formula

     E_{j+1} = E_0(I + E_j)².                                       (16)

Assume that α̂_i ≠ 0 for all i and that the iterative procedure is convergent, such that

     lim E_j = E*.                                                  (17)

From (16) and (17),

     E* = E_0(I + E*)²,                                             (18)

or

     E_0E*² + (2E_0 - I)E* + E_0 = 0.                               (19)

Now (19) consists of p equations of the form

     (e*)² + (2 - 1/e_0)e* + 1 = 0,                                 (20)

where e_0 and e* are scalars. Solving (20) for e*,

     e* = [(1 - 2e_0) ± √(1 - 4e_0)]/(2e_0)   (e_0 ≠ 0).            (21)

To show whether the plus or the minus sign should be selected, first note that (16) consists, like (19), of p separate expressions of the form

     e_{j+1} = e_0(1 + e_j)²,                                       (22)

where e_0, e_j, and e_{j+1} are scalars and the subscript j is used to denote the jth iterate. To show that the procedure defined by (22) converges for e_0 = 1/4, observe that for e_0 > 0,

     e_1 = e_0(1 + e_0)² > e_0,                                     (23)

and, since the right side of (22) increases with e_j,

     e_j > e_{j-1} implies e_{j+1} > e_j,                           (24)

so that (e_j) is a monotonically increasing sequence.                (25)

For e_0 = 1/4, let

     g_j = 1 - 3/2^{j+3}.                                           (26)

Then √e_0 = 1/2 ≤ g_0 = 5/8, and if √e_j ≤ g_j, then from (22)

     √e_{j+1} = √e_0 (1 + e_j) ≤ (1/2)(1 + g_j²)
              = 1 - 3/2^{j+3} + 9/2^{2j+7}                          (27)

     ≤ 1 - 3/2^{j+4} = g_{j+1},   all j.                            (28)

Consequently, √e_j ≤ g_j < 1 for e_0 = 1/4.                          (29)

Combined with (25), this gives

     e_0 < e_1 < e_2 < ... < 1,                                     (30)

a monotonically increasing sequence of real numbers bounded from above, so the procedure converges. From (21),

     e* = lim e_j = 1 for e_0 = 1/4.                                (31)

To extend this, suppose that the procedure defined by (22) converges for e_0, and that 0 < e'_0 ≤ e_0. Then for the sequence (e'_0, e'_1, ...) we have e'_j ≤ e_j, so that this sequence converges to (e*)' ≤ e*. From this fact it may be seen that the iterative procedure converges for 0 < e_0 ≤ 1/4.

At this point it is necessary to choose between the plus and minus signs in (21). Note that for 0 < e_0 < 1/4, the plus sign gives

     [(1 - 2e_0) + √(1 - 4e_0)]/(2e_0) > 1,

since (1 - 2e_0) + √(1 - 4e_0) - 2e_0 = (1 - 4e_0) + √(1 - 4e_0) > 0, which contradicts the fact that e* ≤ 1. Consequently the iterative procedure converges, whenever 0 < e_0 ≤ 1/4, to

     e* = [(1 - 2e_0) - √(1 - 4e_0)]/(2e_0).                        (32)

To show the equivalence of Hemmerle's results to those of Chapter 3, note from (9) and (15) that λ_i e_i(j) = k_i(j), where the subscript j indicates the jth iteration, and that e_i(0) = σ̂²λ_i/(Q'X'y)_i². Then, from (32),

     k_i = λ_i e*_i = λ_i[(1 - 2e_i(0)) - √(1 - 4e_i(0))]/(2e_i(0))
         = [(Q'X'y)_i² - 2λ_iσ̂²
            - √((Q'X'y)_i⁴ - 4λ_iσ̂²(Q'X'y)_i²)]/(2σ̂²),            (33)

which is the same as (3.26).
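The reduction to a single step can be checked numerically. The short Fortran sketch below, an illustration only, iterates (22) from a chosen e_0 in (0, 1/4] and compares the limit with the closed form (32); the two agree to within rounding error.

C     ILLUSTRATIVE CHECK OF HEMMERLE'S CLOSED FORM (32) AGAINST
C     THE ITERATION (22), FOR A SINGLE COMPONENT.
      PROGRAM HEMCHK
      E0 = 0.20
      E = E0
      DO 10 J = 1, 100
      E = E0*(1. + E)**2
   10 CONTINUE
      ESTAR = ((1. - 2.*E0) - SQRT(1. - 4.*E0))/(2.*E0)
      WRITE(6,*) 'ITERATED VALUE   =', E
      WRITE(6,*) 'CLOSED FORM (32) =', ESTAR
      STOP
      END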
APPENDIX V

THE SOLUTION OF MALLOWS

Let Q be the matrix of eigenvectors of X'X such that Q'X'XQ = Λ, the full rank matrix of eigenvalues, and let z = Q'X'y. The ith term of z is denoted by z_i. The ordinary ridge solution of Mallows (1973) is that k which minimizes

     C_k* = Σ_{i=1}^{p} [k²z_i²/(σ̂²λ_i(λ_i + k)²) + 2λ_i/(λ_i + k)],   (1)

where σ̂² is the residual mean square error and p is the rank of Λ. To minimize C_k*, differentiate it and set the derivative equal to zero:

     (1/2)dC_k*/dk = Σ_i [kz_i²/(σ̂²(λ_i + k)³) - λ_i/(λ_i + k)²] = 0,   (2)

from which Mallows gets as a solution

     (1 + k)/k = α̂'α̂/(pσ̂²),                                       (3)

where α̂ = Λ^{-1}z is the least squares estimator in canonical form, and then

     k = [α̂'α̂/(pσ̂²) - 1]^{-1}.                                    (4)

Due to an inability to derive (3) from (2), it was decided to test (4) on some data to see whether it yields an optimal solution. The data used are from the example of Marquardt (1970):

     X = (√2/10) [ 3  4 ]        y = [ 1 ]
                 [ 4  3 ]            [ 2 ]
                 [ 5  5 ],           [ 3 ],

from which

     X'X = [ 1      49/50 ]      X'y = [ 26√2/10 ]
           [ 49/50  1     ],           [ 25√2/10 ],

     Q = (1/√2) [ 1   1 ]        z = Q'X'y = [ 51/10 ]
                [ 1  -1 ],                   [ 1/10  ],

     Λ = [ 99/50  0    ]         α̂ = Λ^{-1}z = [ 85/33 ]
         [ 0      1/50 ],                       [ 5     ],

     σ̂² = y'y - α̂'Q'X'y = 12/33.
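As a check of the arithmetic before turning to the program, (4) can be evaluated for these data by direct substitution:

     α̂'α̂ = (85/33)² + 5² = 31.6345,     pσ̂² = 2(12/33) = .7273,

     k = 1/(31.6345/.7273 - 1) = 1/42.498 = .02353.

This is the value printed under 'MALLOWS SOLUTION' in the output described below.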
-----------lo308d726829 -3,3167236876 CK= DE~IVATIV£: ij,Q432712338 K~ CK= ----oEr.iVH!V£:- 3o2651E85561 --1.2!t5918G'.IIt3'--~-----· O,il51071~985 K= ---c-K=~------ ---~; OE~Iv~TIIIE= - - 1 ( = - .. - - --- -~ 2'+CJ 0 & 3&&u s------ -u.~tC79311t5~8 -- CK= DE~Iv~rrvF= J 5 3990 8422 3•2'+53816285 ~0~ -a.:9123~t953C K: 0,0542754803 3.21t51DO lj199 CK= --------u£1\IVA TIVE=- ~'"J ;c 071t CJC 366-z---K= -'-~-cK=-----------n£~IvAiiVE= --K=------- K= CK= 3; 2450'l8276~tc:----- -~.OOG0&02719 ---- J.:542778081 _ _ _ _ _ _ CK= OE~·IVAT O.OS4277aQ80 3.21t5J'J327&2 I VE= --oH.HATIIIE=-- -a. cot oo o0040 ·------- - - - - - - - - - - a.Q542776Ddi 3,2450'382762 a~oocooooooo ____ 83 •••••••••• ---RIDGE I~ACC: ............... - O,!rG50CHOOC!- - - - - - - - - - - - - - - - 3.&501914231 DERivATivE= ·23,65&7835528 -I(: . CK= --- ------ - -------------- lo0l~UCJ0000 K: CK: 3.4769730343 OERiuATI~E=· •12.4462E037f2~ J.Ot5CCJOOLO 7990 - K= 3~382412 CK= LlEI\IIIAfi\1£= --K=---- ~j.,2CCOOCOOO. CK= OEFhAT I \If= K~ CK= --DE~ -7.~&71191!7023 -· -- ----···- 1.3273625Ja ·'-· 2 2 2'>3125 (j1 J,Q25GG~OOOO 3. 293950 3&32 IllATivE=-- -2.oG2&1tu4824 - - - - - - - - - - - · · - - K= ~.OJOC~uGOCC -.-cK=·----------J;2731967a3i- - - - OEFivArrvE= ·t.&258422741t - - -x=- -- ------- ---- o.o350C :JDOu-c---------------- CK= u£~1\IATIVE= - J.260250C355 •l,OC81C76510 K= Q,Q40000G000 CK=. l,252339C953 -oE;;. Ill ATI \I C.-=-·--- l~ 6" 1 0 9 0 07 3 5 - - - - · - - - - - - . - - - K= o.or.5oc a oooo - 3,21,78033457-- ----·--·-0.3226124537 \IE= ---eK= ---· -OE~IvAfi ·a.osoooooouo -· --·1(;:: CK= ilEUIIATIIIE= K= ·cK= ---OERIVATI oJE = ----------------- 2 4 58 E>3 9 7 T I T - - - - - - - - - - - - u.1273924092 ---·~; OE~J~ATivE= - - K=-- ---- o.cssocooaco 3.2451115981 a. a183227520 J.060C00~000 K= --cK:----- 3. 2" 5 0 1 382 2 3 -J.1258t,53393 -- ----- -- Q;0651!1Tli!:01ltl----------- CK= OE~IvAfiVE= 3,21t75805221 J.212&561102 0,07CODOOOOC J.25GOt2f4&5 ------oEt~!VATIVE= ---J ~-28149808 BT _ _ _ _ _ K= CK:: ~= Q,G7500000QC ---r;l(.:-----. --3.2531722q7q 0.3368308339 --x=-------CK= O£~IIIATIVE-= ---------- -··- K= ~= 3.25681211+07 u,l879929754 -·-··----------------- ·---------- CK: -··- on:IiiAiii/T=---c~<= ··n-.us~acouourr -- ---·· -· ·-- OE~IvArrv£= O.OS50COOOCO 3. 2 0 0 91 2 6577 iJ .1+312 8 2 52 00 _ _ _ _ _ - - - - - - - - - - J.09LOCOCOQO 3. 26Slt23t1Zr------- - - - - - - - - - - - 0.~7G3C2~759 84 K= CK= u,0'350COOOCC 3. anc 8 29LoO OEo<lVATIVE= .;.5~&18?8010 K= CK= 0,1CC CLilCCOO 3, 2755;)95813 ---- DEf\IVAftllf= K= --cK= -oE"IvATIVE= --1(::- - ~.i3Cl737051\'3 J.1050CHOOC 3~ 28t0'3721~3 ;,5715355768 u.11~orouooo CK= 3,Zd6q&511575 OEriJATII/1:.= ~.&~1'3'3973~2 K: CK= Go115000olOCO J, 2'3 313 3 7792 OEKIVATI VE= ~.63143Jftlt20 K= 0,120Gt00000 '3, 2'3'35916366 - - - - - - - - - - - - - - --0.660G58309Q -CK=·-OE~-IVATIIIE= -K= CK= ~ .tz5or o oaoo ------------------ OEUIIATliiE= ~. 3063327'337' o.~660J&2~+37 K= li .ucuc 0 0000 CK= 3.313-~508032 DE<i.IVATIIIE= 0.715Lt856155-- - - - - - - - - - - - - K= J.13500000<10 3. 320&li10348" - - - - - - - - - - - - - - 0. 71+24'32 !15'>4 --cK:o· OEKIVATIVE= K= - J.tftc~cooooo- CK= OEr<IIIATIVE= 3.3281'3'33'362 o. 76'3120 8631 K= G.145uOOOO~O CK= 3.3360223411 ----uEi'I'JATIVE= ---u, 7'354151577 __ _ K= ---cr:·-0E~IVA'l11E= --- 1(: -------- -- - -- - CK= 0.1500000000 --T;344H56'3118--------0.6214063810 ·a~ 1550GOOuilU _ _ _ _ _ 3,35244'35~29 DE '<I VA TI 1/f= c. 8471236370 K= J,16CQ~OOOOO CK= 3.3610~82990 -- -11HIVATIVE="-- -J. 
872576 %85 _______ K= ---cK= OE "IV ATI VE= --1<=--- CK= OE~IVATtllf= K= CK= ---- oE;; rvnrr vE = o.16sarooo:;o l. 3& '39C a 2s 57 - - - - - - - - - - - . - - o.s<J777'l2036 ---0.1700[000~0-------- 3o379C0l06'37 J.'l22737337& fl.1750C00i100 3,3883542311 J,qlo71t5'557C6 85 ~. ta J cr a GG.:. o -3, 3'j]'j!:1388'l Jd71'1360854 K= --cK=--- --0£1-:Ii/AiiVE= --,.:=---- ---- - ),1S5HOGOn 3.4077'121&50 t),9'l61i"'l&338 CK= OEt<IVAT I liE= -----------·----· --- K=. CK= --DEi<I\IATIVE= ~.l'ltODCCDolQ 1o417~741'l10 -- 1.G2C185'l76'1-- ------------ - - - - - - K= 0.1'l50COCCOO 0EFIIIATIVE= 1ou43'l542151 ---cK=-------- --3,4281'35~9~9· ---K= -- ---- -- i1,20CUtDGOOO 3.'+187524772 t,Q&7463u353 CK= DEKI.ATIVE= K= CK= -----oERI\/t.T I liE= o.zosoeo~uoo 3.1t<.'l51t3'l!oll4 - 1. ~'10770 8'15"3 _______ --- - - - - - - - - K= j,210~C.OuOOC ot:;;_rlfATIIIc= 1~ ""~ 5£; 70865- - -1.11381&15'11 ---·cK= --- ilo215CCOOO(i0- --------- -----1(=-- CK= 3,1t718l'l'+575 1.13b6171'l63 DEtdvATIVE= K= Oo2ZLOCOUOGD CK= j,'ttl32986111 1. 15 9172 Lt5 3 8 - - - - - - - - - - - - - ---DHIVAfiVE= K= --"CK=' --- ---------- OEklVATIIIf= - K=---- ---CK= DE~IIIATrvE= K= CK:: --oEt:IVAiiiTE= J.2z5ooauooo 3~ lt'l5GC 201125 _ _ _ lol811t805057 -n,23GOGutlO~O 3o5V&'l27 3'130 lo2o35,.00876 J,2350C OGOuO 3.51'lC7Zu524 -1.22535S118g---------------- K= 0.2t,QQ(~00te ---cK=----------- 1,531'+3356!:7----------JE~IVATIVE= 1•24&'1(97151 - - , , - - - ----- ---a.2r.50COOOilO CK= 3,5r.4JC94097 O~~IVATIVE= 1o266218l942 K= Co250G00~000 CK= l,SS£>7970&56 1. 211 '3275 u77 3 - - - - - - - - - - - - - - - - - - - ot::.:IVATIVE= K= -·cK=- ------OEP.IIIATI'JE= ~.2550CJOOGO 1~ 5E.'l794G717 - - - - - - 1· 5H il6JJ,58 86 1(: ~.~ouul CK= OEFII/IITIVE= .J, 5 szqq7 A<.TII t. 330&3.H.J58 1(: o.Z&5oto oooo CK= J,sq&<+c51l%7 1,350931+3311- K= --cK: 0.270CCOGOJ0 .J,6Ui0l5Eil571. HQq8J'l537 -- OEPii/ATI'JE: oE;..IvATIVE= ---·11:= u Ju"l! ---- ~.275aoocorD-­ CK: J.f-231124 7~&8 1.3'l07B24554 OEk!VATIVE= K= CK= ~.2800COOOOO 3.&3783050'!7 1. r. 1u 330 4~+&6 OEFIVATIVF:: K= . --.:;K:: ~.2850COO,OC ~.552C305132 ---------------------- OEPIVATlVE= t.<tZ%2861175 "'" 1),2'lilOCODOOU ------------- -------- K= CK= 0.295•)000000 3.&8100 32486 1. rt&n 7q6<.3r · ---------------- - CK: OERI~ATIIII'= L\E?IVATIVE= K= --c ~~:=-- - ·-··- .. OE:IHI/ATIVE= -------,;: =OE~IVAT 3,<;&&422253'! 1.446&760771 o. ~00000 00110 - 3;&q57711JZ1i6 _________ _ 1·46&0345352 G. lJ 50 r; 0 OOC 0 ------ ------ CK= I liE;: K= cK= OE:F"f\1 nr IVE: 1(: 0,3100COOOG0 3,7258~70911 -1,5 22~>C g ~z;;c;-·---- ~.~15ec~onoo 3.7~ll71lSOIJq---- t.S4J2322H& ---·1{:-- ~.32DOG0Gu~~- CK= 3. 756&&0 '!323 IJEi':II/A:I\1!:= 1.5571113'3833 K= CK= 3. 772325'!825 OE.:<IvATIVt;, K= ---cK: - --- -OH.IvATII/E= ------------- -------- ~. ~30000GC00 j , 7 881632652------------·- t.snc6u80u3 ~.335JCOOO~O 3,11G!t17J4118 1,C:,fj'l12'!3064 O'Oh<IVA~IIIf: ---------- 1,5751562754 CK= K= CK= -- -- -----------·---- J,H500ol.lGGO w:= DEl< IVA • I Iff: . !..7107231211 1.5043<.400'!6 --cK: DERIVIITIVE = -- ·~-.,_!::2------------ a.~ .. corooryco 3, '12C3<+50707 1.E>zsn359qs -·-----· 87 K: O. 3<t50( ~C.OGO 3; B36E. 
The first result, labeled 'MALLOWS SOLUTION', is Mallows' optimum from equation (4). The second set of results, labeled 'NEWTON-RAPHSON ITERATION', is from a Newton-Raphson iteration based on

     k_{j+1} = k_j - C'_{k_j}*/C''_{k_j}*,                          (5)

where the subscript denotes the number of iterations and where C'* and C''* denote the first and second derivatives of C_k* with respect to k. The third set of numbers, labeled 'RIDGE TRACE', gives the results of a ridge trace beginning with k = .005 and advancing through increments of .005 to k = .5. Listed are the k ('K = (value)'), the C_k* statistic ('CK = (value)'), and the derivative of the C_k* statistic ('DERIVATIVE = (value)').

Mallows' theoretical optimum gives a k of .02353, for which C_k* = 3.30216 and dC_k*/dk = -2.99370. The Newton-Raphson iteration gives a k of .05428, for which C_k* = 3.24510 and dC_k*/dk = .00000. The ridge trace indicates a minimum somewhere between .05000 and .05500, thus agreeing with the Newton-Raphson iteration. No other local minimum is in evidence.
Under these circumstances, it is hard to see how equation (4) can be correct.

APPENDIX VI

EXAMPLES OF ORDINARY RIDGE SOLUTIONS

The artificial data of Marquardt (1970), given in Appendix V, can be used to illustrate some of the ridge estimators discussed in Chapter 4. Included here are the Hoerl-Kennard conservative estimate (1970A), the solution of Hoerl, Kennard, and Baldwin (1975), the Hoerl-Kennard iterated estimate (1976), the Hocking, Speed, and Lynn solution (1977), an iterated form of the Hocking, Speed, and Lynn solution, and the solutions of Lawless and Wang (1976) and McDonald and Galarneau (1975).

Four of the estimates can be given in direct algebraic form. The Hoerl-Kennard conservative estimate is

     k = σ̂²/α̂_max².                                               (1)

Substituting the data of Marquardt (see Appendix V),

     k = (12/33)/(5)² = .014545,

from which

     α* = [ (1.98/1.994545)(85/33) ] = [ 2.55697 ]
          [ (.02/.034545)(5)       ]   [ 2.89474 ]

and

     β* = Qα* = [  3.85494 ]
                [ -0.23883 ].

The estimator of Hoerl, Kennard, and Baldwin is

     k = pσ̂²/α̂'α̂                                                 (2)
       = 2(12/33)/((85/33)² + 5²) = .022990,

from which

     α* = [ 2.54619 ],     β* = [ 3.44525 ]
          [ 2.32613 ]           [ 0.15561 ].

The Hocking, Speed, and Lynn estimator is

     k = σ̂²(Σ_i λ_i²α̂_i²)/(Σ_i λ_i²α̂_i⁴)                          (3)
       = .054751.

This yields

     α* = [ 2.50645 ],     β* = [ 2.71827 ]
          [ 1.33777 ]           [ 0.82638 ].

The Lawless-Wang estimator is

     k = pσ̂²/Σ_i λ_iα̂_i²
       = 2(12/33)/(1.98(85/33)² + .02(5)²) = .05333,

giving

     β* = [ 2.73783 ]
          [ 0.80933 ].
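The four algebraic choices can also be sketched compactly in code. The following Fortran fragment is an illustration written for this appendix, distinct from the program listed below, and its variable names are arbitrary; it evaluates (1), (2), (3), and the Lawless-Wang rule from the eigenvalues and the transformed least squares coefficients.

C     ILLUSTRATIVE SKETCH: THE FOUR ALGEBRAIC CHOICES OF K.
      PROGRAM FOURK
      REAL EV(2), A(2)
      DATA EV /1.98, 0.02/
      DATA A /2.5757576, 5.0/
      S2 = 12./33.
      P = 2.
C     HOERL-KENNARD CONSERVATIVE, EQUATION (1)
      HK = S2/AMAX1(A(1), A(2))**2
C     HOERL-KENNARD-BALDWIN, EQUATION (2)
      HKB = P*S2/(A(1)**2 + A(2)**2)
C     HOCKING-SPEED-LYNN, EQUATION (3)
      T1 = EV(1)**2*A(1)**2 + EV(2)**2*A(2)**2
      T2 = EV(1)**2*A(1)**4 + EV(2)**2*A(2)**4
      HSL = S2*T1/T2
C     LAWLESS-WANG
      WLW = P*S2/(EV(1)*A(1)**2 + EV(2)*A(2)**2)
      WRITE(6,*) 'HK  =', HK
      WRITE(6,*) 'HKB =', HKB
      WRITE(6,*) 'HSL =', HSL
      WRITE(6,*) 'LW  =', WLW
      STOP
      END

Run on the Marquardt data, the four values printed are .014545, .022990, .054751, and .053333, matching the hand computations above.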
The other three estimators are iterative and cannot be given in direct algebraic form. The following pages list a Fortran program which calculates the seven ridge solutions mentioned in this appendix, and give the results of the run.

[The program listing and the results of the run occupy three pages in the original. The printed values for the four algebraic estimators agree with the hand computations above; the run also reports the iterated Hoerl-Kennard, iterated Hocking-Speed-Lynn, and McDonald-Galarneau solutions.]

APPENDIX VII

THE SIMULATION STUDY OF WICHERN AND CHURCHILL

The Method

The Wichern-Churchill (1978) simulation study of ordinary ridge estimators uses the method of McDonald and Galarneau (1975).
It assumes a linear model

     y = Xβ + e,                                                    (1)

where X is an (n×p) matrix of observations on the predictor variables, y is an (n×1) vector of observations on the response variable, β is a (p×1) vector of unknown regression coefficients, and e is an (n×1) vector of errors such that

     e_i ~ N(0, σ²),   i = 1,...,n.                                 (2)

For the purposes of simulation, x_ij, the (i,j)th element of X, is computed as

     x_ij = (1 - a_j²)^{1/2} z_ij + a_j z_i(p+1),                   (3)

where z_ij and z_i(p+1) are independent pseudo N(0,1) variates, and where a_j is a constant. Note that, if 0 < a_j < 1, then any variable of X, say X_k, has mean zero and unit variance. Further, for X_k and some other variable, say X_m,

     E(x_ik x_im) = a_k a_m.

It follows that a_k a_m is the theoretical correlation between X_k and X_m.

Once a set of predictors has been generated by (3), two choices for β are considered. The motivation for these two choices is this: the mean square error depends on X, σ², β, and, in the case of ridge regression, on k. Assume that σ², k, and X are fixed, so that the mean square error E(L_1²) is regarded as a function of β only. Newhouse and Oman (1971) have observed that, subject to the constraint β'β = 1, E(L_1²) is minimized when β is the normalized eigenvector corresponding to the largest eigenvalue of X'X, and is maximized when β is the normalized eigenvector corresponding to the smallest eigenvalue of X'X. McDonald and Galarneau's two choices for β are these eigenvectors. The results can be idealized as a "best case" and a "worst case" for E(L_1²).

The choice of β was indicated in the Wichern-Churchill study by the "orientation," labeled φ. It is defined as φ = V'β, where V is the eigenvector corresponding to the smallest eigenvalue. Since the eigenvectors are orthogonal to each other, it follows that φ = 1 if β corresponds to the smallest eigenvalue and φ = 0 if β corresponds to the largest eigenvalue.

To complete the model it is only necessary to compute e. Since it is assumed that e_i ~ N(0, σ²), e can be generated as a set of pseudo-random variates once a value for σ² is provided.
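The generating scheme (3) is simple to realize in code. The Fortran sketch below illustrates it for five predictors and thirty observations, with constants corresponding to one of the designs described next; it is an illustration only, and RNORM and RUNIF are crude stand-in generators written for this sketch, not the generators used in the original study.

C     ILLUSTRATIVE SKETCH OF (3): EVERY PREDICTOR IN A ROW SHARES
C     THE VARIATE Z(6), WHICH INDUCES THE CORRELATION A(J)*A(M)
C     BETWEEN PREDICTORS J AND M.
      PROGRAM COLGEN
      REAL X(5), A(5), Z(6)
      INTEGER ISEED
      DATA A /0.99, 0.99, 0.99, 0.10, 0.10/
      ISEED = 12345
      DO 20 I = 1, 30
      DO 5 J = 1, 6
      Z(J) = RNORM(ISEED)
    5 CONTINUE
      DO 10 J = 1, 5
      X(J) = SQRT(1. - A(J)**2)*Z(J) + A(J)*Z(6)
   10 CONTINUE
      WRITE(6,*) (X(J), J = 1, 5)
   20 CONTINUE
      STOP
      END

C     BOX-MULLER TRANSFORM OF TWO UNIFORM VARIATES
      REAL FUNCTION RNORM(ISEED)
      INTEGER ISEED
      U1 = RUNIF(ISEED)
      U2 = RUNIF(ISEED)
      RNORM = SQRT(-2.*ALOG(U1))*COS(6.2831853*U2)
      RETURN
      END

C     SMALL MULTIPLICATIVE CONGRUENTIAL GENERATOR, ADEQUATE FOR
C     ILLUSTRATION ONLY
      REAL FUNCTION RUNIF(ISEED)
      INTEGER ISEED
      ISEED = MOD(171*ISEED, 30269)
      RUNIF = REAL(ISEED)/30269.
      RETURN
      END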
that resulting from ¢ = 0 and that All in all, there were 50 possible combinations of parameters to consider, each of which was repeated 100 times while varying ~· The estimators considered in the Wichern-Churchill study are given in Table 1. The basis for comparison was 100 (total mean square error of Rule m) ::::: (total mean square error of l~ast squares) ,m=l, ... ~S Findings The results of the simulation study are presented in Table 2. A number of overall patterns can be noted. Among the most noticeable of these is the fact that, almost without exception, the extent to which any given ridge solution improves on the mean square error of the least-squares estimator increases as p increases. In other words, ridge regression is useful where variance is high. A second pattern which is evident is that ridge estimators show more improvement over the least squares solution where A /A is large than where it is small. 1 5 This is to be expected, since ridge regression is specifically intended to handle multicollinearity. On the whole, Rule 5 (Vinod's ridge trace estimator) does not show a consistent improvement over the least squares solution. p In fact, for large values of it performs considerably worse, especially where ~ = 1. On the other hand, its impTovement .over the least squares estimator tends to be better than any of the other estimrnators except Rule 2 for small values of p. 101 Rule 2 shares some of the inconsistency of Rule 5, although not to the same degree. again worse for ¢ = 1 than for ¢ !ts performance is = 0. Rules 1 and 4 are the most consistent in terms of mean square error. They are never much worse than the least squares estimator and are often better. Rule 3 is only occasionally much worse than Rules 1 and 4, and can often be somewhat better, especially for low values of p. In the absence of any additional information, Rules 2 and 5 seem the poorest choices to use by themselves, while 1 and 4 seem the best. 102 Table 1· Ridge Solutions Considered in the Study of Wichern and Churchill Rule Definition 1 The Hoerl-Kennard (1970A) conservative estimator: 2 The Law·less-Wang ( 19 76) estimator 3 The Hoerl-Kennard-Baldwin (1975) estimator: 4 The McDonald-Galarneau (1975) estimator: choose k such that (a*(k)) 'Ca*(k)) 5 = 2 &'&-& 1::1/A. 1 J. The estimator of Vinod (1976) :. choose k to minimize 2 I((pA./(A.+k) 2 i)-1) J. 1 J. where -s = IA./(A.+k) 1 J. J. 2 103 Table 2 The Wichern-Churchill Simulation Results {Source: Wichern and Churchill, 1978)