CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

RIDGE REGRESSION:
AN EXAMINATION OF THE BIASING PARAMETER

A thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Health Science, Biostatistics and Epidemiology

by

John R. C. Odencrantz

January, 1979

The Thesis of John Odencrantz is approved:

Madison

Bernard Hanes, Committee Chairman

California State University, Northridge
ACKNOWLEDGMENTS

I would like to thank the members of my committee for their comments and suggestions. In particular, I wish to thank Dr. Bernard Hanes for his support and encouragement, without which this thesis would not have been possible.
TABLE OF CONTENTS

APPROVAL
ACKNOWLEDGMENTS
ABSTRACT

Chapter
1  INTRODUCTION
     BACKGROUND
     PURPOSE
2  REVIEW OF THE LITERATURE
3  OPTIMIZATION AND GENERALIZED RIDGE REGRESSION
     THE CRITERIA FOR OPTIMIZATION
     DERIVING AN OPTIMUM FOR GENERALIZED RIDGE REGRESSION
     ESTIMATING THE RIDGE OPTIMUM
     AN ALTERNATIVE OPTIMUM FOR GENERALIZED RIDGE REGRESSION
4  OPTIMIZING THE ORDINARY RIDGE ESTIMATOR
     RIDGE SOLUTIONS FOR THE E(L1²) CRITERION
     THE E(L1²) SOLUTION OF HOCKING, SPEED, AND LYNN
     MALLOWS' E(L2²) RIDGE SOLUTION
     THE LAWLESS-WANG ESTIMATOR
     THE McDONALD-GALARNEAU ESTIMATOR
     OBENCHAIN'S ESTIMATOR FOR ORDINARY RIDGE REGRESSION
     A REVIEW AND EVALUATION OF THE ORDINARY RIDGE SOLUTIONS
5  CONCLUSIONS

BIBLIOGRAPHY
APPENDIX I
APPENDIX II
APPENDIX III
APPENDIX IV
APPENDIX V
APPENDIX VI
APPENDIX VII
ABSTRACT
RIDGE REGRESSION:
AN EXAMINATION OF THE
BIASING PARAMETER
by
John R. C. Odencrantz
Master of Science in Health Science
Biostatistics and Epidemiology
Ridge regression is an alternative to least squares for highly collinear systems of predictor variables. It differs from least squares in having a biasing parameter, k, added to the main diagonal of the X'X matrix. A number of rules for choosing k have been proposed, all of which give different solutions. Fundamental in applying ridge regression is deciding which rule to use.

This thesis examines some of the proposed methods of choosing the biasing parameter. The two types of ridge regression, generalized ridge and ordinary ridge, are considered separately. Derivations are given for the
different solutions, with stress on the intent and underlying assumptions of each. To permit an evaluation of relative performance, the results of Wichern and Churchill's (1978) simulation study are included.

The generalized ridge solutions derived are those of Hoerl and Kennard (1970) and Hemmerle and Brantle (1978). In its original form, the solution of Hoerl and Kennard was iterative. Hemmerle's (1975) reduction of the Hoerl-Kennard iteration to a single step is included, as is a simpler and more intuitive way of achieving the same result.

Several proposed solutions to the ordinary form of ridge regression are given. One of these (Mallows, 1973) is shown to be incorrect in its final algebraic form, and a numerical approach is suggested instead. An appendix of numerical examples is added.

Finally, some background results relating to ridge regression are included. Among these are a derivation of ordinary ridge regression from a theorem in quadratic response surfaces and a presentation of Marquardt's (1970) "fractional rank" estimator, a technique closely related to ridge regression.
Chapter 1
INTRODUCTION
Background
The standard model for multiple linear regression is

    y = Xβ + e,                                              (1.1)

where X is an (n×p) matrix of predictor variables, y is an (n×1) vector of responses, e is an (n×1) error vector such that E(e) = 0 and E(ee') = σ²I_n, where σ² is an unknown constant, and β is a (p×1) vector of unknown regression coefficients.

The usual solution for (1.1) is the Gaussian least squares estimator

    β̂ = (X'X)⁻¹X'y,                                          (1.2)

where β̂ is (p×1) and E(β̂) = β.
Multiple regression is among the most popular
tools for the analysis of health data.
Typically such
data are extensive and involve many survey variables,
some of them highly correlated.
Thus regression models
which attempt to make full use of the available information will often be multicollinear.
This leads to difficulties: β̂ is so unstable for multicollinear data that even minor perturbations of the data can change the solution drastically (Hoerl, 1962). The mean square error is likely to be unreasonably large, and the β̂ vector tends to have a much greater norm than the β vector it is estimating (Hoerl and Kennard, 1970A).
Under conditions of multicollinearity the
investigator often chooses to drop variables from the
model.
Popular statistical methods of doing this include
stepwise techniques (Efroymson, 1960), calculation of all
possible subsets (Garside, 1971), and regression on principal components (Massy, 1965).
Of these, stepwise methods are the easiest in
terms of computation and interpretation, and are included
in most statistical packages.
However, stepwise methods
are of little use for multicollinear data.
Their intended
function is to eliminate variables with no predicting
power from orthogonal systems of predictor variables.
Principal components regression and selection of
a best predictor subset out of all possible subsets may
both yield satisfactory results subject to the selection
criterion.
Principal components regression (Appendix I)
requires interpretational effort, but its structural simplicity has much to recommend it.
This is especially true
in very large systems where the computation of all possible subsets becomes impractical.
If the purpose of the regression model is to
predict one variable from a set of other variables, dropping predictors makes sense for multicollinear data. A subset of predictor variables will specify the response variable almost as precisely as will the total set.
The problem is that regression, especially in
areas such as epidemiology, is likely to have as its true
intent the explaining of some effect in terms of other
observables.
Since the relationships between the various
predictors are seldom completely understood (otherwise
multicollinearity could be avoided), some loss of explanatory information is bound to accompany reductions in the
model.
The ridge estimator of Hoerl and Kennard (1970A&B)
is another method of handling multicollinearity.
Vari-
ables may still be dropped (Hoerl and Kennard, 1970B;
McDonald and Schwing, 1973), but the emphasis is on transforming the estimators to achieve greater stability and
smaller mean square error.
The ridge estimator is given
by
{1.2)
where I is a (pxp) identity matrix, k is a constant, S*
is (pxl) , and X and
1 are the same as in the least
squares estimator.
The relationship between
A
squares solution
and~*,
~'
the ridge solution, is
the least
4
( 1. 3)
(Hoerl and Kennard, 1970A).
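The relationship in (1.4) is easy to verify numerically. The following is a minimal sketch (not from the thesis) using numpy on simulated collinear data; the data, the value of k, and the variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 3, 0.1

# Simulate a nearly collinear predictor matrix and a response.
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Least squares estimator (1.2).
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge estimator (1.3).
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Relationship (1.4): beta_ridge = [I + k (X'X)^{-1}]^{-1} beta_ls.
beta_from_ls = np.linalg.solve(np.eye(p) + k * np.linalg.inv(X.T @ X), beta_ls)

print(np.allclose(beta_ridge, beta_from_ls))   # True
```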
Since β̂ is unbiased, it follows that the ridge solution is biased for k ≠ 0. Usually k, the biasing parameter, is chosen to minimize the mean square error of β*.
In practice, much of ridge regression centers
around estimating the best biasing parameter.
Hoerl and
Kennard considered this problem in their 1970 papers, and
several authors have proposed solutions since then.
Purpose
The purpose of this thesis is to review the rules
currently proposed for determining the biasing parameter.
The theoretical basis for each rule will be given, and the results of a simulation study comparing some of the estimators (Wichern and Churchill, 1978) will be presented. Solutions considered are those of Hoerl and Kennard (1970A), Mallows (1973), Hemmerle (1975), Hoerl, et al. (1975), McDonald and Galarneau (1975), Hoerl and Kennard (1976), Lawless and Wang (1976), Hocking, et al. (1977), Hemmerle and Brantle (1978), and Obenchain (1978).
Chapter 2
REVIEW OF THE LITERATURE
The effects of multicollinearity are well known,
and have been detailed by Farrar and Glauber (1967),
Hoerl and Kennard (1970A), Snee (1973), and Mason, et al.
(1975).
Several authors have also pointed out the preva-
lence of multicollinearity in real data, and examples may
be found in McDonald and Schwing (1973) and in Gorman and Toman (1966).

Computationally, the problem posed by redundant variables is a fundamental one: it is impossible to invert a singular matrix. It is true that in most cases collinearity does not imply true singularity, but for practical purposes nearly singular matrices may produce useless answers.
Hoerl (1962) suggested the application of response
surface methodology (Box and Wilson, 1951; Hoerl, 1959)
to ill-conditioned regression problems.
His ridge estimator was a solution to the Lagrangian problem of minimizing the residual sum of squares for a given estimator norm (Appendix II). It differed from Gaussian regression in having a biasing parameter, k, added to the main diagonal of the correlation matrix. The only restriction on k was that it be positive, a result of theoretical considerations proposed earlier by Hoerl (1959) and later proven by Draper (1963).
This was the basis of ridge regression, but, as
Hoerl remarked in the same (1962) paper, the theory was
incomplete.
In particular, no proof had been offered
that the ridge estimator was a good one in terms of mean
square error.
Hoerl's (1964) review of ridge analysis
did not include ridge regression except in a comment that
more work was needed.
A systematic development of the method appeared
later in two papers (Hoerl and Kennard, 1970A&B), and
included the following:
(1) rederivation of the ridge estimator, showing it to be of minimum length for a given residual sum of squares (Appendix II),
(2) proof that there exists some ridge estimator having a smaller mean square error than the corresponding least-squares estimator,
(3) a description of a "canonical" form of ridge regression involving transformed variables,
(4) an algorithm for finding a best (in the mean square error sense) estimator for the canonical (generalized) form of ridge regression, and
(5) the graphical ridge trace.
Thus the 1970 papers of Hoerl and Kennard
presented both the theoretical basis for ridge regression
and considerable extensions of the technique.
The
methodology used by McDonald and Schwing (1973) in studying air pollution was precisely that given in the second
of the Hoerl-Kennard papers.
Since the appearance of ridge regression, there
has been interest in its relationship to other biased
estimators.
Marquardt (1970) showed that a number of
properties are shared by ridge regression and his "fractional rank" (Appendix I) generalization of the principal
components estimator.
Goldstein and Smith (1974), extending the work of Lindley and Smith (1972), found that
the ridge solution actually approximates the fractional
rank solution.
Assessments of the relative power of ridge and
other estimators have been made by Mayer and Wilke (1973)
and by Hocking, et al.
(1976).
Mayer and Wilke derived ordinary ridge estimators and shrunken estimators (Stein, 1960; Sclove, 1968) as minimum norm (for a given residual sum of squares) estimators in the class of linear transforms of least squares estimators. They found that shrunken estimators had minimum variance among those studied.

Hocking, et al. carried the generalization still further, finding a class of estimators that included principal components estimators and generalized ridge regression as well as shrunken estimators and ordinary ridge regression. They concluded that generalized ridge was the most effective at minimizing the mean square error.
The determination of an optimal biasing parameter
was considered by Hoerl and Kennard in their fundamental
work.
Their solution was iterative and involved only the
generalized ridge estimator.
Hemmerle (1975) found that
iteration was not necessary, and Hemmerle and Brantle
(1978) developed an alternative solution, again restricted
to generalized ridge regression.
Biasing parameters for the ordinary ridge
estimator have been considered by Mallows (1973),
Farebrother (1975), Hoerl, et al. (1975), Hoerl and Kennard (1976), Lawless and Wang (1976), McDonald and Galarneau (1975), Hocking, et al. (1976), Obenchain (1978), and Wichern and Churchill (1978).
In the following sections, some of the solutions
introduced above will be examined in detail.
Chapter 3
OPTIMIZATION AND GENERALIZED RIDGE REGRESSION
The Criteria for Optimization
The simplest form of the ridge estimator is

    β* = (X'X + kI)⁻¹X'y,                                    (3.1)

where X is an (n×p) matrix of n observations on p predictor variables, y is an (n×1) vector of observations on the response variable, I is a (p×p) identity matrix, k is a constant, and β* is a (p×1) vector of estimators.
The k in (3.1) can in theory take on any positive
value; therefore (3.1) has infinitely many possible solutions.
Since these will not all be equally useful to a
researcher using ridge regression, some means of choosing
a value for k is needed.
This in turn means that the
criteria by which a solution is considered to be a good
one must be established.
Hoerl (1962) and, later, Hoerl and Kennard
(1970A&B) favored stability as a criterion.
Stability,
in the sense of Hoerl and Kennard, meant the extent to which changes in k affect β*; an estimator is stable or unstable in this sense depending on whether the absolute values of the individual terms of dβ*/dk are small or large. The values which β* takes on as k changes are referred to as the ridge trace.
Vinod (1976) objected to this concept of stability because a strict application of it to any problem would lead to the conclusion that the optimal k has an infinitely large value. He proposed a modified ridge trace with the k-axis replaced by an m-axis defined as

    m = p − Σ_i λ_i/(λ_i + k),                               (3.2)

where p is as before the number of independent variables and λ_i is the ith eigenvalue of X'X. The advantage of this modification is that the point of maximal stability for each term d(β*)_i/dm is at some m which corresponds to a finite k.
Since dβ*/dm is a vector, it is of no immediate use as a test statistic. Vinod "scalarized" it through a statistic he termed the Index of Stability of Relative Magnitudes (ISRM):

    ISRM = Σ_i [p(λ_i/(λ_i + k))²/(sλ_i) − 1]²,              (3.3)

where

    s = Σ_i λ_i/(λ_i + k)².

The ISRM is zero for orthogonal predictor systems, nonzero for nonorthogonal systems, and large in absolute value for seriously nonorthogonal systems. To some
extent, it indicates how much a given model resembles an
orthogonal system.
Various considerations of stability or sensitivity
in the estimator have been closely associated with ridge
regression from the inception of the technique.
However,
stability, whether as defined by Hoerl (1962), Hoerl and
Kennard (1970A), or Vinod (1976) cannot be satisfactorily
equated with any statistical concept outside ridge regression.
For this reason, there is interest in finding
other criteria by which a ridge solution can be considered
optimal.
A widely accepted basis for evaluating estimators
is their mean square error.
In the case of ridge regres-
sion, investigators have considered both the ordinary
mean square error defined as
    L1² = (β* − β)'(β* − β),                                 (3.4)

where β* is the ridge estimator and β is equal to the expected value of the least-squares estimator, and the criterion of Stein (1960), defined as

    L2² = (β* − β)'X'X(β* − β).                              (3.5)

The best ridge estimators in the mean square error sense are those which minimize one of the expected values

    E(L1²) = E((β* − β)'(β* − β))                            (3.6)

or

    E(L2²) = E((β* − β)'X'X(β* − β)).                        (3.7)

With some algebra (Appendix III), (3.6) can be expressed either as

    E(L1²) = σ² Trace[(X'X + kI)⁻¹X'X(X'X + kI)⁻¹] + k²β'(X'X + kI)⁻²β    (3.8)

or as

    E(L1²) = Σ_i (σ²λ_i + k²α_i²)/(λ_i + k)²,                (3.9)

where σ² is the residual mean square error and α_i is the ith term of α = Q'β, Q being the matrix of eigenvectors of X'X such that Q'X'XQ = Λ, the matrix of eigenvalues. Similarly, (3.7) can be written either as

    E(L2²) = σ² Trace[X'X(X'X + kI)⁻¹X'X(X'X + kI)⁻¹] + k²β'(X'X + kI)⁻¹X'X(X'X + kI)⁻¹β    (3.10)

or as

    E(L2²) = Σ_i (σ²λ_i² + λ_i k²α_i²)/(λ_i + k)².           (3.11)
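As a quick illustration of (3.9) and (3.11), the sketch below evaluates both expected losses over a grid of k. It is not from the thesis; the eigenvalues, canonical coefficients, error variance, and function names are hypothetical values chosen only to show the computation.

```python
import numpy as np

lam = np.array([2.5, 1.2, 0.25, 0.05])     # eigenvalues of X'X (hypothetical)
alpha = np.array([1.0, -0.5, 0.8, 0.3])     # canonical coefficients alpha_i (hypothetical)
sigma2 = 0.5                                 # error variance (hypothetical)

def e_l1(k):
    # Equation (3.9): E(L1^2) = sum (sigma^2 lam_i + k^2 alpha_i^2) / (lam_i + k)^2
    return np.sum((sigma2 * lam + k**2 * alpha**2) / (lam + k)**2)

def e_l2(k):
    # Equation (3.11): E(L2^2) = sum (sigma^2 lam_i^2 + lam_i k^2 alpha_i^2) / (lam_i + k)^2
    return np.sum((sigma2 * lam**2 + lam * k**2 * alpha**2) / (lam + k)**2)

ks = np.linspace(0.0, 2.0, 201)
print("k minimizing E(L1^2):", ks[np.argmin([e_l1(k) for k in ks])])
print("k minimizing E(L2^2):", ks[np.argmin([e_l2(k) for k in ks])])
print("E(L1^2) at k = 0 (least squares):", e_l1(0.0))
```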
Deriving an Optimum for Generalized Ridge Regression

A comparison of (3.8) and (3.10) with (3.9) and (3.11) shows that (3.9) and (3.11) are algebraically simpler than the other two. For this reason, it is the practice among researchers investigating optima for ridge regression to use these simpler, transformed forms. The X matrix of predictor variables is transformed by postmultiplying it by Q (see Appendix III), which is then substituted into (3.1) in place of X. The resulting estimator is

    α* = ((XQ)'(XQ) + kI)⁻¹(XQ)'y

or, equivalently,

    α* = (Λ + kI)⁻¹Q'X'y,                                    (3.12)

where Λ is the diagonal matrix of eigenvalues of X'X. Let α̂ be the value of α* which corresponds to k = 0, and let α = E(α̂). Then the relationships between α and β, α̂ and β̂, and α* and β* are α = Q'β, α̂ = Q'β̂, and α* = Q'β*. This is a more general case of principal components regression (Appendix I).

The optima for E(L1²) and E(L2²) are found by differentiating (3.9) and (3.11) with respect to k and setting the results equal to zero. Thus

    Σ_i (λ_i k α_i² − λ_i σ²)/(λ_i + k)³ = 0                 (3.13)

and

    Σ_i (λ_i² k α_i² − λ_i² σ²)/(λ_i + k)³ = 0               (3.14)

determine the optima.
Although (3.13) and (3.14) optimize k on the basis of the L1² and L2² criteria, they are not analytic solutions for k, nor can they be solved analytically for the general case. The only way to find a general analytic solution would be to solve each term of (3.13) and (3.14) separately for k, which would mean that in general the k_i for the ith term would not be the same as the k_j for the jth term. Let K be a diagonal matrix whose ith nonzero entry, k_i, is positive but not necessarily the same as its jth nonzero entry, k_j. The generalized ridge estimator is

    α* = (Λ + K)⁻¹Q'X'y.                                     (3.15)

The generalized ridge estimator differs from the ordinary ridge estimator in two respects. First, the requirement that k_i = k_j satisfies the conditions for a Lagrangian system which minimizes the norm of a ridge estimator for a given residual sum of squares (Appendix II). The generalized ridge estimator is thus not of minimum norm. Secondly, it is not generally the case that Q'KQ = K. Therefore, it is not true that

    Qα* = Q(Λ + K)⁻¹Q'X'y                                    (3.16)

is the same as

    Qα* = Q(Λ + Q'KQ)⁻¹Q'X'y.                                (3.17)
This means that the generalized ridge estimator which optimizes E((α* − α)'(α* − α)) will not in general optimize E((Qα* − Qα)'(Qα* − Qα)) (see Appendix III). Consequently, the E(L1²) and E(L2²) criteria are defined, for the generalized ridge estimator, to be

    E(L1²) = E((α* − α)'(α* − α))                            (3.18)

and

    E(L2²) = E((α* − α)'Λ(α* − α)).                          (3.19)

In practice, the same definitions are used in ordinary ridge regression as well. The reason for this is that the data are rescaled so that X'X is a correlation matrix before undergoing a principal components transformation. Transforming the rescaled variates is not a linear transformation of the original data.
To solve (3.18) and (3.19) we have, from (3.13) and (3.14),

    λ_i k_i α_i² − λ_i σ² = 0                                (3.20)

and

    λ_i² k_i α_i² − λ_i² σ² = 0.                             (3.21)

For both, the solution is

    k_i = σ²/α_i².                                           (3.22)
Estimating the Ridge Optimum

Equation (3.22) is expressed in terms of unknown parameters and therefore has to be estimated. The obvious thing would be to replace σ² by σ̂² and α_i² by α̂_i², the least squares estimates, but ill-conditioning will tend to make α̂_i² larger than α_i².
For a more satisfactory solution, Hoerl and Kennard (1970A) suggested the following iterative procedure (a code sketch follows the list):

(1) Estimate k_i using k_i = σ̂²/α̂_i².
(2) Compute α_i* = (λ_i + k_i)⁻¹(Q'X'y)_i.
(3) Compute k_i = σ̂²/(α_i*)².
(4) Repeat (2) and (3) until α_i* and k_i stabilize, i.e., until the iteration no longer changes them.
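A minimal sketch of the procedure follows. It assumes the canonical quantities λ_i, (Q'X'y)_i, and σ̂² have already been computed from the data; the function name, tolerances, and divergence guard are illustrative choices, not code from the thesis.

```python
import numpy as np

def hoerl_kennard_ki(lam_i, qxy_i, sigma2_hat, tol=1e-10, max_iter=500):
    """Iterate steps (1)-(3) for a single canonical coordinate.

    lam_i      : eigenvalue of X'X
    qxy_i      : ith element of Q'X'y
    sigma2_hat : residual mean square from the least squares fit
    Returns (k_i, alpha_i_star), or (inf, 0) if the iteration diverges.
    """
    alpha_hat = qxy_i / lam_i                      # least squares alpha_i
    k = sigma2_hat / alpha_hat**2                  # step (1)
    for _ in range(max_iter):
        alpha_star = qxy_i / (lam_i + k)           # step (2), equation (3.23)
        k_new = sigma2_hat / alpha_star**2         # step (3), equation (3.24)
        if not np.isfinite(k_new) or k_new > 1e12:
            return np.inf, 0.0                     # divergence: alpha_i* shrinks toward zero
        if abs(k_new - k) < tol:                   # step (4): values have stabilized
            return k_new, qxy_i / (lam_i + k_new)
        k = k_new
    return k, qxy_i / (lam_i + k)
```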
Note that the process does not attempt to re-estimate σ̂². The reason for this is that, unconditionally, the maximum likelihood estimate of σ² is σ̂², and because the obvious re-estimation of σ² around α* will always exceed the least squares estimate. (A brief discussion of estimating σ² around the ridge estimator may be found in Obenchain (1978).)

As it happens, the convergence points of k_i and α_i* can be determined without actually iterating.
Hemmerle (1975) first proved this (Appendix IV). Rather than his algebraic proof, a more intuitive approach is presented here.

If the iteration converges somewhere, then

    α_i* = (λ_i + k_i)⁻¹(Q'X'y)_i                            (3.23)

and

    k_i = σ̂²/(α_i*)²                                         (3.24)

must have the same values for k_i and α_i* at the convergence point. The simplest way to find this point is to determine where they do have the same values. Squaring (3.23) and eliminating (α_i*)² from both equations gives

    σ̂²k_i² + (2λ_iσ̂² − (Q'X'y)_i²)k_i + λ_i²σ̂² = 0.          (3.25)

From the quadratic formula, the solution to this is

    k_i = [(Q'X'y)_i² − 2λ_iσ̂² ± √((Q'X'y)_i⁴ − 4λ_iσ̂²(Q'X'y)_i²)] / (2σ̂²),    (3.26)

which has two possible values.
To see which is correct, consider the curves defined by

    C1: (α_i*)² = σ̂²/k_i,   k_i > 0                          (3.27)

and

    C2: (α_i*)² = (λ_i + k_i)⁻²(Q'X'y)_i²,   k_i > 0.        (3.28)

The following are true:
(1) As k_i approaches zero, C1 > C2, and
(2) as k_i goes to infinity, C1 > C2.
Figure A illustrates the case where σ̂²λ_i⁻¹α̂_i⁻² < 1/4. There are two points of intersection, designated k' and k'', between C1 and C2. For this case the following is also true:
(3) The second derivative with respect to k of the difference between the inverses of C1 and C2 is a constant.
Along with relationships (1) and (2), (3) implies that C1 and C2 cannot intersect in more than two points.
The Hoerl-Kennard iterative procedure can be expressed as two recursive formulas:

    k_i := σ̂²/(α_i*)²                                        (3.29)

and

    (α_i*)² := (λ_i + k_i)⁻²(Q'X'y)_i²,                      (3.30)

where the ":=" indicates a computation rather than an identity. Then (3.29) represents a horizontal movement from C2 to C1, while (3.30) represents a vertical movement from C1 to C2. Note that this is true for all positive k_i.
This iterative process specified by (3.29) and (3.30) can be initiated at any positive k_i. If it is initiated precisely at k' or k'' there will of course be no change with iteration. If the starting point is at some k_i < k'' there will be convergence to k', either from the left or from the right. Initial values greater than k'' will cause k_i to increase indefinitely.

Although (3.26) has two possible solutions, only one of the two is associated with convergence. This implies that the minus sign should always be chosen in (3.26).
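The same convergence point can therefore be computed directly from (3.26), taking the minus sign, whenever the discriminant is nonnegative (equivalently σ̂²λ_i⁻¹α̂_i⁻² ≤ 1/4, the cases of Figures A and B below). A small sketch under the same assumptions as the iterative version above; it is illustrative, not code from Hemmerle's paper.

```python
import numpy as np

def hoerl_kennard_ki_explicit(lam_i, qxy_i, sigma2_hat):
    """Closed-form convergence point of the Hoerl-Kennard iteration, equation (3.26)."""
    z2 = qxy_i**2
    disc = z2**2 - 4.0 * lam_i * sigma2_hat * z2
    if disc < 0:                                   # Figure C case: no convergence point
        return np.inf, 0.0
    k = (z2 - 2.0 * lam_i * sigma2_hat - np.sqrt(disc)) / (2.0 * sigma2_hat)
    return k, qxy_i / (lam_i + k)
```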
Figure B shows the case where σ̂²λ_i⁻¹α̂_i⁻² = 1/4. Convergence is from the left only, and C1 and C2 intersect in only one point. For Figure C, σ̂²λ_i⁻¹α̂_i⁻² > 1/4, and there is no solution. If an iteration is attempted, k_i increases indefinitely and α_i* goes to zero.
Appendix IV shows the equivalence of these results with those of Hemmerle.

Figure A: Hoerl-Kennard iteration on the ith term, α_i*, of a generalized ridge estimator; the case σ̂²λ_i⁻¹α̂_i⁻² < 1/4.

Figure B: Hoerl-Kennard iteration on the ith term, α_i*, of a generalized ridge estimator; the case σ̂²λ_i⁻¹α̂_i⁻² = 1/4.

Figure C: Hoerl-Kennard iteration on the ith term, α_i*, of a generalized ridge estimator; the case σ̂²λ_i⁻¹α̂_i⁻² > 1/4.

An Alternative Optimum for Generalized Ridge Regression

The iteration of Hoerl and Kennard is based on the idea of finding a theoretical optimum (k_i = σ²/α_i²) and then estimating that optimum. An alternative method, given by Hemmerle and Brantle (1978), is to find an estimator for the optimization criterion and then optimize the estimator. The following derivation is based on the paper of Hemmerle and Brantle (1978):
From (3.9),

    E((α* − α)'(α* − α)) = Σ_i (σ²λ_i + k_i²α_i²)/(λ_i + k_i)².    (3.31)

Since α_i* = (λ_i/(λ_i + k_i))α̂_i, then

    E((α_i* − α̂_i)²) = E((1 − λ_i/(λ_i + k_i))²α̂_i²) = (k_i/(λ_i + k_i))²(σ²/λ_i + α_i²),

therefore

    E((α* − α̂)'(α* − α̂)) = Σ_i k_i²(σ²/λ_i + α_i²)/(λ_i + k_i)².   (3.32)

Combining (3.32) and (3.31),

    E((α* − α)'(α* − α)) = E((α* − α̂)'(α* − α̂)) + σ²Σ_i (λ_i − k_i)/(λ_i(λ_i + k_i)).    (3.33)

Recalling that the λ_i are diagonal elements of Λ, E(L1²) can be estimated by

    L̂1² = (α* − α̂)'(α* − α̂) + σ̂² Trace[(Λ − K)(Λ(Λ + K))⁻¹].       (3.34)

Similarly,

    E((α* − α)'Λ(α* − α)) = E((α* − α̂)'Λ(α* − α̂)) + σ² Trace[(Λ − K)(Λ + K)⁻¹],    (3.35)

so that

    L̂2² = (α* − α̂)'Λ(α* − α̂) + σ̂² Trace[(Λ − K)(Λ + K)⁻¹]

is an unbiased estimator of E(L2²).
Define

    v_i = λ_i/(λ_i + k_i)                                    (3.36)

so that α_i* = α̂_i v_i. The ith component, M_i, of (3.34) may be written as

    M_i = α̂_i²(v_i − 1)² + σ̂²(2v_i − 1)/λ_i,                 (3.37)

where

    L̂1² = Σ_i M_i   and   L̂2² = Σ_i λ_i M_i.

To minimize L̂1² and L̂2², differentiate M_i with respect to v_i:

    dM_i/dv_i = 2α̂_i²(v_i − 1) + 2σ̂²/λ_i = 0,                (3.38)

or, equivalently,

    v_i = 1 − σ̂²/(λ_i α̂_i²).                                 (3.39)
Since v_i as defined in (3.36) must lie between zero and one, it follows that (3.39) cannot be used for optimization if σ̂²/(λ_i α̂_i²) > 1. However, note that M_i is quadratic in v_i and so increases monotonically as v_i moves away from the minimum point. Since the object is to minimize M_i, it follows that v_i should be as close as possible to the optimum point. Hence the solution is

    v_i = 1 − σ̂²/(λ_i α̂_i²)   if σ̂²/(λ_i α̂_i²) ≤ 1,
    v_i = 0                    if σ̂²/(λ_i α̂_i²) > 1,         (3.40)

which corresponds to

    α_i* = α̂_i(1 − σ̂²/(λ_i α̂_i²))   if σ̂²/(λ_i α̂_i²) ≤ 1,
    α_i* = 0                          if σ̂²/(λ_i α̂_i²) > 1.  (3.41)
The case where σ̂²/(λ_i α̂_i²) > 1 is similar to the case where α_i* is constrained for other reasons, such as taking into account prior information about α_i. Assuming the constraint excludes the optimum from the permissible solution region, α_i* will lie on the boundary. For example, if α_i* is constrained so that α_i* ≥ A and if α̂_i(1 − σ̂²/(λ_i α̂_i²)) < A, then the solution will be α_i* = A.

In practice, it is unlikely that constraints will be applied directly to the α_i*, but it is quite possible that the β_i* will be constrained, since there could easily be prior information about the β_i (recall that the β_i are related to the nontransformed predictor variables). Optimization when the β_i* are constrained is difficult, however, because inequalities become complicated under linear transformations. A constraint on one β_i* will transform into constraints on several α_i*, with the possible solutions for any one variable partially dependent on what solutions are chosen for the other variables. Hemmerle and Brantle (1978) considered this problem and proposed a quadratic programming algorithm as a solution. The details are given in their paper.
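In the unconstrained case, the Hemmerle-Brantle solution (3.40)-(3.41) can be written in a few lines. The sketch below is illustrative only (it is not code from their paper); it takes assumed canonical least squares estimates and returns the explicit generalized ridge estimates, setting components to zero where σ̂²/(λ_i α̂_i²) exceeds one.

```python
import numpy as np

def hemmerle_brantle(alpha_hat, lam, sigma2_hat):
    """Explicit generalized ridge estimates, equation (3.41), unconstrained case."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    lam = np.asarray(lam, dtype=float)
    ratio = sigma2_hat / (lam * alpha_hat**2)      # sigma^2 / (lambda_i alpha_i^2)
    v = np.where(ratio <= 1.0, 1.0 - ratio, 0.0)   # equation (3.40)
    return v * alpha_hat                           # alpha_i* = v_i * alpha_hat_i

# Example with hypothetical canonical quantities:
alpha_star = hemmerle_brantle([1.2, -0.4, 0.05], [3.0, 0.8, 0.02], 0.5)
print(alpha_star)
```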
Chapter 4

OPTIMIZING THE ORDINARY RIDGE ESTIMATOR

The preceding chapter motivated the generalized ridge estimator through the impossibility of obtaining algebraic solutions to (3.13) or (3.14). Nonetheless, the ordinary ridge estimator is considered useful for certain types of problems, so optimizing it is of some interest. A number of solutions have been proposed, and some of them will be considered in this chapter.

Ridge Solutions for the E(L1²) Criterion
The condition for minimizing E(L1²) is given by equation (3.13). An algebraic solution for (3.13) does not exist, but it is possible to solve it numerically. The obvious choice would be Newton-Raphson iteration. Recall that (3.13) is

    f(k) = Σ_i (λ_i k α_i² − λ_i σ²)/(λ_i + k)³ = 0.

Then

    f'(k) = Σ_i (3σ²λ_i + λ_i α_i²(λ_i − 2k))/(λ_i + k)⁴,    (4.1)

and the solution is

    k_{j+1} = k_j − f(k_j)/f'(k_j),                          (4.2)

where k_j is the value of k at the jth iteration and k_1 = 0. Iteration continues until convergence is achieved.
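A direct transcription of (4.1)-(4.2) into code is straightforward. The sketch below assumes the α_i have been fixed at some estimate (for example, least squares or Hoerl-Kennard values); the function name, tolerance, and iteration limit are illustrative assumptions, not the thesis's own code.

```python
import numpy as np

def newton_raphson_k(lam, alpha, sigma2, k0=0.0, tol=1e-10, max_iter=100):
    """Solve equation (3.13) for k by Newton-Raphson, using (4.1) and (4.2)."""
    lam = np.asarray(lam, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    k = k0
    for _ in range(max_iter):
        f = np.sum((lam * k * alpha**2 - lam * sigma2) / (lam + k)**3)                        # (3.13)
        fprime = np.sum((3 * sigma2 * lam + lam * alpha**2 * (lam - 2 * k)) / (lam + k)**4)   # (4.1)
        k_new = k - f / fprime                                                                # (4.2)
        if abs(k_new - k) < tol:
            return k_new
        k = k_new
    return k
```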
The difficulty with this solution is that it is time-consuming. Since the α_i must be estimated, and since α̂_i has too large an absolute value for ill-conditioned data, the Newton-Raphson iteration must be nested within the framework of a Hoerl-Kennard iteration, which naturally involves a great deal of computation. For this reason the solution has not been widely used. Dempster, et al. (1977) have raised other objections to its use, but not in sufficient detail to permit an evaluation of their merit.

In contrast to this doubly iterative procedure, a solution proposed by Wichern and Churchill (1978) is extremely simple but admittedly not optimal. As was pointed out in Hoerl and Kennard (1970A), (3.13) is negative for k < σ²/(α_max)², where |α_max| is the largest of the |α_i|. Since this is so, and since (3.13) is negative from k = 0 up to the point of minimization, a solution based on k = σ²/(α_max)² will have a smaller mean square error than the least squares solution. This suggests

    k = σ̂²/(α̂_max)²                                          (4.3)

as a possible biasing parameter. Since the mean square error will be minimized by σ²/(α_max)² < k < σ²/(α_min)², (4.3) is somewhat conservative, meaning that it does not produce as much bias as would be theoretically optimal.

Wichern and Churchill attribute (4.3) to Hoerl and Kennard, although Hoerl and Kennard did not actually suggest it as an estimator. To avoid confusion, it will be referred to here as the Hoerl-Kennard conservative estimator.
Hoerl, Kennard, and Baldwin (1975) found an algebraic solution to (3.13) by assuming, somewhat unrealistically, that X'X is an identity matrix. In that case λ_i = 1 for all i, so (3.13) gives

    kΣ_i α_i² − pσ² = 0,                                     (4.4)

and then

    k = pσ²/Σ_i α_i² = pσ²/α'α.                              (4.5)

Another way to obtain this result, also given in Hoerl, et al. (1975), is to use the harmonic mean of the optimal k_i given by equation (3.22), i.e., k_i = σ²/α_i². If k_h is the harmonic mean, then

    1/k_h = (1/p)Σ_i 1/k_i = (1/p)Σ_i α_i²/σ² = (1/(pσ²))Σ_i α_i².    (4.6)

Therefore, k_h = pσ²/β'β.

Hoerl, et al. proposed that the least-squares estimator should be used for β'β in (4.4). Later, Hoerl and Kennard (1976) noted that, since β̂'β̂ is not a good estimate of β'β when the data are ill-conditioned, an iterative process similar to their earlier one (1970A) should be used.
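The Hoerl-Kennard-Baldwin value k = pσ̂²/β̂'β̂ requires only the least squares fit. A minimal sketch, assuming X and y are already centered and scaled so that X'X is in correlation form; the function name and degrees-of-freedom choice are illustrative assumptions.

```python
import numpy as np

def hkb_k(X, y):
    """Hoerl-Kennard-Baldwin biasing parameter, k = p * sigma2_hat / (beta_hat' beta_hat)."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)   # residual mean square; use n - p - 1 if an intercept was removed
    return p * sigma2_hat / (beta_hat @ beta_hat)
```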
The E(L1²) Solution of Hocking, Speed, and Lynn

Hocking, et al. (1977) considered a class of estimators expressible as

    α* = Bα̂,                                                 (4.7)

where α̂ is, as before, the transformed least-squares linear estimator, and where B is a (p×p) diagonal matrix. For ridge regression, the ith element of B is

    b_i = (1 + k_i/λ_i)⁻¹,                                   (4.8)

which is equivalent to

    k_i = λ_i(1 − b_i)/b_i.                                  (4.9)

For ordinary ridge regression, k_i = k for every i. In other words,

    λ_i(1 − b_i)/b_i = λ_p(1 − b_p)/b_p,   i = 1, ..., p−1.  (4.10)

There is thus a constant ratio between λ_i and b_i/(1 − b_i). From this, Hocking, et al. (1977) suggested that the k_i could be combined through a least-squares formula, i.e.,

    k = (Σ_i λ_i b_i/(1 − b_i)) / (Σ_i b_i²/(1 − b_i)²).     (4.11)

This is derived from setting λ_i as the dependent variable and b_i/(1 − b_i) as the independent variable. It would be equally feasible to do the reverse, obtaining

    k = (Σ_i λ_i b_i/(1 − b_i)) / Σ_i λ_i².                  (4.12)

It is necessary to determine values for the b_i before (4.11) can be evaluated. Hocking, et al. followed the procedure of Hoerl, et al. (1975) and used the individual optima of (3.22): k_i = σ²/α_i². This can also be given as k_i = λ_iσ²/(λ_iα_i²), so, in accord with (4.9),

    (1 − b_i)/b_i = σ²/(λ_i α_i²).                           (4.13)

Hence (4.11) and (4.12) become, respectively,

    k = (Σ_i λ_i²α_i²/σ²)/(Σ_i λ_i²α_i⁴/σ⁴) = σ²(Σ_i λ_i²α_i²)/(Σ_i λ_i²α_i⁴)    (4.14)

and

    k = (Σ_i λ_i²α_i²)/(σ²Σ_i λ_i²).                         (4.15)

Least-squares estimates could be used for the α_i; alternatively, an iterative process similar to those of Hoerl and Kennard could be used.
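Equation (4.14) combines the individual optima into a single k using only the eigenvalues, the canonical estimates of the α_i, and σ̂². A brief sketch with illustrative names; it is an assumption of how the formula would be coded, not code from the paper.

```python
import numpy as np

def hocking_speed_lynn_k(lam, alpha_hat, sigma2_hat):
    """Combined biasing parameter of equation (4.14)."""
    lam = np.asarray(lam, dtype=float)
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    num = np.sum(lam**2 * alpha_hat**2)
    den = np.sum(lam**2 * alpha_hat**4)
    return sigma2_hat * num / den
```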
Mallows' E(L2²) Ridge Solution

Some of the results for the E(L1²) criterion can be extended to E(L2²): the Newton-Raphson technique is applicable to (3.14), for example, and the results of Hocking, et al. can apply to either criterion. A more interesting approach is that of Mallows (1973).

Mallows' solution begins with the "scaled summed mean squared error," defined as

    J_k = (1/σ²)(β* − β)'X'X(β* − β).                        (4.16)

It should be noticed that J_k differs from L2² only by the constant 1/σ². From Appendix III,

    E(J_k) = V_k + B_k/σ²,                                   (4.17)

where

    V_k = Σ_i λ_i²/(λ_i + k)²                                (4.18)

and

    B_k = k²Σ_i λ_iα_i²/(λ_i + k)².                          (4.19)

The residual sum of squares is

    RSS_k = (y − Xβ*)'(y − Xβ*),                             (4.20)

from which

    E(RSS_k) = σ²V*_k + B_k,                                 (4.21)

where

    V*_k = n − 2Σ_i λ_i/(λ_i + k) + Σ_i λ_i²/(λ_i + k)².     (4.22)

The estimator of E(J_k) is thus

    Ĵ_k = RSS_k/σ̂² − V*_k + V_k,                             (4.23)

which is to be minimized. Since the residual sum of squares about the least-squares model, RSS_0, is constant, it can be disregarded in the minimization, as can the constant n. Let

    T_k = (RSS_k − RSS_0)/σ̂² + 2Σ_i λ_i/(λ_i + k)            (4.24)

be the variable to be minimized. From (4.20), (4.23), and (4.24),

    T_k = (1/σ̂²)(β̂ − β*)'X'X(β̂ − β*) + 2Σ_i λ_i/(λ_i + k).   (4.26)

Recall that Q was defined earlier as the matrix of eigenvectors such that Q'X'XQ = Λ, the diagonal matrix of eigenvalues λ_i of X'X. Transforming (4.26) by Q and letting z = Q'X'y,

    T_k = (1/σ̂²)Σ_i λ_i(z_i/(λ_i + k) − z_i/λ_i)² + 2Σ_i λ_i/(λ_i + k).    (4.27)

To minimize this with respect to k,

    (2/σ̂²)Σ_i k z_i²/(λ_i + k)³ − 2Σ_i λ_i/(λ_i + k)² = 0,   (4.28)

for which Mallows proposed a closed-form solution, given as (4.29) and (4.30). This solution does not seem to be correct. However, (4.28) can be solved numerically (Appendix V).
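In place of the questionable closed form, (4.27) can simply be minimized over a grid of k (or (4.28) handed to any one-dimensional root finder). A sketch of the grid approach, assuming z = Q'X'y, the eigenvalues, and σ̂² are available; the grid and function name are illustrative choices.

```python
import numpy as np

def mallows_k(lam, z, sigma2_hat, k_grid=None):
    """Minimize the estimated scaled summed mean squared error, equation (4.27), over k."""
    lam = np.asarray(lam, dtype=float)
    z = np.asarray(z, dtype=float)
    if k_grid is None:
        k_grid = np.linspace(0.0, 5.0, 2001)
    def t_k(k):
        return (np.sum(lam * (z / (lam + k) - z / lam)**2) / sigma2_hat
                + 2.0 * np.sum(lam / (lam + k)))
    values = np.array([t_k(k) for k in k_grid])
    return k_grid[np.argmin(values)]
```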
Mallows' solution is somewhat similar to the
generalized ridge solution of Hemmerle and Brantle (1978)
in that it optimizes an estimator rather than estimating
a theoretical optimum.
The Lawless-Wang Estimator

Lawless and Wang (1976) derive a solution for ordinary ridge regression as follows. Defining α as before to be the transformation of β, i.e., α = Q'β, suppose α is a multivariate normal random variable with a distribution defined by

    α ~ N(0_p, σ_α²I_p),                                     (4.31)

where p is the dimensionality of β. The Bayesian estimator for α is α*, where

    α_i* = (λ_i/(λ_i + σ²/σ_α²))α̂_i,   i = 1, ..., p         (4.32)

(Goldstein and Smith, 1974). To estimate σ²/σ_α², first notice that if α ~ N(0_p, σ_α²I_p), then

    E(α̂_i²) = σ²/λ_i + σ_α²

and

    E(Σ_i λ_iα̂_i²/σ²) = p + (Σ_i λ_i)σ_α²/σ².

Since X'X is in correlation form, Σ_i λ_i = p. Therefore,

    E(Σ_i λ_iα̂_i²/(pσ²)) − 1 = σ_α²/σ².

Since σ_α² is expected to be much greater than σ², σ_α²/σ² can be estimated by Σ_i λ_iα̂_i²/(pσ̂²). The Lawless-Wang optimum is thus

    k = pσ̂²/Σ_i λ_iα̂_i².                                     (4.33)
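The Lawless-Wang value (4.33) is a one-line computation once the canonical quantities are available. A minimal sketch with illustrative names (an assumption of how it would be coded, not the authors' code):

```python
import numpy as np

def lawless_wang_k(lam, alpha_hat, sigma2_hat):
    """Lawless-Wang biasing parameter, k = p * sigma2_hat / sum(lam_i * alpha_hat_i^2)."""
    lam = np.asarray(lam, dtype=float)
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    p = lam.size
    return p * sigma2_hat / np.sum(lam * alpha_hat**2)
```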
The McDonald-Galarneau Estimator

In contrast to the estimators discussed above, the estimator of McDonald and Galarneau (1975) is not based on considerations of mean square error. Instead, it is intended to find the solution whose norm is as close to the norm of β as possible. To determine this solution, note that

    E(β̂'β̂) = β'β + σ²Σ_i 1/λ_i,

or, equivalently,

    β'β = E(β̂'β̂) − σ²Σ_i 1/λ_i.

Thus, an estimator of β'β is

    β̂'β̂ − σ̂²Σ_i 1/λ_i.                                       (4.34)

Equation (4.34) cannot be solved algebraically for the general case, and McDonald and Galarneau suggested a trial-and-error process involving 201 values for the interval (0,1). Iteration would be a more efficient means, however, and the Newton-Raphson method will work for this problem. Specifically, the problem is to find k such that

    β*'β* = β̂'β̂ − σ̂²Σ_i 1/λ_i,                               (4.35)

or, equivalently,

    f(k) = Σ_i (λ_i + k)⁻²(Q'X'y)_i² − β̂'β̂ + σ̂²Σ_i 1/λ_i = 0.    (4.36)

Iterate as follows:

    k_{j+1} = k_j − f(k_j)/f'(k_j),

where

    f'(k_j) = −2Σ_i (λ_i + k_j)⁻³(Q'X'y)_i²,

k_j being the value of k at the jth iteration.
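The Newton-Raphson scheme can be written directly from (4.36). The sketch below assumes the canonical quantities are available and starts from k = 0; it is an illustration rather than the authors' code. If (4.34) is nonpositive (which can happen with very noisy data), no positive k satisfies (4.35) and the function returns None.

```python
import numpy as np

def mcdonald_galarneau_k(lam, qxy, beta_hat, sigma2_hat, tol=1e-10, max_iter=200):
    """Solve equation (4.35) for k by Newton-Raphson, using (4.36)."""
    lam = np.asarray(lam, dtype=float)
    qxy = np.asarray(qxy, dtype=float)          # elements of Q'X'y
    target = beta_hat @ beta_hat - sigma2_hat * np.sum(1.0 / lam)   # the estimator (4.34)
    if target <= 0:
        return None                              # no positive k can match a nonpositive norm
    k = 0.0
    for _ in range(max_iter):
        f = np.sum(qxy**2 / (lam + k)**2) - target
        fprime = -2.0 * np.sum(qxy**2 / (lam + k)**3)
        k_new = k - f / fprime
        if k_new < 0:
            k_new = 0.0
        if abs(k_new - k) < tol:
            return k_new
        k = k_new
    return k
```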
Obenchain's Estimator for Ordinary Ridge Regression

Suppose it is assumed that a nearly exact estimate of β lies somewhere along the ridge trace, but that it is unclear just where. The method of Obenchain (1978) permits a β* to be determined for any desired probability level, say, f, that β* is a better estimate than β̂, the least squares estimator. Consider a Scheffé confidence ellipsoid (Scheffé, 1959) about the least squares estimator. Searle (1971) gives as an f-level confidence region about β̂

    (β − β̂)'X'X(β − β̂) ≤ pσ̂²F(p, n−p−1; f),                  (4.37)

where p is the number of independent variables, n is the number of observations, and F(p, n−p−1; f) is the value of the F distribution with p and (n−p−1) degrees of freedom, having a probability of f. If β̂ is the least-squares estimator, this ellipsoid covers the true unknown value of β with probability 1−f.

The ridge trace is a subset of points in the β-space. It may be visualized as a path running from the least-squares estimator (where k = 0) to the point β = 0 (where k is infinite). Assuming it intersects the boundary of the confidence ellipsoid, it will do so at only one point. This point Obenchain chooses as his estimator. Expressed as a formula, his solution is: choose β* such that

    (β* − β̂)'X'X(β* − β̂) = pσ̂²F(p, n−p−1; f).                (4.38)

Obenchain's solution must be interpreted with a certain amount of caution. The F value is not the probability that β* is a better estimate of β than is β̂, because points outside the confidence ellipsoid may very well be closer to β̂ than to β* (see Figure D).

Figure D: Obenchain's method: ridge trace and accompanying 100(1−f)% confidence region for a two-variable example. (Adapted from Obenchain, 1978)
Further, if it is assumed that β lies somewhere along the ridge trace, the F value is still not the probability that β* is a better estimate than is β̂. The reason is that a distribution (in this case the F) which holds for a space will not generally hold for a one-dimensional path through that space.

In practice, it is easiest to evaluate (4.38) by choosing an arbitrary β* and then determining the associated F value. This fact, as well as the fact that a satisfactory probability would be difficult to set without some prior knowledge of the associated k, makes it difficult to view Obenchain's estimator as a point solution. It appears instead to be an additional means of examining the ridge trace.
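One way to evaluate (4.38) numerically is to treat the left-hand side as a function of k along the ridge trace and find where it equals the right-hand side. The sketch below does this with a simple root finder, using scipy's F quantile for F(p, n−p−1; f). It is an illustration of the idea under assumed names and degrees of freedom, not Obenchain's own procedure.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import f as f_dist

def obenchain_beta(X, y, prob):
    """Ridge solution on the boundary of the confidence ellipsoid, equation (4.38)."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta_hat = np.linalg.solve(XtX, Xty)
    sigma2_hat = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p - 1)
    rhs = p * sigma2_hat * f_dist.ppf(prob, p, n - p - 1)

    def gap(k):
        beta_star = np.linalg.solve(XtX + k * np.eye(p), Xty)
        d = beta_star - beta_hat
        return d @ XtX @ d - rhs

    if gap(0.0) >= 0 or rhs >= beta_hat @ XtX @ beta_hat:
        return None                     # the trace never reaches the ellipsoid boundary
    k_star = brentq(gap, 0.0, 1e8)      # the single crossing point
    return np.linalg.solve(XtX + k_star * np.eye(p), Xty)
```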
Vinod's Estimator for Ordinary Ridge Regression

Vinod's estimator for k was presented in the preceding chapter. To repeat it here, it is as follows: find a k which minimizes

    ISRM = Σ_i [p(λ_i/(λ_i + k))²/(sλ_i) − 1]²,              (4.39)

where

    s = Σ_i λ_i/(λ_i + k)².

This cannot be solved algebraically, and so must be estimated. Wichern and Churchill (1978) examined a range of values and chose the most satisfactory solution. Assuming there was only one local minimum, Newton-Raphson iteration would work, since (4.39) can be differentiated with respect to k. The result is a bit complicated, but a convergence scheme is more efficient than examining a set of values and choosing the smallest one.
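Since the derivative of (4.39) is messy, a simple alternative is to hand the ISRM to a bounded scalar minimizer, which plays the role of the convergence scheme mentioned above. A sketch with illustrative names and an assumed search interval; it is not code from Vinod or from Wichern and Churchill.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def isrm(k, lam):
    """Vinod's Index of Stability of Relative Magnitudes, equation (4.39)."""
    lam = np.asarray(lam, dtype=float)
    p = lam.size
    s = np.sum(lam / (lam + k)**2)
    return np.sum((p * (lam / (lam + k))**2 / (s * lam) - 1.0)**2)

def vinod_k(lam, k_max=10.0):
    """Minimize the ISRM over k in (0, k_max) with a bounded scalar minimizer."""
    result = minimize_scalar(isrm, bounds=(1e-8, k_max), args=(lam,), method="bounded")
    return result.x
```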
A Review and Evaluation of the Ordinary Ridge Solutions

Because of the number of solutions for ordinary ridge regression and the length of some of the derivations, a quick overview would be helpful.

Most of the estimators presented above are intended to minimize the mean square error, either in a classical or a Bayesian sense. The exceptions to this are Vinod's (1975), which is based on stability; McDonald and Galarneau's (1975), which is based on estimator norm; and Obenchain's (1978), which is based on confidence intervals.

Of the estimators based on the classical mean square error (E(L1²)), the Hoerl-Kennard theoretical optimum and the Hoerl-Kennard conservative estimate were derivable without requiring further assumptions. The Hoerl-Kennard conservative solution is intended to be better than the least-squares solution but does not minimize the mean square error.

The solutions of Hoerl, et al. (1975), Hoerl and Kennard (1976), and Hocking, et al. (1977) are intended to minimize the mean square error. However, they are all arbitrary to a certain degree, or else require additional assumptions such as the assumption by Hoerl, et al. that the predictor variables are uncorrelated.

The solutions of Lawless and Wang (1976) and Mallows (1973) are optimal in a Bayesian sense. The Bayesian assumption for Lawless and Wang's solution was given in its derivation, while that of Mallows' solution is implicit in the E(L2²) criterion itself. Stein (1960) and Efron and Morris (1973) detail this Bayesian approach.

The question of how well these estimators perform in practice has been considered by McDonald and Galarneau (1975), Hemmerle and Brantle (1978), and Wichern and Churchill (1978). All three studies used the simulation method of McDonald and Galarneau (1975).

Wichern and Churchill's study was the most thorough and comprehensive, comparing the least squares estimator with five ridge solutions. Of these five, the McDonald-Galarneau estimator and the Hoerl-Kennard conservative estimator were the most consistent in reducing the mean square error associated with the least squares solution.

The other ridge estimators considered were those of Hoerl, et al. (1975), Lawless and Wang (1976), and Vinod (1975). All three were quite variable in terms of performance, especially Vinod's. Appendix VII presents the results of the Wichern-Churchill study in more detail.
Chapter 5

CONCLUSIONS

As was indicated in Chapter 1, the linear least squares estimator

    β̂ = (X'X)⁻¹X'y

suffers from a number of drawbacks when the predictor variables X_i are collinear. These drawbacks include instability for minor changes in data, β̂_i whose absolute values are too large, and a large mean square error. Yamamura (1977) reviews and discusses these problems.

Among the techniques which can be used to deal with collinearity is ridge regression, a term which actually refers to two closely related estimators. These are the ordinary ridge estimator

    β* = (X'X + kI)⁻¹X'y

and the generalized ridge estimator

    β* = Q(Λ + K)⁻¹Q'X'y,

where Λ is the matrix of eigenvalues of X'X, Q is the matrix of associated eigenvectors, and K is a diagonal matrix whose nonzero terms k_i are positive but not necessarily equal.

An important problem in ridge regression is choosing the biasing parameter (k for ordinary ridge, k_i for generalized ridge).

In the case of generalized ridge this comes down to a choice between two possible solutions: one based on estimating an optimum and the other based on optimizing an estimator. There is no reason to consider either of these superior in general, though the second is more conservative and may be better for extremely large variances.

Ordinary ridge regression has a known theoretical optimum, but it is difficult to solve. As a result, there is interest in finding other solutions, often based on combinations of the optimal k_i from the generalized ridge estimates.

Due to the arbitrary nature of these combinations, it is hard to justify them or evaluate them theoretically; therefore comparisons based on simulation are of considerable importance. Some work of this sort has already been done, but no estimator evaluated so far has been consistently superior to the others, and it is likely that a consistently superior ordinary ridge solution will not be found. The current feeling seems to be that more than one solution should be examined when ridge regression is used.
BIBLIOGRAPHY

1. Anderson, T. W. Introduction to Multivariate Statistical Analysis. New York: John Wiley and Sons, 1958.
2. Box, G. E. P. and Wilson, K. B. "On the Experimental Attainment of Optimum Conditions." Journal of the Royal Statistical Society, Series B. 13: 1-45, 1951.
3. Brown, P. J. "Centering and Scaling in Ridge Regression." Technometrics. 19: 35-36, 1977.
4. Dempster, A. P., Schatzoff, M., and Wermuth, N. "A Simulation Study of Alternatives to Ordinary Least Squares." Journal of the American Statistical Association. 72: 77-90, 1977.
5. Draper, N. R. "Ridge Analysis of Response Surfaces." Technometrics. 5: 469-479, 1963.
6. Efron, B. and Morris, C. "Stein's Rule and Its Competitors--An Empirical Bayes Approach." Journal of the American Statistical Association. 68: 117-130, 1973.
7. Efroymson, M. A. "Multiple Regression Analysis." Mathematical Methods for Digital Computers, Vol. 1. A. Ralston (ed). New York: John Wiley and Sons, 1960, pp. 191-203.
8. Farrar, D. E. and Glauber, R. R. "Multicollinearity in Regression Analysis: the Problem Revisited." The Review of Economics and Statistics. 49: 92-107, 1967.
9. Garside, M. J. "The Best Subset in Multiple Regression Analysis." Applied Statistics. 14: 196-200, 1965.
10. Goldstein, M. and Smith, A. F. M. "Ridge-Type Estimators for Regression Analysis." Journal of the Royal Statistical Society, Series B. 36: 284-291, 1974.
11. Hemmerle, W. J. "An Explicit Solution for Generalized Ridge Regression." Technometrics. 17: 309-314, 1975.
12. Hemmerle, W. J. and Brantle, T. F. "Explicit and Constrained Generalized Ridge Estimators." Technometrics. 20: 109-120, 1978.
13. Hocking, R. R., Speed, F. M., and Lynn, M. J. "A Class of Biased Estimators in Linear Regression." Technometrics. 18: 425-438, 1976.
14. Hoerl, A. E. "Optimum Solution of Many Variables Equations." Chemical Engineering Progress. 55: 69-78, 1959.
15. Hoerl, A. E. "Applications of Ridge Analysis to Regression Problems." Chemical Engineering Progress. 58: 54-59, 1962.
16. Hoerl, A. E. and Kennard, R. W. "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics. 12: 55-67, 1970A.
17. Hoerl, A. E. and Kennard, R. W. "Ridge Regression: Application to Nonorthogonal Problems." Technometrics. 12: 69-82, 1970B.
18. Hoerl, A. E. and Kennard, R. W. "Ridge Regression: Iterative Estimation of the Biasing Parameter." Communications in Statistics. 5: 77-88, 1976.
19. Hoerl, A. E., Kennard, R. W., and Baldwin, K. F. "Ridge Regression: Some Simulations." Communications in Statistics. 4: 105-123, 1975.
20. Hotelling, H. "Analysis of a Complex of Statistical Variables Into Principal Components." Journal of Educational Psychology. 24: 417-441; 498-520, 1933.
21. Hotelling, H. "Simplified Calculation of Principal Components." Psychometrika. 1: 27-35, 1936.
22. Kempthorne, O. Discussion on a paper by D. V. Lindley and A. F. M. Smith. Journal of the Royal Statistical Society, Series B. 34: 33-36, 1972.
23. Lawless, J. F. and Wang, P. "A Simulation Study of Ridge and Other Estimators." Communications in Statistics. 5: 307-323, 1976.
24. Lindley, D. V. and Smith, A. F. M. "Bayes Estimators for the Linear Model" (with discussion). Journal of the Royal Statistical Society, Series B. 34: 1-41, 1972.
25. Mallows, C. L. "Some Comments on C_p." Technometrics. 15: 661-675, 1973.
26. Marquardt, D. W. "Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation." Technometrics. 12: 591-611, 1970.
27. Mason, R. L., Gunst, R. F., and Webster, J. T. "Regression Analysis and Problems with Multicollinearity." Communications in Statistics. 4: 277-292, 1975.
28. Massy, W. F. "Principal Components Regression in Exploratory Statistical Research." Journal of the American Statistical Association. 60: 234-256, 1965.
29. Mayer, L. S. and Wilke, T. A. "On Biased Estimation in Linear Models." Technometrics. 15: 497-508, 1973.
30. McDonald, G. C. and Galarneau, D. J. "A Monte Carlo Evaluation of Some Ridge-Type Estimators." Journal of the American Statistical Association. 70: 407-416, 1975.
31. McDonald, G. C. and Schwing, R. C. "Instabilities of Regression Estimates Relating Air Pollution to Mortality." Technometrics. 15: 463-481, 1973.
32. Morrison, D. F. Multivariate Statistical Methods. New York: McGraw-Hill, 1967.
33. Newhouse, J. P. and Oman, S. D. "An Evaluation of Ridge Estimators." Rand Report No. R-716-PR: 1-28, 1971.
34. Obenchain, R. L. "Classical F-Tests and Confidence Regions for Ridge Regression." Technometrics. 19: 429-439, 1977.
35. Pearson, K. "On Lines and Planes of Closest Fit to Systems of Points in Space." Philosophical Magazine, Series 6. 2: 559-572, 1901.
36. Rao, C. R. Linear Statistical Inference and Its Applications, 2nd ed. New York: John Wiley and Sons, 1970, pp. 294-305.
37. Scheffé, H. The Analysis of Variance. New York: John Wiley and Sons, 1959.
38. Sclove, S. L. "Improved Estimators for Coefficients in Linear Regression." Journal of the American Statistical Association. 63: 597-606, 1968.
39. Searle, S. R. Linear Models. New York: John Wiley and Sons, 1971, pp. 100-116.
40. Silvey, S. D. "Multicollinearity and Imprecise Estimation." Journal of the Royal Statistical Society, Series B. 31: 539-552, 1969.
41. Snee, R. "Some Aspects of Nonorthogonal Data Analysis." Journal of Quality Technology. 5: 67-79, 1973.
42. Stein, C. "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution." Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Berkeley: University of California Press, 1956, pp. 197-206.
43. Stein, C. "Multiple Regression." Contributions to Probability and Statistics. I. Olkin (ed). Stanford: Stanford University Press, 1960, pp. 424-443.
44. Wichern, D. W. and Churchill, G. A. "A Comparison of Ridge Estimators." Technometrics. 20: 301-311, 1978.
45. Yamamura, A. M. "Ridge Regression: An Answer to Multicollinearity." Unpublished Master's Thesis. California State University, Northridge, 1977.
APPENDIX I

PRINCIPAL COMPONENTS REGRESSION

Let X be an (n×p) matrix of n observations on p variables standardized so that X'X is a correlation matrix. Suppose one wishes to find an (n×1) vector which accounts for as much sample variance as possible. An algebraic statement of the problem is the following: find a linear compound

    Y1 = XQ1                                                 (1)

such that the sample variance

    s²_Y1 = Q1'X'XQ1 = Σ_i Σ_j q_i1 q_j1 s_ij                (2)

is maximized subject to Q1'Q1 = 1. For this problem Y1 is (n×1), Q1 is a (p×1) vector, X_i is the ith column of X, and s_ij is the (i,j)th element of X'X, i.e., s_ij is the sample covariance of X_i and X_j.

The solution of the problem is to use the Lagrange multiplier λ1:

    ∂/∂Q1 [Q1'X'XQ1 − λ1(Q1'Q1 − 1)] = 2(X'X − λ1 I)Q1,      (3)

where I is the identity matrix. To maximize, set (3) to zero, which gives the p simultaneous equations

    (X'X − λ1 I)Q1 = 0.                                      (4)

The system of equations given by (4) is solved by choosing λ1 such that

    |X'X − λ1 I| = 0.                                        (5)

It follows that λ1 is an eigenvalue of X'X. Premultiplying (4) by Q1',

    Q1'X'XQ1 − λ1 Q1'Q1 = 0,                                 (6)

and, recalling that Q1'Q1 = 1,

    λ1 = Q1'X'XQ1 = s²_Y1.                                   (7)

Therefore, Q1 is the eigenvector associated with λ1, and the magnitude of the sample variance s²_Y1 is given by λ1. Since the intention is to maximize s²_Y1, λ1 is the largest eigenvalue of X'X.

The first principal component accounts for the maximum possible variance in the observations. Proceeding inductively, the second principal component accounts for as much of the remaining variance as possible. Algebraically, the problem is: find the linear compound

    Y2 = XQ2                                                 (8)
such that the sample variance

    s²_Y2 = Q2'X'XQ2 = Σ_i Σ_j q_i2 q_j2 s_ij                (9)

is maximized subject to the constraints Q2'Q2 = 1 and Q1'Q2 = 0. The first of these is as before a standardization of length, while the second constrains Q2 to be orthogonal to Q1. Again a Lagrangian system is used:

    ∂/∂Q2 [Q2'X'XQ2 − λ2(Q2'Q2 − 1) + λ3 Q1'Q2] = 2(X'X − λ2 I)Q2 + λ3 Q1,    (10)

where λ2 and λ3 are the multipliers. Premultiplying by Q1' and setting to zero,

    2Q1'X'XQ2 + λ3 = 0.                                      (11)

Similar premultiplication of (4) by Q2' implies that

    Q2'X'XQ1 = 0,                                            (12)

and hence λ3 = 0. The second component thus satisfies

    (X'X − λ2 I)Q2 = 0

and is solved by

    |X'X − λ2 I| = 0.                                        (13)

Further, premultiplying (10) by Q2' and recalling that λ3 = 0,

    Q2'X'XQ2 = λ2 = s²_Y2.                                   (14)

Since s²_Y2 is to be maximized subject to the fact that s²_Y1 has already been accounted for by λ1, it follows that λ2 is the second largest eigenvalue of X'X, and that Q2 is the associated eigenvector. The third principal component will similarly be determined by the eigenvector associated with the third largest eigenvalue, and so forth.

Principal components are orthogonal to each other and, beginning with the first component, each accounts for as much as possible of the variance that has not been accounted for by the preceding components. Geometrically, they correspond with the principal axes of the data ellipsoid determined by X.
Principal components regression is linear regression which uses the principal components as independent variables. Let y be an (n×1) vector of observations on a response variable. Then principal components regression can be written as

    α̂ = ((XQ)'(XQ))⁻¹(XQ)'y,                                 (15)

where Q is the (p×p) matrix whose ith column is equal to Q_i, the ith eigenvector of X'X. Another way of writing (15) is

    α̂ = Λ⁻¹Q'X'y,                                            (16)

where Λ is the (p×p) diagonal matrix whose ith diagonal entry is λ_i, the ith eigenvalue of X'X. Note that (16) may be transformed into the ordinary least squares solution by premultiplying it by Q, i.e.,

    β̂ = QΛ⁻¹Q'X'y,                                           (17)

since (X'X)⁻¹ = QΛ⁻¹Q'. It may happen that some of the eigenvalues are very nearly zero and should therefore be eliminated from the model, since they account for almost none of the predictor variance. Remembering that the eigenvalues are in order of decreasing size, Λ can be partitioned as follows:

    Λ = [ Λ_r      0
          0        Λ_{p−r} ],                                (18)

where the last (p−r) eigenvalues are assumed to be zero. We also partition Q:

    Q = [ Q_r   Q_{p−r} ],                                   (19)

where the last (p−r) eigenvectors correspond with the zero-valued eigenvalues.

Since Λ_{p−r} is by assumption a zero matrix, a generalized inverse of (X'X) is

    (X'X)_r⁻¹ = Q_r Λ_r⁻¹ Q_r'                               (20)

(Marquardt, 1970). This may also be written

    (X'X)_r⁻¹ = Σ_{i=1}^{r} (1/λ_i) Q_i Q_i',                (21)

where Q_i is as before the ith column of Q. To see that (21) is correct, recall that q_ij is the jth term of the ith eigenvector. Then

    (X'X)_r⁻¹ = QΛ⁻¹Q',                                      (22)

where Q here denotes the (p×r) matrix whose columns are Q_1, ..., Q_r and Λ⁻¹ = diag(1/λ_1, ..., 1/λ_r). Carrying out the multiplication, the (j,m)th element of (X'X)_r⁻¹ is

    Σ_{i=1}^{r} q_ij q_im (1/λ_i),                           (23)

which is exactly the (j,m)th element of

    Σ_{i=1}^{r} (1/λ_i) Q_i Q_i'.                            (24)
The estimator based on the first r eigenvalues would thus be

    β̂_r = Q_r Λ_r⁻¹ Q_r' X'y.                                (25)

In practice, it would not be likely that the last (p−r) eigenvalues would be precisely zero. This means that some method of deciding whether or not a particular eigenvalue is zero is needed. Specifically, some constant c should be set so that X'X may be said to be of rank r if the first r components account for all but some small fraction of the variance, i.e., if

    (Σ_{i=1}^{r} λ_i) / (Σ_{i=1}^{p} λ_i) ≥ 1 − c.           (26)

The advantage to (25) as an estimator is chiefly its simplicity. Since the eigenvectors are all orthogonal to each other, they can be eliminated from the model through a simple stepwise procedure. Also, (26) facilitates the decision of how many variables should be dropped.
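A compact numerical sketch of the estimator (25) together with the rank decision (26) follows. It assumes X has been standardized so that X'X is a correlation matrix; the function name and the default value of c are illustrative assumptions.

```python
import numpy as np

def pcr_estimator(X, y, c=0.01):
    """Principal components regression keeping the first r components, equations (25)-(26)."""
    lam, Q = np.linalg.eigh(X.T @ X)
    order = np.argsort(lam)[::-1]                  # eigenvalues in decreasing order
    lam, Q = lam[order], Q[:, order]
    frac = np.cumsum(lam) / np.sum(lam)
    r = int(np.searchsorted(frac, 1.0 - c) + 1)    # smallest r whose cumulative share is at least 1 - c
    Qr, lam_r = Q[:, :r], lam[:r]
    return Qr @ np.diag(1.0 / lam_r) @ Qr.T @ X.T @ y, r
```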
Marquardt (1970) considers the possibility of using "fractional ranks." His suggestion is to eliminate those eigenvalues which are definitely considered to be zero and to reduce the inflation caused by inverting the smaller nonzero eigenvalues by adding a small constant to the denominator of each term in (21).

To detail how this works, suppose that the rank of X'X is considered to be greater than r but less than r+1. Specifically, suppose it is set to r+f, where f is some number between zero and one. Marquardt's suggested estimator is

    β̂_{r+f} = [Σ_{i=1}^{r} (1/(λ_i + k)) Q_i Q_i'] X'y,      (27)

where, from (21) and (26),

    Σ_{i=1}^{r} λ_i + fλ_{r+1} = Σ_{i=1}^{r} (λ_i + k),      (28)

or, equivalently,

    k = fλ_{r+1}/r.                                          (29)

Marquardt's fractional rank estimator may be seen as a compromise between principal components regression and ordinary ridge regression, with part of the effects of multicollinearity being eliminated through one technique and part through the other. All the same, it is not clear, despite (28), that the concept of "fractional ranks" is in itself particularly meaningful. To illustrate this, consider the case where f is approximately equal to one. The result could very well be two dissimilar estimators for models whose rank was in theory almost identical, using (27) for one estimate and (25) for the other.
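The fractional rank variant only changes the inverse eigenvalues used in the estimator. A short sketch under the reading of (27)-(29) given above (k = fλ_{r+1}/r added to each retained eigenvalue); the function name and inputs are assumptions made for illustration.

```python
import numpy as np

def fractional_rank_estimator(X, y, r, f):
    """Fractional rank estimator of assumed rank r + f, following equations (27)-(29)."""
    lam, Q = np.linalg.eigh(X.T @ X)
    order = np.argsort(lam)[::-1]
    lam, Q = lam[order], Q[:, order]
    k = f * lam[r] / r                              # equation (29); lam[r] is the (r+1)th eigenvalue
    Qr, lam_r = Q[:, :r], lam[:r]
    return Qr @ np.diag(1.0 / (lam_r + k)) @ Qr.T @ X.T @ y
```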
APPENDIX II

RIDGE ANALYSIS AND RIDGE REGRESSION

A frequently occurring problem in multivariate analysis is to find the stationary values of a function f(x_1,...,x_p) of p variables x_1,...,x_p subject to restrictions on the x_i such as g_j(x_1,...,x_p) = 0 (j = 1,...,n). The method by which the problem is solved is that of Lagrange multipliers. Letting

    F = f − Σ_j λ_j g_j,                                     (1)

where λ_1,...,λ_n are unknowns, differentiate (1) with respect to each x_i and set the results equal to zero. This yields the p equations

    ∂F/∂x_i = ∂f/∂x_i − Σ_j λ_j ∂g_j/∂x_i = 0   (i = 1,...,p).    (2)

Additionally,

    g_j = 0   (j = 1,...,n).                                 (3)

A stationary point is a solution of (2) and (3) after eliminating the λ_j.
Let

    M(x) = M(x_1,...,x_p) = [∂²F/∂x_i∂x_j]                   (4)

be the matrix of second order partial derivatives, and let M(a) = M(a_1,...,a_p) be the resulting matrix after the solution (a_1,...,a_p) has been substituted into (4). Then if M(a) is
(a) positive definite, i.e., y'M(a)y > 0,
(b) negative definite, i.e., y'M(a)y < 0,
where y' = (y_1,...,y_p) is any (1×p) real vector, the function f(x_1,...,x_p) achieves
(a) a local minimum at x = a,
(b) a local maximum at x = a,
respectively.

To see this, suppose F is expanded as a Taylor series of partial derivatives about a and that h represents a vector (h_1,...,h_p)' of increments h_i. Then, recalling that the first partial derivatives are zero at x = a,

    F(a + h) = F(a) + ½h'M(a)h + o(h³),                      (5)

where o(h³) represents the higher order terms of the series. It is a feature of Taylor series that the effects of the higher order terms can be made arbitrarily small if the increments h_i are small enough. Since the first order term is equal to zero, that means that the only term to be considered is the second order term ½h'M(a)h, which is greater than zero if M(a) is positive definite. This implies that, for all small h, F(a + h) > F(a). Therefore, if h is such that the restrictions g_j = 0 are fulfilled,

    f(a + h) > f(a),                                         (6)

and f(a) is a local minimum. Similar reasoning holds for the case where M(a) is negative definite.
To apply this to quadratic surfaces, consider the second order response surface in p variables b_1,...,b_p given by

    y = b_0 + Σ_i s_i b_i + Σ_i s_ii b_i² + Σ_{i<j} s_ij b_i b_j ,        (7)

where the point (0,...,0) is the origin of measurement for b_1,...,b_p.  The problem is to find the stationary points of y on a sphere centered on the origin and having radius R, in other words, to find the stationary points of y subject to

    b_1² + ... + b_p² - R² = 0.                                           (8)
Using the method of Lagrange multipliers, set F = y - λg, where λ is the multiplier.  Then, taking the first derivatives with respect to the b_i, rearranging, and dividing by 2, this gives (see equation (2))

    (s_11 - λ)b_1 + ½s_12 b_2 + ... + ½s_1p b_p = -½s_1
    ½s_12 b_1 + (s_22 - λ)b_2 + ... + ½s_2p b_p = -½s_2                   (9)
      . . . . . . . .
    ½s_1p b_1 + ½s_2p b_2 + ... + (s_pp - λ)b_p = -½s_p

or, in matrix notation,

    (S - λI)B = -½s,                                                      (10)

where

    S = [  s_11   ½s_12  ...  ½s_1p ]
        [ ½s_12    s_22  ...  ½s_2p ]                                     (11)
        [   . . . . . . . .         ]
        [ ½s_1p   ½s_2p  ...   s_pp ] ,

    s = (s_1, s_2, ..., s_p)' ,   B = (b_1, b_2, ..., b_p)'.
One method of finding B would be to solve (8) and (9) for b_1,...,b_p, and λ.  Sometimes, however, the value of R is not of great importance, in which case it is possible to regard R as variable and λ as fixed.  Thus λ can be inserted directly into (9), which can then be solved for the b_i's, after which R can be computed from (8).
Suppose that, for some R, there is a multiplier λ* which gives as a solution to (9) the vector B*, and that λ* < d_i for all i, where the d_i are the eigenvalues of S.  Eigenvalues are the roots of the equation

    |S - dI| = 0,                                                         (12)

and, if λ* < d_i for all i, then, for an arbitrary (p×1) vector z,

    z'M(B*)z = z'(S - λ*I)z = z'Sz - λ*z'z ≥ z'z(d_s - λ*) > 0,           (13)

where d_s is the smallest of the d_i, so that d_s - λ* > 0.  Therefore, M(B*) is positive definite and y = f(B*) is a local minimum.
To derive ridge regression from this, let

    X = [ x_11  x_12  ...  x_1p ]      Y = [ y_1 ]      B = [ b_1 ]
        [ x_21  x_22  ...  x_2p ]          [ y_2 ]          [ b_2 ]       (14)
        [  . . . . . . .        ]          [ ... ]          [ ... ]
        [ x_n1  x_n2  ...  x_np ] ,        [ y_n ] ,        [ b_p ] ,

where the x_ij and y_i are known, and the b_i are to be determined.
The problem is to minimize

    (Y - XB)'(Y - XB) = Σ_i (y_i - b_1 x_i1 - ... - b_p x_ip)²

        = Σ_i y_i² - 2b_1 Σ_i x_i1 y_i - ... - 2b_p Σ_i x_ip y_i
          + b_1² Σ_i x_i1² + ... + b_p² Σ_i x_ip²
          + 2b_1 b_2 Σ_i x_i1 x_i2 + ... + 2b_{p-1} b_p Σ_i x_i(p-1) x_ip (15)

subject to

    b_1² + ... + b_p² - R² = 0.                                           (16)

If Σ_i x_ij y_i = s_j and Σ_i x_ij x_ik = s_jk, then (15) is the same as (7), except that the cross-product terms s_jk b_j b_k and the linear terms s_j b_j enter with twice the coefficient, the linear terms with the opposite sign.  The solution is, therefore, as in (9):
    (Σ_i x_i1² - λ)b_1 + Σ_i x_i1 x_i2 b_2 + ... + Σ_i x_i1 x_ip b_p = Σ_i x_i1 y_i
    Σ_i x_i1 x_i2 b_1 + (Σ_i x_i2² - λ)b_2 + ... + Σ_i x_i2 x_ip b_p = Σ_i x_i2 y_i   (17)
      . . . . . . . .
    Σ_i x_i1 x_ip b_1 + Σ_i x_i2 x_ip b_2 + ... + (Σ_i x_ip² - λ)b_p = Σ_i x_ip y_i

or, in matrix notation,

    (X'X - λI)B = X'y,                                                    (18)

where

    X'X = [ Σx_i1²       Σx_i1 x_i2  ...  Σx_i1 x_ip ]
          [ Σx_i1 x_i2   Σx_i2²      ...  Σx_i2 x_ip ]
          [   . . . . . . . .                        ]
          [ Σx_i1 x_ip   Σx_i2 x_ip  ...  Σx_ip²     ] ,

    B = (b_1, b_2, ..., b_p)' ,   X'y = (Σx_i1 y_i, Σx_i2 y_i, ..., Σx_ip y_i)'.
Since the smallest eigenvalue of X'X is at least zero, (X'X - λI) will be positive definite whenever λ < 0, so (Y - XB)'(Y - XB) will be a minimum for a given Σb_i².  Thus the estimator which solves (15) and (16) is

    B* = (X'X - λI)^{-1}X'y,   λ < 0,                                     (19)

which, with k = -λ > 0, is the ridge estimator (X'X + kI)^{-1}X'y.  This same estimator also minimizes Σb_i² for a given (Y - XB)'(Y - XB).  To show this briefly, let B̂ be the least squares estimator.  Then

    (Y - XB)'(Y - XB) = (Y - XB̂)'(Y - XB̂) + (B - B̂)'X'X(B - B̂),           (20)
where B is the estimator whose individual terms are the b_i for which Σb_i² is to be minimized.  Using Lagrangian multipliers as before,

    F = B'B + (1/k)[(Y - XB̂)'(Y - XB̂) + (B - B̂)'X'X(B - B̂) - S],          (21)

where 1/k is the multiplier and S is the residual sum of squares (Y - XB)'(Y - XB) of B.  Differentiating with respect to B,

    ∂F/∂B = 2B + (2/k)X'X(B - B̂) = 0,                                     (22)

and then

    kB + X'X(B - B̂) = 0,                                                  (23)

or

    (X'X + kI)B = X'XB̂.                                                   (24)

Equivalently, since X'XB̂ = X'y,

    (X'X + kI)B = X'y,                                                    (25)

or, finally,

    B = (X'X + kI)^{-1}X'y,                                               (26)

which is the same result as for the first Lagrangian problem.  Since the residual sum of squares and the estimator norm are continuous, it follows that the minimum of either one, for a given value of the other, is a monotonically decreasing function of the other.
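A minimal numerical sketch of (19) follows, worked for an invented two-predictor example (the numbers are illustrative, not the Marquardt data of Appendix V).  For each k it reports the residual sum of squares and Σb_i², so the trade-off noted above can be seen directly.

      program ridge_sketch
!     Sketch of equation (19): B* = (X'X + kI)**(-1) X'y, for an invented
!     two-predictor example.  For each k the program reports the residual
!     sum of squares and the squared length of the estimator.
      implicit none
      integer, parameter :: n = 5, p = 2
      real :: x(n,p), y(n), xtx(p,p), xty(p), b(p)
      real :: k, det, rss, norm2
      integer :: i, j, step

      x(:,1) = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)
      x(:,2) = (/ 1.1, 1.9, 3.2, 3.9, 5.1 /)
      y      = (/ 2.1, 3.9, 6.2, 7.8, 10.1 /)

!     Form X'X and X'y.
      do i = 1, p
         do j = 1, p
            xtx(i,j) = sum(x(:,i)*x(:,j))
         end do
         xty(i) = sum(x(:,i)*y)
      end do

      do step = 0, 4
         k = 0.05*step
!        Solve (X'X + kI) b = X'y for the 2x2 case by Cramer's rule.
         det  = (xtx(1,1) + k)*(xtx(2,2) + k) - xtx(1,2)*xtx(2,1)
         b(1) = ((xtx(2,2) + k)*xty(1) - xtx(1,2)*xty(2))/det
         b(2) = ((xtx(1,1) + k)*xty(2) - xtx(2,1)*xty(1))/det
         rss   = sum((y - matmul(x,b))**2)
         norm2 = sum(b**2)
         print '(a,f5.2,a,2f9.4,a,f9.5,a,f9.5)', &
               ' k =', k, '  b =', b, '  RSS =', rss, '  sum b(i)**2 =', norm2
      end do
      end program ridge_sketch

As k grows, the printed residual sum of squares rises while Σb_i² falls, which is the monotone relation between the two quantities described above.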
APPENDIX III
A DERIVATION OF THE MEAN SQUARE ERROR
ESTIMATOR FOR RIDGE REGRESSION
Consider an (n×p) matrix X of n observations on p variables and a vector y of n observations on one variable.  Rao (1970) gives the following result: using the model

    y = Xβ + e,                                                           (1)

where β is a (p×1) vector of unknown parameters and where E(ee') = σ²I_n, suppose L'y is an estimator of r'β, where r is (p×1).  Then

    E(L'y - r'β)² = σ²L'L + ((X'L - r)'β)².                               (2)
Let r be the ith column of I_p, the (p×p) identity matrix, and let L be the ith column of the matrix X(X'X + kI)^{-1}.  Then

    E(L'y - β_i)² = σ²(X(X'X + kI)^{-1})_i'(X(X'X + kI)^{-1})_i
                  + ((X'(X(X'X + kI)^{-1})_i - r)'β)(β'(X'(X(X'X + kI)^{-1})_i - r)).   (3)

Since (X'(X(X'X + kI)^{-1})_i - r)'β is a scalar, this is equivalent to
    E(β_i* - β_i)² = σ²(X(X'X + kI)^{-1})_i'(X(X'X + kI)^{-1})_i
                   + β'((X'X(X'X + kI)^{-1})_i - r)((X'X(X'X + kI)^{-1})_i - r)'β,       (4)

where β_i* = L'y.  Equation (4) is the ith term of the mean square error of β*.  For the total mean square error,

    E(β* - β)'(β* - β) = σ² trace[(X'X + kI)^{-1}X'X(X'X + kI)^{-1}]
                       + Σ_i β'((X'X(X'X + kI)^{-1})_i - r)((X'X(X'X + kI)^{-1})_i - r)'β
                       = σ² Σ_i λ_i/(λ_i + k)² + k²β'(X'X + kI)^{-2}β,                   (5)
where λ_i is the ith eigenvalue of X'X.  Similarly,

    E(β* - β)'X'X(β* - β) = E(Xβ* - Xβ)'(Xβ* - Xβ)
        = σ² trace[X'X(X'X + kI)^{-1}X'X(X'X + kI)^{-1}]
          + β'(X'X(X'X + kI)^{-1} - I)X'X(X'X(X'X + kI)^{-1} - I)β
        = σ² Σ_i λ_i²/(λ_i + k)² + k²β'X'X(X'X + kI)^{-2}β.                              (6)
Suppose Λ is a (p×p) matrix whose ith diagonal entry is λ_i and whose off-diagonal entries are zero, and let Q be the matrix of eigenvectors such that Q'X'XQ = Λ.  Define α̂ as

    α̂ = Q'β̂.                                                              (7)

Then the expectation of α̂ is E(α̂) = Q'E(β̂) = Q'β = α, and the variance-covariance matrix of α̂ is σ²Q'(X'X)^{-1}Q = σ²Λ^{-1}, which means that E(α̂_i) = α_i and Var(α̂_i) = σ²/λ_i, where α̂_i, α_i, and λ_i are the ith elements of α̂, α (= E(α̂)), and Λ, respectively.  Transforming (5) by Q gives

    E(β* - β)'(β* - β) = Σ_i (σ²λ_i + k²α_i²)/(λ_i + k)².                 (8)

Similarly,

    E(β* - β)'X'X(β* - β) = Σ_i (σ²λ_i² + k²λ_iα_i²)/(λ_i + k)².          (9)
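A short numerical sketch of (8) and (9) follows.  The λ, α, and σ̂² values are those quoted for the Marquardt example in Appendix V (λ = 1.98, .02; α = 85/33, 5; σ̂² = 12/33), used here purely for illustration.

      program mse_sketch
!     Sketch of the canonical-form mean square errors (8) and (9).
!     lambda, alpha, and s2 follow the Marquardt example of Appendix V,
!     used for illustration only.
      implicit none
      integer, parameter :: p = 2
      real :: lambda(p) = (/ 1.98, 0.02 /)
      real :: alpha(p)  = (/ 85.0/33.0, 5.0 /)
      real :: s2, k, el1, el2
      integer :: i, step

      s2 = 12.0/33.0
      do step = 0, 6
         k = 0.01*step
         el1 = 0.0
         el2 = 0.0
         do i = 1, p
            el1 = el1 + (s2*lambda(i) + k**2*alpha(i)**2)/(lambda(i) + k)**2
            el2 = el2 + (s2*lambda(i)**2 + k**2*lambda(i)*alpha(i)**2)/(lambda(i) + k)**2
         end do
         print '(a,f5.2,a,f10.4,a,f10.4)', ' k =', k, '  E(L1**2) =', el1, '  E(L2**2) =', el2
      end do
      end program mse_sketch

At k = 0 the program reproduces the least-squares value σ²Σ(1/λ_i); for small positive k the first sum drops well below that value, which is the improvement ridge regression offers when an eigenvalue is near zero.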
APPENDIX IV
HEMMERLE'S DERIVATION OF THE EXPLICIT SOLUTION
Source:  William J. Hemmerle, "An Explicit Solution for Generalized Ridge Regression," Technometrics 17 (1), 1975, pp. 309-314.
Consider the linear estimator defined by

    (1)

where X is an (n×p) matrix of independent variables, y is an (n×1) vector of response observations, Λ is the diagonal matrix of eigenvalues of X'X, Q is the matrix of eigenvectors of X'X such that Q'X'XQ = Λ, and σ̂² is the estimated variance.  For convenience, let

    B = diag((Q'X'y)_1, (Q'X'y)_2, ..., (Q'X'y)_p)                        (2)

and

    A_j = diag(α*_1(j), α*_2(j), ..., α*_p(j)),                           (3)
where α*_i(j) is the jth iteration on α*_i of the Hoerl-Kennard process described in Chapter 3.  If the least-squares estimate can be considered to be the 0th iteration, then

    A_0 = Λ^{-1}B.                                                        (4)

Further, the (j+1)st step in the iterative procedure may be described as

    A_{j+1} = (Λ + σ̂²A_j^{-2})^{-1}B,                                     (5)

or

    A_{j+1} = (I + σ̂²Λ^{-1}A_j^{-2})^{-1}Λ^{-1}B.                         (6)

Now (6) can be written as

    A_{j+1} = (I + σ̂²Λ^{-1}A_j^{-2})^{-1}A_0,                             (7)

and then

    A_{j+1}^{-1} = A_0^{-1}(I + σ̂²Λ^{-1}A_j^{-2}).                        (8)

Letting

    D = σ̂^{-2}Λ                                                           (9)

gives

    A_{j+1}^{-1} = A_0^{-1}(I + D^{-1}A_j^{-2}),                          (10)
and an expression for A_{j+1}^{-2} is given by

    A_{j+1}^{-2} = [A_0^{-1}(I + D^{-1}A_j^{-2})]²                        (11)

                 = (I + D^{-1}A_j^{-2})²A_0^{-2},                         (12)

since the matrices involved are all diagonal and commute.  Thus

    A_{j+1}^{-2} = A_0^{-2}(I + D^{-1}A_j^{-2})².                         (13)

Multiplying both sides of (13) by D^{-1} gives

    D^{-1}A_{j+1}^{-2} = D^{-1}A_0^{-2}(I + D^{-1}A_j^{-2})²,             (14)

so that if

    E_j = D^{-1}A_j^{-2},                                                 (15)

the iterative procedure is reduced to the simple formula

    E_{j+1} = E_0(I + E_j)².                                              (16)
Assume that α̂_i ≠ 0 for all i and that the iterative procedure is convergent such that

    lim E_j = E*.                                                         (17)

From (16) and (17),

    E* = E_0(I + E*)²,                                                    (18)

or

    E_0(E*)² + (2E_0 - I)E* + E_0 = 0.                                    (19)

Now (19) consists of p equations of the form

    (e*)² + (2 - 1/e_0)e* + 1 = 0,                                        (20)

where e_0 and e* are scalars.  Solving (20) for e*,

    e* = [(1 - 2e_0) ± √(1 - 4e_0)]/(2e_0).                               (21)
To show whether the plus or minus sign should be selected, first note that (16) consists, like (19), of p separate expressions of the form

    e_{j+1} = e_0(1 + e_j)²,                                              (22)

where e_0, e_j, and e_{j+1} are scalars and the subscript j is used to denote the jth iterate.

To show that the procedure defined by (22) converges for e_0 = ¼, observe that for e_0 > 0,

    e_1 = e_0(1 + e_0)² > e_0 ,                                           (23)

    e_0(1 + x)² is increasing in x for x ≥ 0,                             (24)

and therefore, by induction,

    e_{j+1} ≥ e_j for all j.                                              (25)
For e_0 = ¼, let

    g_j = 1 - 3/2^{j+2},                                                  (26)

so that g_0 = e_0 = ¼.  Then, whenever e_j ≤ g_j < 1,

    √e_{j+1} = √e_0 (1 + e_j) ≤ √e_0 (1 + g_j)                            (27)

and

    √e_0 (1 + g_j) = ½(2 - 3/2^{j+2}) = 1 - 3/2^{j+3} = g_{j+1},          (28)

so that e_{j+1} ≤ √e_{j+1} ≤ g_{j+1}.  Consequently,

    e_j ≤ g_j < 1 for all j, when e_0 = ¼.                                (29)

Combined with (25), this gives

    e_0 ≤ e_1 ≤ ... ≤ e_j ≤ ... < 1,                                      (30)

a monotonically increasing sequence of real numbers bounded from above, so the procedure converges.  From (21),

    e* = lim e_j = 1 for e_0 = ¼.                                         (31)
To extend this, suppose that the procedure defined by (22) converges for e_0, and that 0 < e'_0 ≤ e_0.  Then for the sequence (e'_0, e'_1, ..., e'_j, ...) we have e'_j ≤ e_j, so that this sequence converges to a limit (e*)' ≤ e*.  From this fact it may be seen that the iterative procedure converges for 0 < e_0 ≤ ¼.
At this point it is necessary to choose between the plus and minus signs in (21).  Note that for 0 < e_0 < ¼,

    [(1 - 2e_0) + √(1 - 4e_0)]/(2e_0) > (1 - 2e_0)/(2e_0) > 1,

which contradicts the fact that e* ≤ 1.  Consequently the iterative procedure converges, whenever 0 < e_0 ≤ ¼, to

    e* = [(1 - 2e_0) - √(1 - 4e_0)]/(2e_0).                               (32)
To show the equivalence of Hemmerle's results to those of Chapter 3, note from (9) and (15) that λ_i e_i(j) = k_i(j), where the subscript j indicates the jth iteration.  Then, from (32), with e_i(0) = σ̂²λ_i/(Q'X'y)_i²,

    λ_i e*_i = λ_i[(1 - 2e_i(0)) - √(1 - 4e_i(0))]/(2e_i(0))

             = [(Q'X'y)_i² - 2λ_iσ̂² - (Q'X'y)_i √((Q'X'y)_i² - 4λ_iσ̂²)]/(2σ̂²),   (33)

which is the same as (3.26).
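The fixed-point relation can be checked numerically.  The sketch below iterates the scalar recursion (22) and compares the result with the closed form (32); the value e_0 = 0.10 is an arbitrary illustrative choice in (0, ¼).

      program hemmerle_sketch
!     Sketch: the scalar iteration e(j+1) = e0*(1 + e(j))**2 of (22) against
!     the closed-form limit (32).  e0 = 0.10 is illustrative only.
      implicit none
      real :: e0, e, estar
      integer :: j

      e0 = 0.10
      estar = ((1.0 - 2.0*e0) - sqrt(1.0 - 4.0*e0))/(2.0*e0)

      e = e0
      do j = 1, 30
         e = e0*(1.0 + e)**2
      end do

      print *, 'iterated value =', e
      print *, 'explicit limit =', estar
      end program hemmerle_sketch

Both numbers agree to printing accuracy, as the derivation above requires for any e_0 not exceeding one quarter.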
APPENDIX V
THE SOLUTION OF MALLOWS
Let Q be the matrix of eigenvectors of X'X such that Q'X'XQ = Λ, the full rank matrix of eigenvalues, and let z = Q'X'y.  The ith term of z is denoted by z_i.

The ordinary ridge solution of Mallows (1973) is that k which minimizes

    (1)

where σ̂² is the residual mean square error and p is the rank of Λ.  To minimize C_k*, differentiate it and set it to zero:

    (2)

from which Mallows gets as a solution

    (3)

and then

    (4)

Due to an inability to derive (3) from (2), it was decided to test (4) on some data to see whether it yields
an optimal solution.  The data used are from the example of Marquardt (1970).  For these data the eigenvalues of X'X are λ_1 = 1.98 and λ_2 = .02, the eigenvector matrix is

    Q = [ 1/√2   1/√2 ]
        [ 1/√2  -1/√2 ] ,

so that z = Q'X'y = (5.1, .1)', the canonical least-squares coefficients are α̂_i = z_i/λ_i, giving α̂ = (85/33, 5)', and

    σ̂² = y'y - α̂'Q'X'y = 12/33.

A small Fortran program was written to test Mallows' solution.  The first result, labeled 'MALLOWS SOLUTION', is Mallows' optimum from equation (4).
[Listing of the Fortran program and its output: for each trial value of k the program prints 'K =', 'CK =', and 'DERIVATIVE =', in three groups labelled MALLOWS SOLUTION, NEWTON-RAPHSON ITERATION, and RIDGE TRACE.]

The second set of
results, labelled 'NEWTON-RAPHSON ITERATION', is from a Newton-Raphson iteration based on

    k_{j+1} = k_j - (dC_{k_j}*/dk)/(d²C_{k_j}*/dk²),                      (5)

where the subscript denotes the number of iterations.

The third set of numbers, labelled 'RIDGE TRACE', gives the results of a ridge trace beginning with k = .005 and advancing through increments of .005 to k = .5.  Listed are the k ('K = (value)'), the C_k* statistic ('CK = (value)'), and the derivative of the C_k* statistic ('DERIVATIVE = (value)').

Mallows' theoretical optimum gives a k of .02353, for which C_k* = 3.36216 and dC_k*/dk = -2.93363.

The Newton-Raphson iteration gives a k of .05428, for which C_k* = 3.24510 and dC_k*/dk = .00000.

The ridge trace indicates a minimum somewhere between .05000 and .05500, thus agreeing with the Newton-Raphson iteration.  No other local minimum is in evidence.

Under these circumstances, it is hard to see how equation (4) can be correct.
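The numerical approach itself is easy to sketch.  The program below scans a grid of k values for a rough minimum of a criterion C(k) and then polishes the result with Newton-Raphson applied to finite-difference derivatives.  It is not a reconstruction of the thesis's program: the criterion coded in crit() is only a stand-in (the E(L1²) expression of Appendix III with the Marquardt-style λ, α, and σ̂²), and Mallows' C_k* or any other smooth criterion could be substituted for it.

      program minimize_sketch
!     Sketch: grid scan followed by Newton-Raphson on a finite-difference
!     derivative.  crit() is a stand-in criterion, not Mallows' Ck*.
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      real(dp) :: k, kbest, cbest, c, d1, d2, h
      integer :: i, it

!     Coarse scan over k = .005, .010, ..., .500, as in the ridge trace.
      kbest = 0.005_dp
      cbest = crit(kbest)
      do i = 2, 100
         k = 0.005_dp*i
         c = crit(k)
         if (c < cbest) then
            cbest = c
            kbest = k
         end if
      end do

!     Newton-Raphson on the first derivative (central differences).
      h = 1.0e-4_dp
      k = kbest
      do it = 1, 10
         d1 = (crit(k + h) - crit(k - h))/(2.0_dp*h)
         d2 = (crit(k + h) - 2.0_dp*crit(k) + crit(k - h))/h**2
         k  = k - d1/d2
         if (k <= 0.0_dp) k = kbest
      end do

      print *, 'grid minimum near k  =', kbest
      print *, 'refined minimum at k =', k, '  C(k) =', crit(k)

      contains

      real(dp) function crit(kk)
!     Illustrative criterion: E(L1**2) of Appendix III.
      real(dp), intent(in) :: kk
      real(dp) :: lambda(2), alpha(2), s2
      integer :: j
      lambda = (/ 1.98_dp, 0.02_dp /)
      alpha  = (/ 85.0_dp/33.0_dp, 5.0_dp /)
      s2     = 12.0_dp/33.0_dp
      crit = 0.0_dp
      do j = 1, 2
         crit = crit + (s2*lambda(j) + kk**2*alpha(j)**2)/(lambda(j) + kk)**2
      end do
      end function crit
      end program minimize_sketch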
APPENDIX VI
EXAMPLES OF ORDINARY RIDGE SOLUTIONS
The artificial data of Marquardt (1970), given in Appendix V, can be used to illustrate some of the ridge estimators discussed in Chapter 4.  Included here are the Hoerl-Kennard conservative estimate (1970A), the solution of Hoerl, Kennard, and Baldwin (1975), the Hoerl-Kennard iterated estimate (1976), the Hocking, Speed, and Lynn solution (1977), an iterated form of the Hocking, Speed, and Lynn solution, and the solutions of Lawless and Wang (1976) and McDonald and Galarneau (1975).

Four of the estimates can be given in direct algebraic form.  The Hoerl-Kennard conservative estimate is

    k = σ̂²/α̂_max²,                                                        (1)

where α̂_max is the largest of the canonical least-squares coefficients.  Substituting the data of Marquardt (see Appendix V),

    k = (12/33)/(5)² = .014545,

from which
    α* = [ λ_1/(λ_1+k)        0        ] α̂
         [      0        λ_2/(λ_2+k)   ]

       = [ 1.98/1.994545       0       ] [ 85/33 ]  =  [ 2.55697 ]
         [      0          .02/.034545 ] [   5   ]     [ 2.89474 ]

and

    β* = Qα* = [ 1/√2   1/√2 ] [ 2.55697 ]  =  [  3.85494 ]
               [ 1/√2  -1/√2 ] [ 2.89474 ]     [ -0.23883 ] .
The estimator of Hoerl, Kennard, and Baldwin is

    k = pσ̂²/α̂'α̂                                                           (2)

      = 2(12/33)/[(85/33)² + (5)²] = .022990,

from which

    α* = [ 2.54619 ] ,   β* = [ 3.44525 ] .
         [ 2.32613 ]          [ 0.15561 ]
The Hocking, Speed, and Lynn estimator is

    (3)

which, for these data, gives k = .054751.  This yields

    α* = [ 2.50645 ] ,   β* = [ 2.71827 ] .
         [ 1.33777 ]          [ 0.82638 ]
The Lawless-Wang estimator is

    k = pσ̂²/Σλ_iα̂_i² = 2(12/33)/[(1.98)(85/33)² + (.02)(5)²] = .05333,

giving

    β* = [ 2.73780 ] .
         [ 0.80932 ]
The other three estimators are iterative and
cannot be given in direct algebraic form.
The following
three pages list a Fortran program which calculates the
seven ridge solutions mentioned in this appendix and give
the results of the run.
[Fortran program listing and output: for each of the seven rules the run prints the chosen k and the corresponding α* and β*, labelled HOERL-KENNARD CONSERVATIVE, followed by the Hoerl-Kennard-Baldwin, iterated Hoerl-Kennard, Hocking-Speed-Lynn, iterated Hocking-Speed-Lynn, Lawless-Wang, and McDonald-Galarneau estimates.]
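As a cross-check on the closed-form rules above, the following sketch (not a reconstruction of the thesis's program) computes k, α*, and β* for the Hoerl-Kennard conservative, Hoerl-Kennard-Baldwin, and Lawless-Wang rules from the canonical quantities given for the Marquardt data in Appendix V (λ = 1.98, .02; α̂ = 85/33, 5; σ̂² = 12/33; p = 2) and the eigenvector matrix Q shown earlier in this appendix; the iterated rules are omitted.

      program ordinary_ridge_sketch
!     Sketch: three closed-form choices of k for the Marquardt example,
!     together with the resulting canonical and original-coordinate
!     coefficients.
      implicit none
      integer, parameter :: p = 2
      real :: lambda(p), ahat(p), q(p,p), astar(p), bstar(p)
      real :: s2, k(3)
      integer :: m
      character(len=24) :: label(3)

      lambda = (/ 1.98, 0.02 /)
      ahat   = (/ 85.0/33.0, 5.0 /)
      s2     = 12.0/33.0
      q      = reshape( (/ 1.0, 1.0, 1.0, -1.0 /)/sqrt(2.0), (/ p, p /) )

      label(1) = 'Hoerl-Kennard (1970A)'
      label(2) = 'Hoerl-Kennard-Baldwin'
      label(3) = 'Lawless-Wang'
      k(1) = s2/maxval(ahat)**2                 ! conservative estimate
      k(2) = p*s2/sum(ahat**2)                  ! Hoerl-Kennard-Baldwin
      k(3) = p*s2/sum(lambda*ahat**2)           ! Lawless-Wang

      do m = 1, 3
         astar = lambda/(lambda + k(m))*ahat    ! canonical ridge coefficients
         bstar = matmul(q, astar)               ! back to original coordinates
         print '(1x,a,a,f9.6)', label(m), ' k =', k(m)
         print '(1x,a,2f10.5)', '   alpha* =', astar
         print '(1x,a,2f10.5)', '   beta*  =', bstar
      end do
      end program ordinary_ridge_sketch

The printed values agree with the algebraic substitutions given earlier in this appendix.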
APPENDIX VII
THE SIMULATION STUDY OF WICHERN AND CHURCHILL

The Method
The Wichern-Churchill (1978) simulation study of ordinary ridge estimators uses the method of McDonald and Galarneau (1975).  It assumes a linear model

    y = Xβ + e,                                                           (1)

where X is an (n×p) matrix of observations on the predictor variables, y is an (n×1) vector of observations on the response variable, β is a (p×1) vector of unknown regression coefficients, and e is an (n×1) vector of errors such that

    e_i ~ N(0, σ²),   i = 1,...,n.                                        (2)
For the purposes of simulation, x_ij, the (i,j)th element of X, is computed as

    x_ij = (1 - a_j²)^{1/2} z_ij + a_j z_i(p+1),                          (3)

where z_ij and z_i(p+1) are independent pseudo N(0,1) variates, and where a_j is a constant.  Note that, if 0 < a_j < 1, then for each variable of X, say X_k,

    Var(X_k) = (1 - a_k²) + a_k² = 1.                                     (4)

Further, for X_k and some other variable, say X_m,

    Cov(X_k, X_m) = a_k a_m.                                              (5)

It follows that a_k a_m is the theoretical correlation between X_k and X_m.
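A sketch of the construction in (3) follows.  The sample size, the a values, and the use of a Box-Muller normal generator are illustrative choices, not taken from the Wichern-Churchill study.

      program collinear_sketch
!     Sketch of equation (3): x(i,j) = sqrt(1 - a(j)**2)*z(i,j) + a(j)*z(i,p+1),
!     so the theoretical correlation between columns j and m is a(j)*a(m).
      implicit none
      integer, parameter :: n = 30, p = 2
      real, parameter :: pi = 3.1415927
      real :: z(n,p+1), x(n,p), a(p), u1, u2, r, m1, m2
      integer :: i, j

      a = (/ 0.99, 0.99 /)

!     Pseudo N(0,1) variates via the Box-Muller transform.
      do i = 1, n
         do j = 1, p + 1
            call random_number(u1)
            call random_number(u2)
            u1 = max(u1, 1.0e-10)
            z(i,j) = sqrt(-2.0*log(u1))*cos(2.0*pi*u2)
         end do
      end do

!     The construction of equation (3).
      do j = 1, p
         x(:,j) = sqrt(1.0 - a(j)**2)*z(:,j) + a(j)*z(:,p+1)
      end do

!     Sample correlation of the two columns versus the theoretical value.
      m1 = sum(x(:,1))/n
      m2 = sum(x(:,2))/n
      r  = sum((x(:,1) - m1)*(x(:,2) - m2))/ &
           sqrt(sum((x(:,1) - m1)**2)*sum((x(:,2) - m2)**2))
      print *, 'sample correlation      =', r
      print *, 'theoretical correlation =', a(1)*a(2)
      end program collinear_sketch

With a(1) = a(2) = .99 the sample correlation is typically close to .98, illustrating the severe collinearity the study was designed to produce.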
Once a set of predictors has been generated by (3), two choices for β are considered.  The motivation for these two choices is this: the mean square error depends on X, σ², β, and, in the case of ridge regression, on k.  Assume that σ², k, and X are fixed, so that the mean square error E(L_1²) is regarded as a function of β only.  Newhouse and Oman (1971) have observed that, subject to the constraint β'β = 1, E(L_1²) is minimized when β is the normalized eigenvector corresponding to the largest eigenvalue of X'X, and is maximized when β is the normalized eigenvector corresponding to the smallest eigenvalue of X'X.

McDonald and Galarneau's two choices for β are these eigenvectors.  The results can be idealized as a "best case" and a "worst case" for E(L_1²).
The choice of β was indicated in the Wichern-Churchill study by the "orientation," labeled φ.  The orientation is defined as φ = V'β, where V is the eigenvector corresponding to the smallest eigenvalue.  Since the eigenvectors are orthogonal to each other, it follows that φ = 1 if β corresponds to the smallest eigenvalue and φ = 0 if β corresponds to the largest eigenvalue.
To complete the model it is only necessary to compute e.  Since it is assumed that e_i ~ N(0, σ²), e can be generated as a set of pseudo-random variates once a value for σ² is provided.
The Study

The Wichern-Churchill study used a model with five predictor variables and 30 observations per regression.  Observations on the first three variables X_1, X_2, and X_3 were generated from

    x_ij = (1 - a²)^{1/2} z_ij + a z_i6,                                  (6)

while observations on X_4 and X_5 were generated from

    x_ij = (1 - (a*)²)^{1/2} z_ij + (a*) z_i6,                            (7)
where a and a* were two (not necessarily different) constants.  Five combinations of a and a* were used: (a,a*) = (.99, .99), (.99, .10), (.90, .90), (.90, .10), and (.70, .30).  The degree of collinearity was measured by the "spectral condition number," defined as λ_1/λ_5, where λ_1 was the largest eigenvalue of X'X and λ_5 was the smallest eigenvalue of X'X.  The spectral condition numbers for the five (a,a*) were λ_1/λ_5 = 278, 589, 41, 21, and 9.
In addition to changing the covariances of the predictors, the investigators used five different values for the error standard deviation: σ = .01, .10, .50, 1.0, and 5.0.  These five values correspond to five "signal to noise" ratios (p = β'β/σ²), p = 10,000, 100, 4, 1, and .04.  (Since β'β was the same in all cases, p was simply the inverse of σ².)
For every combination of (a,a*) and p there were two possible β: that resulting from φ = 0 and that resulting from φ = 1.  All in all, there were 50 possible combinations of parameters to consider, each of which was repeated 100 times while varying e.
The estimators considered in the Wichern-Churchill study are given in Table 1.  The basis for comparison was the ratio

    (total mean square error of Rule m)/(total mean square error of least squares),   m = 1,...,5.
Findings

The results of the simulation study are presented in Table 2.  A number of overall patterns can be noted.

Among the most noticeable of these is the fact that, almost without exception, the extent to which any given ridge solution improves on the mean square error of the least-squares estimator increases as p decreases.  In other words, ridge regression is useful where the error variance is high.
A second pattern which is evident is that ridge estimators show more improvement over the least squares solution where λ_1/λ_5 is large than where it is small.  This is to be expected, since ridge regression is specifically intended to handle multicollinearity.
On the whole, Rule 5 (Vinod's ridge trace estimator) does not show a consistent improvement over the least squares solution.  In fact, for large values of p it performs considerably worse, especially where φ = 1.  On the other hand, its improvement over the least squares estimator tends to be better than that of any of the other estimators except Rule 2 for small values of p.
Rule 2 shares some of the inconsistency of Rule 5, although not to the same degree.  Its performance is again worse for φ = 1 than for φ = 0.
Rules 1 and 4 are the most consistent in terms of
mean square error.
They are never much worse than the
least squares estimator and are often better.
Rule 3 is
only occasionally much worse than Rules 1 and 4, and can
often be somewhat better, especially for low values of p.
In the absence of any additional information,
Rules 2 and 5 seem the poorest choices to use by themselves, while 1 and 4 seem the best.
Table 1

Ridge Solutions Considered in the
Study of Wichern and Churchill

Rule    Definition

1       The Hoerl-Kennard (1970A) conservative estimator:
            k = σ̂²/α̂_max²

2       The Lawless-Wang (1976) estimator:
            k = pσ̂²/Σλ_iα̂_i²

3       The Hoerl-Kennard-Baldwin (1975) estimator:
            k = pσ̂²/Σα̂_i²

4       The McDonald-Galarneau (1975) estimator:
        choose k such that
            (α*(k))'(α*(k)) = α̂'α̂ - σ̂²Σ(1/λ_i)

5       The estimator of Vinod (1976):
        choose k to minimize
            Σ_i (pλ_i/((λ_i + k)²s) - 1)²,
        where s = Σ_i λ_i/(λ_i + k)²
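Rule 5 is the only rule in Table 1 defined through a minimization.  The sketch below evaluates the quantity it minimizes over a grid of k; the eigenvalues and the grid are illustrative only and are not taken from the study.

      program vinod_sketch
!     Sketch of Rule 5: for each k compute
!         s(k)    = sum of lambda(i)/(lambda(i) + k)**2
!         crit(k) = sum of ( p*lambda(i)/((lambda(i) + k)**2 * s(k)) - 1 )**2
!     and keep the k with the smallest value.  Illustrative inputs only.
      implicit none
      integer, parameter :: p = 5
      real :: lambda(p) = (/ 2.90, 1.20, 0.60, 0.25, 0.05 /)
      real :: k, s, crit, kbest, cbest
      integer :: i, step

      cbest = huge(1.0)
      kbest = 0.0
      do step = 0, 200
         k = 0.005*step
         s = sum(lambda/(lambda + k)**2)
         crit = 0.0
         do i = 1, p
            crit = crit + (p*lambda(i)/((lambda(i) + k)**2*s) - 1.0)**2
         end do
         if (crit < cbest) then
            cbest = crit
            kbest = k
         end if
      end do
      print *, 'k minimizing the criterion =', kbest
      print *, 'criterion value            =', cbest
      end program vinod_sketch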
Table 2

The Wichern-Churchill Simulation Results
(Source: Wichern and Churchill, 1978)