SLAC REPORT-355
STAN-LCS-106
UC-405
INTERPRETABLE PROJECTION PURSUIT*

Sally Claire Morton

Stanford Linear Accelerator Center
Stanford University
Stanford, California 94309

October 1989
Prepared for the Department of Energy
under contract number DE-AC03-76SF00515
Printed in the United States of America. Available from the National Technical Information Service, U.S. Department of Commerce, 5285 Port Royal Road, Springfield, Virginia 22161. Price: Printed Copy A06; Microfiche A01.

* Ph.D. Dissertation
Abstract

The goal of this thesis is to modify projection pursuit by trading accuracy for interpretability. The modification produces a more parsimonious and understandable model without sacrificing the structure which projection pursuit seeks. The method retains the nonlinear versatility of projection pursuit while clarifying the results.

Following an introduction which outlines the dissertation, the first and second chapters contain the technique as applied to exploratory projection pursuit and projection pursuit regression respectively. The interpretability of a description is measured as the simplicity of the coefficients which define its linear projections. Several interpretability indices for a set of vectors are defined based on the ideas of rotation in factor analysis and entropy. A roughness penalty approach is used to search for a more parsimonious description, replacing projection pursuit with interpretable projection pursuit. The two methods require slightly different indices due to their contrary goals. In the former case, a rotationally invariant projection index is needed and defined. In the latter, a weighting of interpretability with smoothness is required. The computational algorithms for both interpretable exploratory projection pursuit and interpretable projection pursuit regression are described; alterations in the original algorithms are required. Examples of real data are considered in each situation.

The third chapter deals with the connections between the proposed modification and other ideas which seek to produce more interpretable models. The modification as applied to linear regression is shown to be analogous to a nonlinear continuous method of variable selection. It is compared with other variable selection techniques and is analyzed in a Bayesian context. Possible extensions to other data analysis methods are cited and avenues for future research are identified. The conclusion addresses the issue of sacrificing accuracy for parsimony in general. An example of calculating the tradeoff between accuracy and interpretability due to a common simplifying action, namely rounding the binwidth for a histogram, illustrates the applicability of the approach.
Acknowledgments
I am grateful
to my principal
I also thank
enthusiasm.
advisor Jerry Friedman
my secondary
Persi Diaconis,
Art Owen, Ani Adhikari,
Kirk Cameron,
David Draper,
Knowles,
Michael
Sheehy, David
Martin,
Siegmund,
advisors and examiners:
Pregibon,
Hal Stern;
and
Brad Efron,
Joe Oliger; my teachers and colleagues:
Tom DiCiccio,
Daryl
for his guidance
Jim Hodges, Iain Johnstone,
Mark
John Rolph,
Anne
and my friends:
Joe Romano,
Mark
Barnett,
Ginger
Brower, Renata Byl, Ray Cowan, Marty
Dart, Judi Davis, Glen Diener, Heather
Gordon,
Arla
Holly
Marincovich,
This
work
Haggerty,
Curt
Lasher,
LeCount,
Alice
Lundin,
Michele
Mike Strange and Joan Winters.
was supported
in part
by the Department
DE-AC03-76F00515.
I dedicate this thesis to
my parents, sister, and brothers,
who inspire me by example.
of Energy,
Grant
Table of Contents

Abstract ........ iii
Acknowledgments ........ v
Introduction ........ 1
1. Interpretable Exploratory Projection Pursuit ........ 4
   1.1 The Original Exploratory Projection Pursuit Technique ........ 4
      1.1.1 Introduction ........ 4
      1.1.2 The Algorithm ........ 7
      1.1.3 The Legendre Projection Index ........ 9
      1.1.4 The Automobile Example ........ 11
   1.2 The Interpretable Exploratory Projection Pursuit Approach ........ 14
      1.2.1 A Combinatorial Strategy ........ 14
      1.2.2 A Numerical Optimization Strategy ........ 15
   1.3 The Interpretability Index ........ 17
      1.3.1 Factor Analysis Background ........ 17
      1.3.2 The Varimax Index For a Single Vector ........ 19
      1.3.3 The Entropy Index For a Single Vector ........ 22
      1.3.4 The Distance Index For a Single Vector ........ 24
      1.3.5 The Varimax Index For Two Vectors ........ 26
   1.4 The Algorithm ........ 27
      1.4.1 Rotational Invariance of the Projection Index ........ 28
      1.4.2 The Fourier Projection Index ........ 29
      1.4.3 Projection Axes Restriction ........ 35
      1.4.4 Comparison With Factor Analysis ........ 37
      1.4.5 The Optimization Procedure ........ 39
   1.5 Examples ........ 42
      1.5.1 An Easy Example ........ 42
      1.5.2 The Automobile Example ........ 46
2. Interpretable Projection Pursuit Regression ........ 57
   2.1 The Original Projection Pursuit Regression Technique ........ 57
      2.1.1 Introduction ........ 57
      2.1.2 The Algorithm ........ 58
      2.1.3 Model Selection Strategy ........ 60
   2.2 The Interpretable Projection Pursuit Regression Approach ........ 62
      2.2.1 The Interpretability Index ........ 62
      2.2.2 Attempts to Include the Number of Terms ........ 63
      2.2.3 The Optimization Procedure ........ 65
   2.3 The Air Pollution Example ........ 68
3. Connections and Conclusions ........ 82
   3.1 Interpretable Linear Regression ........ 82
   3.2 Comparison With Ridge Regression ........ 86
   3.3 Interpretability as a Prior ........ 88
   3.4 Future Work ........ 90
      3.4.1 Further Connections ........ 90
      3.4.2 Extensions ........ 91
      3.4.3 Algorithmic Improvements ........ 92
   3.5 A General Framework ........ 92
      3.5.1 The Histogram Example ........ 93
      3.5.2 Conclusion ........ 97
Appendix A. Gradients ........ 99
   A.1 Interpretable Exploratory Projection Pursuit Gradients ........ 99
   A.2 Interpretable Projection Pursuit Regression Gradients ........ 102
References ........ 104
Figure Captions

[1.1] Most structured projection scatterplot of the automobile data according to the Legendre index. ........ 12
[1.2] Varimax interpretability index for q = 1, p = 2. ........ 20
[1.3] Varimax interpretability index for q = 1, p = 3. ........ 21
[1.4] Varimax interpretability index contours for q = 1, p = 3. ........ 21
[1.5] Simulated data with n = 200 and p = 2. ........ 43
[1.6] Projection and interpretability indices versus λ for the simulated data. ........ 44
[1.7] Projected simulated data histograms for various values of λ. ........ 45
[1.8] Most structured projection scatterplot of the automobile data according to the Fourier index. ........ 47
[1.9] Most structured projection scatterplot of the automobile data according to the Legendre index. ........ 48
[1.10] Projection and interpretability indices versus λ for the automobile data. ........ 49
[1.11] Projected automobile data scatterplots for various values of λ. ........ 50
[1.12] Parameter trace plots for the automobile data. ........ 53
[1.13] Country of origin projection scatterplot of the automobile data. ........ 55
[2.1] Fraction of unexplained variance U versus number of terms m for the air pollution data. ........ 74
[2.2] Model paths for the air pollution data for models with number of terms m = 1, ..., 6. ........ 76
[2.3] Model paths for the air pollution data for models with number of terms m = 7, 8, 9. ........ 77
[2.4] Draftsman's display for the air pollution data. ........ 79
[3.1] Interpretable linear regression. ........ 84
[3.2] Interpretability prior density for p = 2. ........ 89
[3.3] Percent change in IMSE versus multiplying fraction e in the binwidth example. ........ 96

List of Tables

[1.1] Most structured Legendre plane index values for the automobile data. ........ 29
[1.2] Linear combinations for the automobile data. ........ 51
[1.3] Abbreviated linear combinations for the automobile data. ........ 52
Introduction

The goal of this thesis is to modify projection pursuit by trading accuracy for more interpretability in the results. The two techniques examined are exploratory projection pursuit (Friedman 1987) and projection pursuit regression (Friedman and Stuetzle 1981). The former is an exploratory data analysis method which produces a description of a group of variables. The latter is a formal modeling procedure which determines the relationship of a dependent variable to a set of predictors.
The common outcome of all projection pursuit methods is a collection of vectors which define the directions of the linear projections. The remaining component is nonlinear and is summarized pictorially rather than mathematically. For example, in exploratory projection pursuit, the description contains the projection and the histograms of the projected data points. In projection pursuit regression, the model contains the smooths of the dependent variable versus the projected predictors.

Given a dataset of n observations of p variables each, suppose q projections are made. The resulting matrix A of the linear combinations is q × p, each row corresponding to a projection direction. The statistician is faced with a collection of vectors and a nonlinear graphic representation. The statistician must try to understand and explain these vectors, both singly and as a group, in the context of the original p variables and in relation to the nonlinear components. The purpose of this thesis is to illustrate a method for trading some of the accuracy in the description or model in return for more interpretability or parsimony in the matrix A. The object is to retain the versatility and flexibility of this promising technique while increasing the clarity of the results.
In this dissertation, interpretability is used in a similar yet broader sense than parsimony, or simplicity, which may be thought of as a special case. The principle of parsimony is that as few parameters as possible should be used in a description or model. Tukey stated the concept in 1961 as 'It may pay not to try to describe in the analysis the complexities that are really present in the situation.' Several methods exist which choose more parsimonious models, such as Mallows' (1973) Cp in linear regression which balances the number of parameters and prediction error. Another example is the work of Dawes (1979), who restricts the parameters in linear models based on standardized data to be 0 or ±1, calling the resulting descriptions 'improper' linear models. His conclusion is that these models predict almost as well as ordinary linear regressions and do better than clinical intuition based on experience.

Throughout this thesis, accuracy is not measured in terms of the prediction of future observations but rather refers to the goodness-of-fit of the model or description to the particular data. Parsimony, while considering solely the number of parameters in a model, shares the general philosophical goals of interpretability. These goals are to produce results which are

(i) easier to understand.
(ii) easier to compare.
(iii) easier to remember.
(iv) easier to explain.
The adjective 'simple' is used interchangeably with 'interpretable' throughout this thesis, as is 'complex' with 'uninterpretable'. However, the new term interpretability is included in part to distinguish this notion from that of simplicity, which receives extensive treatment in the philosophical and Bayesian literature.
The quantification of interpretability is a difficult mathematical problem. The concept is not easy to define or measure. As Sober (1975) comments, 'the diversity of our intuitions about simplicity ... presents a veritable chaos of opinion.' Fortunately, some situations are easier than others. In particular, linear combinations as produced by projection pursuit and many other data analysis methods lend themselves readily to the development of an interpretability index. This index can serve as a 'cognostic' (Tukey 1983), or a diagnostic which can be interpreted by a computer rather than a human, in an automatic search for more interpretable results.

Exploratory projection pursuit is considered first in Chapter 1. A short review of the original method motivates the simplifying modification. The algorithmic approach chosen is supported versus alternative strategies. Various interpretability indices are developed. Changes to the original algorithm are required for the modification to be employed. An example of the resulting interpretable exploratory projection pursuit method is examined. Projection pursuit regression is considered in a similar manner in Chapter 2. The differing goals of this second procedure compel changes in the interpretability index. The chapter concludes with an example.
Chapter 3 connects the new approach with established techniques of trading accuracy for interpretability. Extensions to other data analysis methods that might benefit from this modification are proposed. The thesis closes with a general application of the tradeoff between accuracy and interpretability. This work is an example of an approach which simplifies the complex results of a novel statistical method. The hope is that the framework described within will be used by others in similar circumstances.
Chapter 1

Interpretable Exploratory Projection Pursuit

In this chapter, interpretable exploratory projection pursuit is demonstrated. Section 1.1 presents the basic concepts and goals of the original exploratory projection pursuit technique and outlines the algorithm. An example which provides the motivation for the new approach is included. The modification is described and support for the strategy chosen is given in the next section. The new method requires that a simplicity index be defined, which is discussed in Section 1.3. Section 1.4 details the algorithm, and its application to the example is described in the final section.
1.1 The Original Exploratory Projection Pursuit Technique

Exploratory projection pursuit (Friedman 1987) is an extension and improvement of the algorithm presented by Friedman and Tukey in 1974. It is a computer intensive data analysis tool for understanding high dimensional datasets. The method helps the statistician look for structure in the data without requiring any initial assumptions, while providing the basis for future formal modeling.
1.1.1 Introduction

Classical multivariate methods such as principal components analysis or discriminant analysis can be used successfully when the data is elliptical or normal in nature and well-described by its first few moments. Exploratory projection pursuit is designed to deal with the type of nonlinear structure these older techniques are ill-equipped to handle.

The method linearly projects the data cloud onto a line (one dimensional exploratory projection pursuit) or onto a plane (two dimensional exploratory projection pursuit). By reducing the dimensionality of the problem while maintaining the same number of datapoints, the technique overcomes the 'curse of dimensionality' (Bellman 1961). This malady is due to the fact that a huge number of points is required before structure is revealed in high dimensional space. A linear projection is chosen for two reasons (Friedman 1987). First, a linear projection is easier to understand as it consists of one or two linear combinations of the original variables. Second, a linear projection does not exaggerate structure in the data as it is a smoothed shadow of the actual datapoints.

The goal of exploratory projection pursuit is to find projections which exhibit structure. Initially, the idea was to let the statistician interactively choose interesting views by eye (McDonald 1982, Asimov 1985). Because the time required to perform an exhaustive search was prohibitive, the method was automated. Structure is no longer measured on a human pattern recognition scale but rather on a mathematical one. This simplification means that numerous possible indices may be defined. A numerical index which measures the structure of a view is defined and the space is searched via a computer optimization. Thus the scheme, while making the analysis feasible for large datasets, requires the careful choice of a projection index.

As Huber (1985) points out, many classical techniques are forms of exploratory projection pursuit for specific projection index choices. For example, consider principal components analysis. Let Y be a random vector in R^p. The definition of the jth principal component is the solution of the maximization problem

$$\max_{\beta_j}\ \operatorname{Var}[\beta_j^TY] \quad\ni\quad \beta_j^T\beta_j = 1 \ \text{ and }\ \beta_j^TY \text{ is uncorrelated with all previous principal components.}$$

Thus, principal components analysis is an example of one dimensional projection pursuit with variance (Var) as the projection index. Variance is a global measure of how structured a view is. In contrast, the novelty and applicability of exploratory projection pursuit lies in its ability to recognize nonlinear or local structure. As remarked previously, many definitions of structure and corresponding projection index choices exist.
All present indices, however, depend on the same basic premise. The idea is that though 'interesting' is difficult to define or agree on, 'uninteresting' is clearly normality. Friedman (1987) and Huber (1985) provide theoretical support for this choice. Effectively, the normal distribution can be explored adequately using traditional methods which explain covariance structure. Exploratory projection pursuit is attempting to address situations for which these methods are not applicable. The statistician must choose a method by which to measure distance from this normal origin, and it is that choice which leads the algorithm to views she holds interesting. The choice is thus based on individual preference for the type of structure to be found. Desirable computing and invariance properties also affect the decision. Distance metric considerations have been neglected so far in this discussion. These considerations are discussed in the concrete context of a particular index as the original algorithm is detailed in the next subsection.
1.1.2 The Algorithm

Two dimensional exploratory projection pursuit is more interesting and useful than one dimensional, so the former is discussed solely. The two dimensional situation also raises intriguing problems when interpretable exploratory projection pursuit is considered which need not be addressed in the one dimensional case. For the present, the goal is to find one structured projection plane. Actually, the data may have several interesting views or local projection index optima and the algorithm should find as many as possible. Section 1.5.3 addresses this point.

The original algorithm presented by Friedman (1987) is reviewed in the abstract version due to Huber (1985), thereby simplifying the notation. Thus, though the data consists of n observations of length p, consider first a random variable Y ∈ R^p. The goal is to find linear combinations (β₁, β₂) which

$$\max_{\beta_1,\beta_2}\ G(\beta_1^TY, \beta_2^TY) \quad\ni\quad \operatorname{Var}[\beta_1^TY] = \operatorname{Var}[\beta_2^TY] = 1 \ \text{ and }\ \operatorname{Cov}[\beta_1^TY, \beta_2^TY] = 0 \qquad [1.1]$$

where G is the projection index which measures the structure of the projection density. The constraints on the linear combinations ensure that the structure seen in the plane is not due to covariance (Cov) effects which can be dealt with by classical methods.
Initially, the original data Y is sphered (Tukey and Tukey 1981). The sphered variable Z ∈ R^p is defined as

$$Z \equiv D^{-1/2}U^T(Y - E[Y]) \qquad [1.2]$$

with U and D resulting from an eigensystem decomposition of the covariance matrix C of Y. That is,

$$C = E[(Y - E[Y])(Y - E[Y])^T] = UDU^T \qquad [1.3]$$

with U the orthonormal matrix of eigenvectors of C, and D the diagonal matrix of associated eigenvalues. The axes in the sphered space are the principal component directions of Y. The previous optimization problem [1.1] involving Y can be translated to one involving linear combinations α₁, α₂ and projections of Z:

$$\max_{\alpha_1,\alpha_2}\ G(\alpha_1^TZ, \alpha_2^TZ) \quad\ni\quad \alpha_1^T\alpha_1 = \alpha_2^T\alpha_2 = 1 \ \text{ and }\ \alpha_1^T\alpha_2 = 0 \qquad [1.4]$$

The fact that the standardization constraints imposed in the original problem are now geometric reduces the computational work required (Friedman 1987). Thus, all numerical calculations are performed on the sphered data Z. The sphering, however, is merely a computational convenience and is invisible to the statistician. In subsequent notation, the parameters of a projection index G are (β₁ᵀY, β₂ᵀY) though the value of the index is actually calculated for the sphered data [1.2]. The statistician is not concerned with the sphered variables; she associates the value of the index with the visual projection in the original data space.
After the maximizing sphered data plane is found, the vectors α₁ and α₂ are translated to the unsphered original variable space via

$$\beta_1 = UD^{-1/2}\alpha_1, \qquad \beta_2 = UD^{-1/2}\alpha_2. \qquad [1.5]$$

Since only the direction of the vectors matters, these combinations are usually normed as a final step.
The variance and correlation constraints on β₁ and β₂ can be written as

$$\beta_1^TC\beta_1 = \beta_2^TC\beta_2 = 1, \qquad \beta_1^TC\beta_2 = 0. \qquad [1.6]$$

Thus, β₁ and β₂ are orthogonal in the covariance metric while α₁ and α₂ are orthogonal geometrically.
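As a concrete illustration of the sphering step [1.2]-[1.3] and the back-transformation [1.5], a minimal sketch follows; Python with NumPy and the function names are assumptions of this illustration, not part of the original algorithm:

    import numpy as np

    def sphere(Y):
        # Z = D^{-1/2} U^T (Y - E[Y]), as in [1.2]-[1.3]; Y is an (n, p) array
        Yc = Y - Y.mean(axis=0)                  # center: Y - E[Y]
        C = np.cov(Yc, rowvar=False)             # covariance matrix C = U D U^T
        evals, U = np.linalg.eigh(C)             # eigensystem decomposition
        Z = Yc @ U / np.sqrt(evals)              # sphered variable Z
        return Z, U, evals

    def unsphere_direction(alpha, U, evals):
        # beta = U D^{-1/2} alpha, as in [1.5]; normed, since only the
        # direction of the vector matters
        beta = U @ (alpha / np.sqrt(evals))
        return beta / np.linalg.norm(beta)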
The optimization method used to solve the maximization problem [1.4] is a coarse search followed by an application of steepest descent, a derivative-based optimization procedure. The initial survey of the index space via a coarse stepping approach helps the algorithm avoid deception by a small local maximum. The second stage employs the derivatives of the index to make an accurate search in the vicinity of a good starting point.

The numerical optimization procedure requires that the projection index possess certain computational properties. The index should be fast and stable to compute, and must be differentiable. These criteria surface again with respect to the interpretability index defined in Section 1.3.
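The two-stage search can be sketched as follows. This is a schematic only: it substitutes random restarts for the coarse stepping grid and a numerical gradient for the analytic derivatives of Appendix A, and the step sizes are illustrative:

    import numpy as np

    def optimize_plane(index, p, n_starts=50, steps=200, lr=0.01, seed=0):
        # `index` maps an orthonormal pair (a1, a2) to a projection index value
        rng = np.random.default_rng(seed)

        def reorthonormalize(a1, a2):            # enforce a'a = 1 and a1'a2 = 0
            q, _ = np.linalg.qr(np.column_stack([a1, a2]))
            return q[:, 0], q[:, 1]

        def num_grad(a1, a2, h=1e-5):
            g1, g2 = np.zeros(p), np.zeros(p)
            for i in range(p):
                e = np.zeros(p); e[i] = h
                g1[i] = (index(a1 + e, a2) - index(a1 - e, a2)) / (2 * h)
                g2[i] = (index(a1, a2 + e) - index(a1, a2 - e)) / (2 * h)
            return g1, g2

        # stage 1: coarse survey of the index space to avoid small local maxima
        starts = [reorthonormalize(*rng.standard_normal((2, p)))
                  for _ in range(n_starts)]
        a1, a2 = max(starts, key=lambda ab: index(*ab))
        # stage 2: derivative-based refinement near the good starting point
        for _ in range(steps):
            g1, g2 = num_grad(a1, a2)
            a1, a2 = reorthonormalize(a1 + lr * g1, a2 + lr * g2)
        return a1, a2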
1.1.3 The Legendre Projection Index

Friedman's (1987) Legendre index exhibits these properties. He begins by transforming the sphered projections to a square with the definition

$$R_1 \equiv 2\Phi(\alpha_1^TZ) - 1, \qquad R_2 \equiv 2\Phi(\alpha_2^TZ) - 1$$

where Φ is the cumulative distribution function of a standard normal random variable. Under the null hypothesis that the projections are normal and uninteresting, the density p(R₁, R₂) is uniform on the square [−1, 1] × [−1, 1].
As a measure of nonnormality, he takes the integral of the squared distance from the uniform,

$$G_L(\beta_1^TY, \beta_2^TY) \equiv \int_{-1}^{1}\int_{-1}^{1}\left[p(R_1, R_2) - \tfrac{1}{4}\right]^2 dR_1\,dR_2. \qquad [1.7]$$

He expands the density p(R₁, R₂) using Legendre polynomials, which are orthogonal on the square with respect to a uniform weight function. This, along with subsequent integration taking advantage of the orthogonality relationships, yields an infinite sum involving expected values of the Legendre polynomials in R₁ and R₂, which are functions of the random variable Y. In application, the expansion is truncated and sample moments replace theoretical moments. Thus, instead of E[f(Y)] for a function f, the sample mean over the n observations y₁, y₂, ..., yₙ is calculated.

This index has been shown to find structure in the center of the distribution rather than in the tails. Thus, it identifies projections which exhibit clustering rather than heavy-tailed densities. The index has this characteristic as it is based on the sphered variables Z. G_L also has the property of 'affine invariance' (Huber 1985), which means that it is invariant under scale changes and location shifts of Y. Since exploratory projection pursuit should not be drawn in by covariance structure, this property is desirable for any projection index. The Legendre index is also stable and fast to compute. Research continues in the area of projection indices but so far alternatives have proved less computationally feasible. Other indices use different sets of polynomials to take advantage of different weight functions or use alternate methods of measuring distance from normality, as discussed in Section 1.4.2.

In the following section, an example of the original algorithm using G_L is discussed in order to provide the motivation for the proposed modification. After considering how to modify the method in order to make the results more interpretable, it is clear that the projection index needs a new theoretical property which G_L does not have. Consequently, a new index is defined in Section 1.4.2.
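A sample version of such a truncated Legendre expansion index can be sketched as follows. This follows the squared-distance-from-uniform construction in [1.7] but is not Friedman's exact implementation, and the truncation degree is an illustrative choice:

    import numpy as np
    from scipy.stats import norm
    from numpy.polynomial import legendre

    def legendre_index(z1, z2, degree=4):
        # z1, z2: projected sphered coordinates (alpha1'Z, alpha2'Z)
        r1 = 2.0 * norm.cdf(z1) - 1.0            # R1 = 2 Phi(alpha1'Z) - 1
        r2 = 2.0 * norm.cdf(z2) - 1.0            # R2 = 2 Phi(alpha2'Z) - 1
        index = 0.0
        for i in range(degree + 1):
            for j in range(degree + 1):
                if i == 0 and j == 0:
                    continue                     # the (0,0) term is the uniform null
                ci = np.zeros(i + 1); ci[i] = 1.0
                cj = np.zeros(j + 1); cj[j] = 1.0
                # sample moment E[P_i(R1) P_j(R2)] over the n observations
                m = np.mean(legendre.legval(r1, ci) * legendre.legval(r2, cj))
                index += (2 * i + 1) * (2 * j + 1) / 4.0 * m ** 2
        return index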
1.1.4 The Automobile Example

The automobile dataset (Friedman 1987) consists of ten variables collected on 392 car models and reported in 'Consumer Reports' from 1972 to 1982:

Y1  : gallons per mile (fuel inefficiency)
Y2  : number of cylinders in engine
Y3  : size of engine (cubic inches)
Y4  : engine power (horse power)
Y5  : automobile weight
Y6  : acceleration (time from 0 to 60 mph)
Y7  : model year
Y8  : American (0,1)
Y9  : European (0,1)
Y10 : Japanese (0,1)
The second variable has 5 possible values while the last three are binary, indicating the car's country of origin. As Friedman suggests, these variables are gaussianized, which means the discrete values are replaced by normal scores after any repeated observations are randomly ordered. The object is to ensure that their discrete marginals do not overly bias the search for structure. All the variables are standardized to have zero mean and unit variance before the analysis.

In addition, the linear combinations which define the maximizing plane (β₁, β₂) are normed to length one. As a result, the coefficient for a variable in a combination represents its relative importance.
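The gaussianizing step may be sketched as below, assuming SciPy; the random tie-breaking and the rank-to-score mapping (ranks scaled by n + 1) are illustrative choices, not necessarily Friedman's:

    import numpy as np
    from scipy.stats import norm

    def gaussianize(x, rng=None):
        # replace discrete values by normal scores after randomly ordering ties
        rng = np.random.default_rng() if rng is None else rng
        n = len(x)
        order = np.lexsort((rng.random(n), x))   # sort by value, ties broken at random
        ranks = np.empty(n)
        ranks[order] = np.arange(1, n + 1)
        scores = norm.ppf(ranks / (n + 1.0))     # normal scores
        return (scores - scores.mean()) / scores.std()   # zero mean, unit variance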
The definition of the solution plane is

$$\beta_1 = (-0.21, -0.03, -0.91, 0.16, 0.30, -0.05, -0.01, 0.03, 0.00, -0.02)^T$$
$$\beta_2 = (-0.75, -0.13, 0.43, 0.45, -0.07, 0.04, -0.15, -0.03, 0.02, -0.01)^T$$

That is, the horizontal coordinate of each observation is −0.21 times fuel inefficiency (standardized), −0.03 times the number of cylinders (gaussianized and standardized), and so on. The scatterplot of the points is shown in Figure 1.1. In the projection scatterplot, the combinations (β₁, β₂) are the orthogonal horizontal and vertical axes and each observation is plotted as (β₁ᵀY, β₂ᵀY).
[Figure 1.1 about here]

Fig. 1.1 Most structured projection scatterplot of the automobile data according to the Legendre index.

These vectors are orthogonal in the covariance metric due to the constraint [1.6]. However, in the usual reference system, the orthogonal axes correspond to the variables. For example, the x axis is Y1, the y axis is Y2, the z axis is Y3, and so on. The combinations are not orthogonal in this reference frame. This departure from common graphic convention is discussed further in Section 1.4.3.

The first step is to look at the structure of the projected points. In this case, the points are clustered along the vertical axis at low values and straggle out to the upper right corner with a few outliers to the left. The obvious concern is whether this structure actually exists in the data or is due to sampling fluctuation. The value of the index G_L for this particular view is 0.35 and the question is whether this value is significant. Friedman (1987) approximates the answer to this question by generating values of the index for the same number
of observations and dimensions (n and p) under the null hypothesis of normality. Comparison of the observed value with these generated ones gives an idea of how unusual the former is. Sun (1989) delves deeper into the problem, providing an analytical approximation for a critical value given the data size and chosen significance level. The structure found in this example is significant.

Given the clustering exhibited along the vertical axis, the second step is to attempt to interpret the linear combinations which define the projection plane. Though by projecting the data from ten to two dimensions, exploratory projection pursuit has reduced the visual dimension of the problem, the linear projections must still be interpreted in the original number of variables. The structure is represented by a set of points in two dimensions but understanding what these points represent in terms of the original data requires considering all ten variables.

An infinite number of pairs of vectors (β₁, β₂) exist which satisfy the constraints [1.6] and define the most structured plane. In other words, the two vectors can be spun around the origin in the plane rigidly via an orthogonal rotation and they still satisfy the constraints and maintain the structure found. The orientation of the scatterplot is inconsequential to its visual representation of the structure.

These facts lead to the principle that a plane should be defined in the simplest or most interpretable way possible. Given that the data is standardized and the linear combinations have length one, the coefficients represent the individual contribution of each variable to the combination. Friedman (1987) attempts to find the simplest representation of the plane by spinning the vectors until the variance of the squared coefficients of the vertical coordinate, the second combination β₂, is maximized. This action forces the coefficients of β₂ to differ as much as possible. As a consequence, variables are forced out of the combination. The more interpretable vector has fewer variables in it. Such a parsimonious vector is easier to understand, compare, remember and explain.
The goal of the present work is to expand this approach. A precise definition of interpretability is discussed in Section 1.3. Criteria which involve both combinations are considered. More importantly, not only is rotation of the vectors in the plane allowed but also the solution plane may be rocked in p dimensions slightly away from the most structured plane. The resulting loss in structure is acceptable only if the gain in interpretability is deemed sufficient.

1.2 The Interpretable Exploratory Projection Pursuit Approach

As shown in the preceding example, exploratory projection pursuit produces the scatterplot (β₁ᵀY, β₂ᵀY), the value of the index G_L(β₁ᵀY, β₂ᵀY), and the linear combinations (β₁, β₂). The scatterplot's nonlinear structure may be visually assessed but attaching much meaning to the actual numerical value of the index is difficult. As remarked earlier, the linear combinations must still be interpreted in terms of the original number of variables. In fact, a mental attempt to understand these vectors may involve 'rounding them by eye' and dropping variables which have 'too small' coefficients. The question to be considered is whether the linear combinations can be made more interpretable without losing too much observed structure in the projection. The idea of the present modification is to trade some of the structure found in the scatterplot in return for more comprehensibility in (β₁, β₂).
1.2.1 A Combinatorial Strategy

If initially interpretability is linked with parsimony, a combinatorial method of variable selection might be considered. The analogous approach in linear regression is all subsets regression. A similar idea, principal variables, has been applied to principal components (Krzanowski 1987, McCabe 1984). This method restricts the linear transformation matrix to a specific form, for example each row has two ones and all other entries zero. The result is a variable selection method with each component formed from two of the original variables. Both variable selection methods, all subsets and principal variables, are discrete in nature and only consider whether a variable is in or out of the model.

Applying a combinatorial strategy to exploratory projection pursuit results in the following number of solutions, each of which consists of a pair of variable subsets. Given p variables and the fact that each variable is either in or out of a combination, 2^p possible single subsets of variables exist. The combinations are symmetric, meaning that the (β₁, β₂) plane is the same as the (β₂, β₁) plane. Thus, the number of pairs of subsets with unequal members is $\binom{2^p}{2}$. However, this count does not include the 2^p pairs with both subsets identical, which are permissible solutions. It does include 2^p + p degenerate pairs in which one subset is empty, or both subsets consist of the same single variable. These degenerate pairs do not define planes. The total number of solutions is

$$\binom{2^p}{2} + 2^p - 2^p - p = 2^{2p-1} - 2^{p-1} - p. \qquad [1.8]$$

Some planes counted may not be permissible due to the correlation constraint as discussed in Section 1.4.3, so the actual count may be slightly less. However, the principal term remains of the order of $2^{2p-1}$. The number of subsets grows exponentially with p and for each subset the optimization must be completely redone. Unlike all subsets regression, no smart search methods exist for eliminating poor contenders which do not produce interesting projections. Due to the time required to produce a single exploratory projection pursuit solution, this combinatorial approach is not feasible.
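The growth of [1.8] is easy to tabulate; a quick check (Python, for illustration):

    def n_solution_planes(p):
        # total number of subset-pair planes from [1.8]
        return 2 ** (2 * p - 1) - 2 ** (p - 1) - p

    # even p = 10 variables already yield 523,766 candidate planes,
    # each requiring a complete exploratory projection pursuit optimization
    print([n_solution_planes(p) for p in (4, 6, 8, 10)])
    # [116, 2010, 32632, 523766]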
1.2.2 A Numerical Optimization Strategy

Given that a numerical optimization is already being done, an approach is to consider whether a variable selection or interpretability criterion can be included in the optimization. The objective function used is

$$(1-\lambda)\,\frac{G(\beta_1^TY, \beta_2^TY)}{\max G} + \lambda\,S(\beta_1, \beta_2) \qquad [1.9]$$

for λ ∈ [0, 1]. This function is the weighted sum of the projection index G contribution and an interpretability or simplicity index S contribution, indexed by the interpretability parameter λ. The interpretability index S is defined to have values ∈ [0, 1]. Analogously, the value of the projection index is divided by its maximum possible value in order that its contribution is also ∈ [0, 1].

The interpretable exploratory projection pursuit algorithm is applied in an iterative manner. First, find the best plane using the original exploratory projection pursuit algorithm. Set max G equal to the index value for this most structured plane. For a succession of values (λ₁, λ₂, ..., λ_l), such as (0.1, 0.2, ..., 1.0), solve [1.9] with λ = λᵢ. In each case, use the previous λᵢ₋₁ solution as a starting point.

One way to envision the procedure is to imagine beginning at the most structured solution and being equipped with an interpretability dial. As the dial is turned, the value of λ increases as the weight on simplicity, or the cost of complexity, is increased. The plane rocks smoothly away from the initial solution. If the projection index is relatively flat near the maximum (Friedman 1987), the loss in structure is gradual. When to stop turning the dial is discussed in Section 1.5.3.
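Computationally, the dial-turning procedure is a continuation loop with warm starts. In the sketch below, `solve` stands for any routine maximizing an objective over orthonormal pairs (such as the one outlined in Section 1.1.2); the names and the λ grid are illustrative:

    import numpy as np

    def interpretability_path(solve, G, S, a1, a2):
        # a1, a2: the most structured plane from the original algorithm
        G_max = G(a1, a2)                        # max G, for the [0, 1] scaling
        path = [(0.0, a1, a2)]
        for lam in np.arange(0.1, 1.01, 0.1):    # the succession lambda_1, ..., lambda_l
            def objective(b1, b2, lam=lam):      # weighted criterion [1.9]
                return (1.0 - lam) * G(b1, b2) / G_max + lam * S(b1, b2)
            a1, a2 = solve(objective, a1, a2)    # warm start at lambda_{i-1} solution
            path.append((lam, a1, a2))
        return path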
The additive functional form choice for the objective function [1.9] is made by drawing a parallel with roughness penalty methods for curve-fitting (Silverman 1984). In that context, the problem is a minimization. The first term is a goodness-of-fit criterion such as the squared distance between the observed and fitted values, and the second is a measure of the roughness of the curve such as the integrated square of the curve's second derivative. As the fit improves, the curve becomes more rough. The negative of roughness, or smoothness, is comparable to interpretability.

An alternate idea is to think of solving a series of constrained maximization subproblems

$$\max_{\beta_1,\beta_2}\ G(\beta_1^TY, \beta_2^TY) \quad\ni\quad S(\beta_1, \beta_2) \ge c_i \qquad [1.10]$$

for values of cᵢ such as (0.0, 0.1, ..., 1.0). This problem may be rewritten as an unconstrained maximization using the method of Lagrangian multipliers. Reinsch (1967) describes the relationship between cᵢ and λᵢ, and consequently the relationship between [1.9] and [1.10].

The computational savings for this numerical method versus a combinatorial one can be substantial. The amount of work required by the functional approach [1.9] is linear in l. The inner loop, namely finding an exploratory projection pursuit solution, is the same for either. The outer loop, however, is reduced from [1.8] to l.
1.3 The Interpretability Index

The interpretability index S measures the simplicity of the pair of vectors (β₁, β₂). It has a minimum value of zero at the least simple pair and a maximum value of one at the most simple. Like the projection index G, it needs to be differentiable and fast to compute.
1.3.1 Factor Analysis Background

For two dimensional exploratory projection pursuit, the object is to simplify a 2 × p matrix. Consider a general q × p matrix Ω with entries ω_ij, corresponding to q dimensional exploratory projection pursuit. What characteristics does an interpretable matrix have? When is one matrix more simple than another? Researchers have considered such questions with respect to factor loading matrices. The goal in factor analysis is to explain the correlation structure in the variables via the factor model. The solution is not unique and the factor matrix often is rotated to make it more interpretable. Comments are made in Section 1.4.4 regarding the geometric difference between this rotation and the interpretable rocking of the most structured plane. Though the two situations are different, the goal is the same and so factor analysis rotation research is used as a starting point in the development of a simplicity index S.

Intuitively, the interpretability of a matrix may be thought of in two ways. 'Local' interpretability measures how simple a combination or row is individually. In a general and discrete sense, the more zeros a vector has, the more interpretable it is as less variables are involved. 'Global' interpretability measures how simple a collection of vectors is. Given that the vectors are defining a plane and should not collapse on each other, a simple set of vectors is one in which each row clearly contains its own small set of variables and has zeros elsewhere.

Thurstone (1935) advocated 'simple structure' in the factor loading matrix, defined by a list of desirable properties which were discrete in nature. For example, each combination (row) should have at least one zero and for each pair of rows, only a few columns should have nonzero entries in both rows. In summary, his requirements correspond to a matrix which involves, or has nonzero entries for, only a subset of variables (columns). Combinations should not overlap too much or have nonzero coefficients for the same variables and those that do should clearly divide into subgroups.

These discrete notions of interpretability must be translated into a continuous measure which is tractable for computer optimization. Local simplicity for a single vector is discussed first and the results are extended to a set of two vectors.
1.3.2 The Varimax Index For a Single Vector

Consider a single vector w = (w₁, w₂, ..., w_p)ᵀ. In a discrete sense, interpretability translates into as few variables in the combination or as many zero entries as possible. The goal is to smooth the discrete count interpretability index

$$D(w) \equiv \sum_{i=1}^p I\{w_i = 0\} \qquad [1.11]$$

where I{·} is the indicator function.

Since the exploratory projection pursuit linear combinations represent directions and are usually normed as a final step, the index should involve the normed coefficients. In addition, in order to smooth the notion that a variable is in or out of the vector, the general unevenness of the coefficients should be measured. The sign of the coefficients is inconsequential. In conclusion, the index should measure the relative mass of the coefficients irrespective of sign and thus should involve the normed squared quantities

$$\frac{w_i^2}{w^Tw}, \qquad i = 1, \ldots, p.$$

In the 1950's, several factor analysis researchers arrived separately at the same criterion (Gorsuch 1983, Harman 1976) which is known as 'varimax' and is the variance of the normed squared coefficients. This is the criterion which Friedman (1987) used as discussed in Section 1.1.4. The corresponding interpretability index is denoted by S_v and is defined as

$$S_v(w) \equiv \frac{p}{p-1}\sum_{i=1}^p\left(\frac{w_i^2}{w^Tw} - \frac{1}{p}\right)^2. \qquad [1.12]$$

The leading constant is added to make the index value be ∈ [0, 1].
Fig. 1.2 Varimax interpretability index for q = 1, p = 2. The value of the index for a linear combination w in two dimensions is plotted versus the angle of the direction in radians over the range [0, π].

Fig. 1.2 shows the value of the varimax index for a linear combination w in two dimensions (p = 2) versus the angle of the direction arctan(w₂/w₁) of the linear combination w. Fig. 1.3 shows the varimax index for vectors w of length one in three dimensions (p = 3). Only vectors with all positive components are plotted due to symmetry. Fig. 1.3 shows the value of the index as the vertical coordinate versus the values of (w₁, w₂). The value of w₃ is known and does not need to be graphed. Fig. 1.4 shows contours for the surface in Fig. 1.3. These contours are just the curves of points w which satisfy the equation formed when the left side of [1.12] is set equal to a constant value. As the value of the interpretability index is increased, the contours move away from (1/√3, 1/√3, 1/√3) toward the three points e₁ = (1,0,0), e₂ = (0,1,0) and e₃ = (0,0,1). The centerpoint is the contour for S_v(w) = 0.0 and the next three joined curves are contours for S_v(w) = 0.01, 0.05, 0.2. The next three sets of lines going outward toward the eᵢ's are contours corresponding to S_v(w) = 0.3, 0.6, 0.8.

Fig. 1.3 Varimax interpretability index for q = 1, p = 3. The surface of the index is plotted as the vertical coordinate versus the first two coordinates (w₁, w₂) of vectors of length one in the first quadrant.

Fig. 1.4 Varimax interpretability index contours for q = 1, p = 3. The axes are the components (w₁, w₂, w₃). The contours, from the center point outward, are those points which have varimax values S_v(w) = (0.0, 0.01, 0.05, 0.2, 0.3, 0.6, 0.8).

Since w is normed, the varimax criterion S_v is equivalent to the 'quartimax' criterion, which derives its name from the fact it involves the sum of the fourth powers of the coefficients. S_v is also the coefficient of variation squared of the squared vector components.
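In code, [1.12] is a one-liner; the two checks mirror the extremes just described (Python, for illustration):

    import numpy as np

    def S_v(w):
        # varimax interpretability index [1.12]; values in [0, 1]
        u = w ** 2 / (w @ w)                     # normed squared coefficients
        p = len(w)
        return p / (p - 1.0) * np.sum((u - 1.0 / p) ** 2)

    print(S_v(np.array([0.0, 1.0, 0.0])))        # 1.0 at a unit axis vector
    print(S_v(np.ones(3) / np.sqrt(3.0)))        # 0.0 at an equally weighted vector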
1.3.3 The Entropy Index For a Single Vector

The vector of normed squared coefficients has length one and all entries are positive, similar to a multinomial probability vector. The negative entropy of a set of probabilities measures how nonuniform the distribution is (Renyi 1961). The more simple a vector is, the more uneven or distinguishable its entries are from one another. Thus a second possible interpretability index is the negative entropy of the normed squared coefficients or

$$S_e(w) \equiv 1 + \frac{1}{\ln p}\sum_{i=1}^p \frac{w_i^2}{w^Tw}\,\ln\frac{w_i^2}{w^Tw}.$$

The usual entropy measure is slightly altered to have values ∈ [0, 1]. The two simplicity measures S_v and S_e share four common properties.

Property 1. Both are maximized when

$$\frac{w}{\|w\|} = \pm e_j, \qquad j = 1, \ldots, p$$

where the e_j, j = 1, ..., p are the unit axis vectors. Thus, the maximum value occurs when only one variable is in the combination.
Property 2. Both are minimized when

$$\frac{w_i^2}{w^Tw} = \frac{1}{p}, \qquad i = 1, \ldots, p,$$

or when the projection is an equally weighted average. The argument could be made that an equally weighted average is in fact simple. However, in terms of deciding which variable most clearly affects the projection, it is the most difficult to interpret.

Property 3. Both are symmetric in the coefficients wᵢ. No variable counts more than any other.

Property 4. Both are strictly Schur-convex as defined below. The following explanation follows Marshall and Olkin (1979).
Definition. Let ζ = (ζ₁, ζ₂, ..., ζ_p) and γ = (γ₁, γ₂, ..., γ_p) be any two vectors ∈ R^p. Let ζ_[1] ≥ ζ_[2] ≥ ··· ≥ ζ_[p] and γ_[1] ≥ γ_[2] ≥ ··· ≥ γ_[p] denote their components in decreasing order. The vector ζ majorizes γ (ζ ≻ γ) if

$$\sum_{i=1}^k \zeta_{[i]} \ge \sum_{i=1}^k \gamma_{[i]}, \quad k = 1, \ldots, p-1, \qquad\text{and}\qquad \sum_{i=1}^p \zeta_{[i]} = \sum_{i=1}^p \gamma_{[i]}.$$

The above definition holds ⟺ γ = ζP where P is a doubly stochastic matrix, that is P has nonnegative entries, and column and row sums of one. In other words, if γ is a smoothed or averaged version of ζ, it is majorized by ζ. An example of a set of majorizing vectors is

$$(1, 0, \ldots, 0) \succ \left(\tfrac{1}{2}, \tfrac{1}{2}, 0, \ldots, 0\right) \succ \cdots \succ \left(\tfrac{1}{p-1}, \ldots, \tfrac{1}{p-1}, 0\right) \succ \left(\tfrac{1}{p}, \ldots, \tfrac{1}{p}\right).$$
Definition. A function f : R^p → R is strictly Schur-convex if

$$\zeta \succ \gamma \implies f(\zeta) \ge f(\gamma),$$

with strict inequality if γ is not a permutation of ζ.

This type of convexity is an extension of the usual idea of Jensen's Inequality. Basically, if a vector ζ is more spread out or uneven than γ, then S(ζ) > S(γ). This intuitive idea of interpretability now has an explicit mathematical meaning. The two indices S_v and S_e rank all majorizable vectors in the same order. Using the theory of Schur-convexity, a general class of simplicity indices could be defined.
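The common ordering can be checked numerically on the majorizing chain displayed above; a small illustration with S_v from [1.12] and S_e as defined in this section (Python, for illustration):

    import numpy as np

    def S_v(w):
        u = w ** 2 / (w @ w); p = len(w)
        return p / (p - 1.0) * np.sum((u - 1.0 / p) ** 2)

    def S_e(w):
        u = w ** 2 / (w @ w); p = len(w)
        nz = u[u > 0]                            # convention: 0 ln 0 = 0
        return 1.0 + np.sum(nz * np.log(nz)) / np.log(p)

    # vectors whose normed squares form the chain (1,0,...,0) > ... > (1/p,...,1/p);
    # both indices decrease along it, ranking the vectors in the same order
    p = 5
    for k in range(1, p + 1):
        w = np.sqrt(np.r_[np.ones(k) / k, np.zeros(p - k)])
        print(k, round(S_v(w), 3), round(S_e(w), 3))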
1.3.4 The Distance Index For a Single Vector

Besides the variance interpretation, S_v measures the squared distance from the normed squared vector to the point (1/p, 1/p, ..., 1/p), which might thus be called the least simple or most complex point. Let the notation for the Euclidean norm of any vector ξ be ‖ξ‖. If u_w is defined to be the squared and normed version of w,

$$u_w \equiv \left(\frac{w_1^2}{w^Tw}, \frac{w_2^2}{w^Tw}, \ldots, \frac{w_p^2}{w^Tw}\right)^T,$$

and u_c is the most complex point, then

$$S_v(w) = \frac{p}{p-1}\,\|u_w - u_c\|^2. \qquad [1.13]$$
This index can be generalized. In contrast to having one most complex point, an alternate index can be defined by considering a set V = {υ₁, ..., υ_J} of J simple points. This set must have the properties

$$\upsilon_{ji} \ge 0, \quad i = 1, \ldots, p,\ j = 1, \ldots, J$$
$$\sum_{i=1}^p \upsilon_{ji} = 1, \quad j = 1, \ldots, J. \qquad [1.14]$$

The υ_j's are the squares of vectors on the unit sphere ∈ R^p, as are u_w and u_c. An example is V = {e_j : j = 1, ..., p}, with J = p. In the event that this index is used with exploratory projection pursuit, the statistician could define her own set of simple points rather than be restricted to the choice of u_c used in S_v.

This distance would be large when w is not simple, so the interpretability index should involve the negative of it. An example is

$$S_d(w) \equiv 1 - c \min_{j=1,\ldots,J}\|u_w - \upsilon_j\|^2. \qquad [1.15]$$

The constant c is calculated so that the values are ∈ [0, 1]. Any distance norm can be used and an average or total distance could replace the minimum.

If V is chosen to be the e_j's and the minimum Euclidean norm is used, the distance index becomes

$$S_d'(w) \equiv 1 - \frac{p}{p-1}\left[\sum_{i=1}^p\left(\frac{w_i^2}{w^Tw}\right)^2 - 2\,\frac{w_k^2}{w^Tw} + 1\right]$$

where k corresponds to the maximum |wᵢ|, i = 1, ..., p, the largest absolute coefficient in the vector. The minimization does not need to be done at each step. Analogous results are obtainable for similar choices of V such as all permutations of two ½ entries and p − 2 zeros ((½, ½, 0, ..., 0), (½, 0, ½, 0, ..., 0), ...) corresponding to simple solutions of two variables each.
Since the e_j's maximize S_v and are the simple points associated with S_d', the relationship between the two indices proves interesting. Algebra reveals

$$S_d'(w) = -S_v(w) + \frac{2}{p-1}\left(p\,\frac{w_k^2}{w^Tw} - 1\right).$$

The minimum of the second term occurs when

$$\frac{w_i^2}{w^Tw} = \frac{1}{p}$$

or all coefficients are equal, and S_d'(w) = S_v(w) = 0. The maximum occurs when

$$\frac{w_k^2}{w^Tw} = 1$$

and S_d'(w) = S_v(w) = 1. The relationship between the interim values varies.

The difficulty with the distance index S_d' is that its derivatives are not smooth and present problems when a derivative-based optimization procedure is used. Given that the entropy index S_e and the varimax index S_v share common properties and the latter is easier to deal with computationally, the varimax approach is generalized to two dimensions.
1.3.5 The Varimax Index For Two Vectors

The varimax index S_v can be extended to measure the simplicity of a set of q vectors w_j = (w_{j1}, w_{j2}, ..., w_{jp}), j = 1, ..., q. In the following, the varimax index [1.12] is called S₁. In order to force orthogonality between the squared normed vectors, the variance is taken across the vectors and summed over the variables to produce

$$S_v(w_1, \ldots, w_q) = c_q\sum_{i=1}^p\left[\frac{1}{q}\sum_{j=1}^q u_{ji}^2 - \left(\frac{1}{q}\sum_{j=1}^q u_{ji}\right)^2\right] \qquad [1.16]$$

where u_{ji} = w_{ji}²/w_jᵀw_j is the normed squared coefficient and c_q is an appropriate norming constant. If the sums were reversed and the variance was taken across the variables and summed over the vectors, the unnormed index would equal

$$S_1(w_1) + S_1(w_2) + \cdots + S_1(w_q) \qquad [1.17]$$

with each element the one dimensional simplicity [1.12] of the corresponding vector. The previous approach [1.16] results in a cross-product term. For two dimensional exploratory projection pursuit, q = 2 and the index, with appropriate norming, reduces to

$$S_v(w_1, w_2) = \frac{p-1}{2p}\left[S_1(w_1) + S_1(w_2)\right] + \frac{1}{p} - u_1^Tu_2 \qquad [1.18]$$

where u₁ and u₂ are the normed squared versions of w₁ and w₂. The first term measures the local simplicities of the vectors while the second is a cross-product term measuring the orthogonality of the two normed squared vectors. This cross-product forces the vectors to 'squared orthogonality', so that different groups of variables appear in each vector. S_v is maximized when the normed squared versions of w₁ and w₂ are e_k and e_l, k ≠ l, and minimized when both are equal to (1/p, 1/p, ..., 1/p).
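A sketch of [1.18] as reconstructed above; the norming constant follows from the variance-across-vectors construction and should be treated as an assumption of this illustration (Python):

    import numpy as np

    def S_v2(w1, w2):
        # two-vector varimax index [1.18]: local simplicities plus cross-product term
        p = len(w1)
        u1, u2 = w1 ** 2 / (w1 @ w1), w2 ** 2 / (w2 @ w2)
        s1 = lambda u: p / (p - 1.0) * np.sum((u - 1.0 / p) ** 2)
        return (p - 1.0) / (2.0 * p) * (s1(u1) + s1(u2)) + 1.0 / p - u1 @ u2

    e1, e2 = np.eye(3)[0], np.eye(3)[1]
    print(S_v2(e1, e2))                          # 1.0: disjoint single variables
    print(S_v2(e1, e1))                          # 0.0: both vectors collapse together
    print(S_v2(np.ones(3), np.ones(3)))          # 0.0: equally weighted averages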
1.4 The Algorithm

In this section, the algorithm used to solve the interpretable exploratory projection pursuit problem [1.9] with interpretability index S_v is discussed. The general approach of the original exploratory projection pursuit algorithm is followed but two changes are required, as described in the first three subsections.
1.4.1 Rotational Invariance of the Projection Index

As discussed in Section 1.1.4, the orientation of a scatterplot is immaterial and a particular observed structure should have the same value of the projection index G no matter from which direction it is viewed. An alternative way of stating this is that the projection index should be a function of the plane, not of the way the plane is represented (Jones 1983). The interpretable projection pursuit algorithm should always describe a plane in the simplest way possible as measured by S_v. Given these two facts, any projection index used by the algorithm should have the property of rotational invariance.

Definition. A projection index G is rotationally invariant if

$$G(\beta_1^TY, \beta_2^TY) = G(\eta_1^TY, \eta_2^TY) \qquad [1.19]$$

where (η₁, η₂) is (β₁, β₂) rotated through an angle θ, or

$$(\eta_1, \eta_2) = (\beta_1, \beta_2)\,Q(\theta)$$

with Q the orthogonal rotation matrix associated with the angle θ.

Rotational invariance should not be confused with affine invariance. As remarked in Section 1.1.3, the latter property is a welcome byproduct of sphering. Friedman's Legendre index G_L [1.7] is not rotationally invariant. This property is not required for his algorithm. As described in Section 1.1.4, he simplifies the axes after finding the most structured plane by maximizing the single vector varimax index S₁ for the second combination. He does not allow the solution to rock away from the most structured plane in order to increase interpretability.
Recall that the first step in calculating G_L is to transform the marginals of the projected sphered data to a square. Under the null hypothesis, the projection is N(0, I), which means the projection scatterplot looks like a disk and the orientation of the axes is immaterial. Intuitively, however, if the null hypothesis is not true, the placement of the axes affects the marginals and thus the index value.

Empirically, this lack of rotational invariance is evident. In the automobile example from Section 1.1.4, the most structured plane was found to have G_L(β₁ᵀY, β₂ᵀY) = 0.35. The projection index values for the scatterplot as the axes are spun through a series of angles are shown in Table 1.1. A new index, which seeks to maintain the computational properties and to find the same structure as G_L, is developed in the next subsection.

θ                  0     π/20  2π/20 3π/20 4π/20 5π/20 6π/20 7π/20 8π/20 9π/20 π/2
G_L(β₁ᵀY, β₂ᵀY)    0.35  0.36  0.34  0.32  0.30  0.29  0.29  0.29  0.30  0.32  0.35

Table 1.1 Most structured Legendre plane index values for the automobile data. Values of the Legendre index for different orientations of the axes in the most structured plane are given.
1.4.2 The Fourier Projection Index

The Legendre index G_L is based on the Cartesian coordinates of the projected sphered data (α₁ᵀZ, α₂ᵀZ). The index is based on knowing the distribution of these coordinates under the null hypothesis of normality. Polar coordinates are the natural alternative given that rotational invariance is desired. The distribution of these coordinates is also known under the null hypothesis. The expansion of the density via orthogonal polynomials is done in a manner similar to that of G_L.
The polar coordinates of a projected sphered point (α₁ᵀZ, α₂ᵀZ) are

$$R \equiv (\alpha_1^TZ)^2 + (\alpha_2^TZ)^2, \qquad \Theta \equiv \arctan\!\left(\frac{\alpha_2^TZ}{\alpha_1^TZ}\right).$$

Actually, the usual polar coordinate definition involves the radius of the point rather than its square but the notation is easier given the above definition of R. Under the null hypothesis that the projection is N(0, I), R and Θ are independent and

$$R \sim \operatorname{Exp}\!\left(\tfrac{1}{2}\right), \qquad \Theta \sim \operatorname{Unif}[-\pi, \pi].$$
The proposed index is the integral of the squared distance between the density of (R, Θ) and the null hypothesis density, which is the product of the exponential and uniform densities,

$$G(\beta_1^TY, \beta_2^TY) \equiv \int_{-\pi}^{\pi}\int_0^{\infty}\left[p_{R,\Theta}(u, v) - f_R(u)f_\Theta(v)\right]^2 du\,dv. \qquad [1.20]$$
The density p_{R,Θ} is expanded as the tensor product of two sets of orthogonal polynomials chosen specifically for their weight functions and rotational properties. The weight functions must match the densities f_R and f_Θ in order to utilize the orthogonality relationships, and the Θ portion of the expansion must result in rotational invariance. By definition, R is not affected by rotation.

Friedman (1987) uses Legendre polynomials for both Cartesian coordinates as his two density functions are identical (Unif[−1, 1]), the Legendre weight function is Lebesgue measure, and his algorithm does not require rotational invariance. Hall (1989) considered Hermite polynomials and developed a one dimensional index. The following discussion combines aspects of the two authors' approaches. Throughout the discussion, i, j, and k are integers.
The set of polynomials for the R portion is the Laguerre polynomials which are defined on the interval [0, ∞) with weight function W(u) = e^{−u}. The polynomials are

$$L_0(u) = 1$$
$$L_1(u) = u - 1$$
$$L_i(u) = (u - 2i + 1)L_{i-1}(u) - (i - 1)^2 L_{i-2}(u). \qquad [1.21]$$

The associated Laguerre functions are defined as

$$l_i(u) \equiv L_i(u)\,e^{-u/2}.$$

The orthogonality relationships between the polynomials are

$$\int_0^{\infty} l_i(u)\,l_j(u)\,du = \delta_{ij}, \qquad i, j \ge 0$$

where δ_ij is the Kronecker delta function.
Any piecewise smooth function f : R → R may be expanded in terms of these polynomials as

$$f(u) = \sum_{i=0}^{\infty} a_i\,l_i(u)$$

where the a_i are the Laguerre coefficients

$$a_i \equiv \int_0^{\infty} L_i(u)\,e^{-u/2}f(u)\,du.$$

The smoothness property of f means the function has piecewise continuous first derivatives and the Fourier series converges pointwise. If a random variable W has density f, the coefficients can be written as

$$a_i = E_f[l_i(W)].$$
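The associated Laguerre functions and their orthogonality can be checked numerically. The sketch below uses SciPy's standard normalized Laguerre polynomials, which satisfy the orthonormality relationship stated above; the recurrence printed in [1.21] generates a rescaled variant of the same family:

    import numpy as np
    from scipy.special import eval_laguerre

    def laguerre_fn(i, u):
        # associated Laguerre function l_i(u) = L_i(u) exp(-u/2)
        return eval_laguerre(i, u) * np.exp(-u / 2.0)

    # quadrature check of the orthogonality relationship on [0, infinity)
    u = np.linspace(0.0, 80.0, 400001)
    for i in range(3):
        for j in range(3):
            val = np.trapz(laguerre_fn(i, u) * laguerre_fn(j, u), u)
            print(i, j, round(val, 4))           # approximately delta_ij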
The Θ portion of the index is expanded in terms of sines and cosines. Any piecewise smooth function f : ℝ → ℝ may be written as

  f(v) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \left[ a_k \cos(kv) + b_k \sin(kv) \right]

with pointwise convergence.
The orthogonality relationships between these trigonometric functions are

  \int_{-\pi}^{\pi} \cos^2(kv) \, dv = \int_{-\pi}^{\pi} \sin^2(kv) \, dv = \pi , \qquad k \ge 1
  \int_{-\pi}^{\pi} \cos(kv) \sin(jv) \, dv = 0 , \qquad k \ge 0 \text{ and } j \ge 1
  \int_{-\pi}^{\pi} dv = 2\pi .

The a_k and b_k are the Fourier coefficients

  a_k \equiv \frac{1}{\pi} \int_{-\pi}^{\pi} \cos(kv) f(v) \, dv = \frac{1}{\pi} E_f[\cos(kW)]
  b_k \equiv \frac{1}{\pi} \int_{-\pi}^{\pi} \sin(kv) f(v) \, dv = \frac{1}{\pi} E_f[\sin(kW)] .
The density p_{R,Θ} can be expanded via a tensor product of the two sets of polynomials as

  p_{R,\Theta}(u, v) = \sum_{i=0}^{\infty} \ell_i(u) \left( \frac{a_{i0}}{2} + \sum_{k=1}^{\infty} \left[ a_{ik} \cos(kv) + b_{ik} \sin(kv) \right] \right) .

The a_{ik} and b_{ik} are the coefficients defined as

  a_{ik} \equiv \frac{1}{\pi} E_p[\ell_i(R) \cos(k\Theta)] , \qquad i, k \ge 0
  b_{ik} \equiv \frac{1}{\pi} E_p[\ell_i(R) \sin(k\Theta)] , \qquad i \ge 0 \text{ and } k \ge 1 .
The null distribution, which is the product of the exponential density and the uniform density over [-π, π], is

  f_R(u) f_\Theta(v) = \left( \frac{e^{-u/2}}{2} \right) \left( \frac{1}{2\pi} \right) = \frac{1}{4\pi} \, \ell_0(u) .
Substituting the expansion into [1.20], the further condition that p_{R,Θ} is square-integrable, and subsequent multiplication, integration, and use of the orthogonality relationships show that the index G(β₁ᵀY, β₂ᵀY) equals

  \frac{\pi}{2} \sum_{i=0}^{\infty} a_{i0}^2 + \pi \sum_{i=0}^{\infty} \sum_{k=1}^{\infty} \left( a_{ik}^2 + b_{ik}^2 \right) - \frac{a_{00}}{2} + \frac{1}{8\pi} .   [1.22]

Maximizing [1.22] is equivalent to maximizing the Fourier index defined as

  G_F(\beta_1^T Y, \beta_2^T Y) = \pi \, G(\beta_1^T Y, \beta_2^T Y) - \frac{1}{8} .

The definitions of the coefficients in G_F yield

  G_F(\beta_1^T Y, \beta_2^T Y) = \frac{1}{2} \sum_{i=0}^{\infty} E_p^2[\ell_i(R)]
    + \sum_{i=0}^{\infty} \sum_{k=1}^{\infty} \left( E_p^2[\ell_i(R) \cos(k\Theta)] + E_p^2[\ell_i(R) \sin(k\Theta)] \right)
    - \frac{1}{2} E_p[\ell_0(R)] .   [1.23]
This index closely resembles the form of G_L in Friedman (1987). The extra i = 0 term in the first sum and the last subtracted term appear since the weight function is the exponential instead of Lebesgue measure as it is for Legendre polynomials. In application, each sum is truncated at the same fixed value and the expected values are approximated by the sample moments taken over the data points. For example, E_p[ℓ_i(R) cos(kΘ)] is approximated by

  \frac{1}{n} \sum_{j=1}^{n} \ell_i(r_j) \cos(k\theta_j)

where r_j and θ_j are the radius squared and angle for the projected jth observation.
The Fourier index is rotationally invariant. Suppose the projected points are spun by an angle of τ. The radius squared R is unaffected by the shift, so the first sum and final term in [1.23] do not change. The sine and cosine of the new angle are

  \cos(\Theta + \tau) = \cos\Theta \cos\tau - \sin\Theta \sin\tau
  \sin(\Theta + \tau) = \sin\Theta \cos\tau + \cos\Theta \sin\tau .

Each component in the second term of [1.23] is

  E_p^2[\ell_i(R) \cos(k(\Theta + \tau))] + E_p^2[\ell_i(R) \sin(k(\Theta + \tau))]
    = \cos^2(k\tau) E_p^2[\ell_i(R) \cos(k\Theta)] + \sin^2(k\tau) E_p^2[\ell_i(R) \sin(k\Theta)]
      - 2 \sin(k\tau) \cos(k\tau) E_p[\ell_i(R) \sin(k\Theta)] \, E_p[\ell_i(R) \cos(k\Theta)]
    + \cos^2(k\tau) E_p^2[\ell_i(R) \sin(k\Theta)] + \sin^2(k\tau) E_p^2[\ell_i(R) \cos(k\Theta)]
      + 2 \sin(k\tau) \cos(k\tau) E_p[\ell_i(R) \sin(k\Theta)] \, E_p[\ell_i(R) \cos(k\Theta)]
    = E_p^2[\ell_i(R) \cos(k\Theta)] + E_p^2[\ell_i(R) \sin(k\Theta)]   [1.24]

and the index value is not affected by the rotation. The truncated version of the index also has this property as it is true for each component. Moreover, replacing the expected value by the sample mean does not affect [1.24], so the sample truncated version of the index is rotationally invariant.
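A small sketch of the sample truncated index [1.23] follows, reusing laguerre_functions from the sketch above; the truncation points imax and kmax are arbitrary choices, not values from the thesis. The final lines check the rotational invariance just derived.

```python
import numpy as np

def fourier_index(x, y, imax=4, kmax=4):
    """Sample truncated Fourier index [1.23] for projected sphered data.
    x, y are the two projection coordinates; r is the radius *squared*
    and theta the polar angle."""
    r = x**2 + y**2
    theta = np.arctan2(y, x)
    ell = laguerre_functions(r, imax)          # from the sketch above
    gf = 0.5 * (ell.mean(axis=1) ** 2).sum() - 0.5 * ell[0].mean()
    for i in range(imax + 1):
        for k in range(1, kmax + 1):
            gf += (ell[i] * np.cos(k * theta)).mean() ** 2
            gf += (ell[i] * np.sin(k * theta)).mean() ** 2
    return gf

# Spinning the projected points by tau leaves the sample index unchanged,
# exactly as shown component by component in [1.24].
rng = np.random.default_rng(1)
x, y = rng.standard_normal((2, 500))
tau = 0.7
xr = np.cos(tau) * x - np.sin(tau) * y
yr = np.sin(tau) * x + np.cos(tau) * y
print(np.isclose(fourier_index(x, y), fourier_index(xr, yr)))
```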
Hall (1989) proposes a one dimensional Hermite function projection index. The Hermite weight function of e^{-u²/2} helps bound his index for heavy-tailed densities. In addition, he addresses the question of how many terms to include in the truncated version of the index. Similarly, the asymptotic behavior of G_L is being investigated at present. The Fourier index consists of the bounded sine and cosine Fourier contribution, and of the Laguerre function portion which is weighted by e^{-u/2}. The class of densities for which this index is finite must be determined, as well as the number of terms needed in the sample version.
The Fourier G_F and Legendre G_L indices are based on comparing the density of the projection with the null hypothesis density. Jones and Sibson (1987) define a rotationally invariant index based on comparing cumulants. Their index tends to equate structure with outliers while the density indices G_F and G_L tend to find clusters. A possible future rotationally invariant index could be based on Radon transforms (Donoho and Johnstone 1989).

1.4.3 Projection Axes Restriction

The algorithm searches through the possible planes with the weighted objective function [1.9] as a criterion. Every plane has a single projection index value associated with it since G_F is rotationally invariant. Ideally, each plane would also have a single possible representation: the most interpretable one as measured by S_v. Unfortunately, the optimal representation of a given plane cannot be solved for analytically. However, as the weight on simplicity (λ) is increased, the algorithm tends to represent each plane most simply.
In order to help the algorithm find the most interpretable representation of a plane, the constraints on the linear combinations (β₁, β₂) must be changed. Recall that in the original algorithm, the correlation constraint [1.6] is imposed on the linear combinations (β₁, β₂) which define the solution plane. This constraint translates into an orthogonality constraint for the linear combinations (α₁, α₂) which define the solution plane in the sphered data space. However, simplicity is measured for the two unsphered combinations (β₁, β₂) and is maximized when the two vectors are (±e_k, ±e_l), k ≠ l. These maximizing combinations are orthogonal in the original data space and correspond to the variable k and variable l axes. Unless two variables are uncorrelated, the maximum cannot be achieved by a pair of uncorrelated combinations.

To ensure that the algorithm can find a maximum, given any plane defined by the pair (β₁, β₂) which satisfies the correlation constraint, the simplicity of the plane is calculated after the linear combinations have been translated in the plane until the two vectors are orthogonal. The optimal translation is not known, so it is done in the following manner. Without loss of generality, β₁ is fixed and β₂ is spun in the plane. The spinning is done by projecting β₂ onto β₁ and taking the remainder. That is, β₂ is decomposed into the sum of a component which is parallel to β₁ and a component which is orthogonal to β₁. The new β₂, which is called β₂', is the latter component. Mathematically,

  \beta_2' = \beta_2 - \frac{\beta_2^T \beta_1}{\beta_1^T \beta_1} \, \beta_1 .   [1.25]

Whether S_v(β₁, β₂') is always greater than or equal to S_v(β₁, β₂) is not clear. However, as noted above, the maximum value of the index can be achieved given this translation.
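The translation [1.25] is one line of linear algebra; a minimal sketch, with an illustrative function name:

```python
import numpy as np

def orthogonalize(beta1, beta2):
    """Translation [1.25]: decompose beta2 into components parallel and
    orthogonal to beta1 and keep the orthogonal remainder."""
    beta1 = np.asarray(beta1, dtype=float)
    beta2 = np.asarray(beta2, dtype=float)
    parallel = (beta2 @ beta1) / (beta1 @ beta1) * beta1
    return beta2 - parallel

beta1 = np.array([0.85, 0.52, 0.00])
beta2 = np.array([0.30, 0.90, 0.10])
beta2_perp = orthogonalize(beta1, beta2)
print(beta1 @ beta2_perp)   # ~0: the pair is now orthogonal
```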
As an added bonus, the two combinations (β₁, β₂') are orthogonal in the original variable space, which is the usual graphic reference frame. This situation is in contrast to the original exploratory projection pursuit algorithm, which graphs with respect to the covariance metric as noted in Section 1.1.4. In effect, a further subjective simplification has been made as the visual representation in the solution is more interpretable.

The final solution for any particular λ value is reported as the normed, translated set of vectors

  \left( \frac{\beta_1}{\|\beta_1\|} , \; \frac{\beta_2'}{\|\beta_2'\|} \right) .   [1.26]
Throughout the rest of this thesis, the orthogonal translation is assumed and (β₁, β₂') is written (β₁, β₂). As a result, whenever S_v(β₁, β₂) is referred to, the actual value is S_v(β₁, β₂'), and [1.18], evaluated at the translated pair, becomes the reported simplicity measure [1.27].

1.4.4 Comparison with Factor Analysis

Given that factor analysis is used to motivate the interpretability index, comparison with this method is warranted. The two dimensional projection X = (X₁, X₂)ᵀ is defined as

  X \equiv B Y

where B is the 2 × p linear combination matrix and the observed variables are Y = (Y₁, Y₂, …, Y_p)ᵀ. This definition is similar to the one for principal components, except that in the latter case usually all p principal components are found so that the dimension of B is p × p. In addition, as was remarked in Section 1.1.1, principal components maximize a different index.

The first attempt to simplify the linear combination matrix B, due to Friedman (1987) and discussed in Section 1.1.4, involved rigidly spinning the projected points (X₁, X₂) in the solution plane. The new set of points is QX = QBY, where Q is a two dimensional orthogonal rotation matrix as in [1.19]. Since the rotation is rigid, it maintains the correlation constraint. Thus simplification through spinning in the plane is achieved by multiplying the linear combination matrix B by Q.
The analogous p variable factor analysis model is

  Y = \Omega f + \epsilon .

In this model, there are assumed to be two unknown underlying factors f = (f₁, f₂)ᵀ, the factor loading matrix Ω is p × 2, and ε is a p × 1 error vector. The factor loading matrix is found by seeking to explain the covariance structure of (Y₁, Y₂, …, Y_p) given certain distributional assumptions. Due to the non-uniqueness of the model, the factors can be orthogonally rotated, producing a new factor-loading matrix without changing the explanatory power of the model. A rotation is made in order to simplify the factor-loading matrix to a more interpretable solution. Taking the transpose of the new factor-loading matrix produces QΩᵀ, a 2 × p matrix which is comparable in dimensionality to the new exploratory projection pursuit linear combination matrix, QB. Thus, spinning the linear combinations is analogous to simplifying the factor-loading matrix in a two factor model. A transpose is taken since the factor analysis linear combinations are of the underlying factors, while the exploratory projection pursuit combinations are of the observed variables. This comparison is analogous to that between factor analysis and principal components analysis.

Interpretable exploratory projection pursuit involves rocking the solution plane. In this case, the linear combinations (β₁, β₂) are moved in ℝᵖ subject only to the correlation constraint. They are then translated to orthogonality via [1.25] to further increase interpretability. The more interpretable coefficients may not be linear combinations of the original ones. Such a move is not allowed in the factor-analysis setting. The interpretable data analysis method may be thought of as a looser, less restrictive version of factor analysis rotation.
1.4.5 The Optimization Procedure

The interpretable exploratory projection pursuit objective function is

  \max_{\beta_1, \beta_2} \; (1 - \lambda) \, \frac{G_F(\beta_1^T Y, \beta_2^T Y)}{\max G_F} + \lambda \, S_v(\beta_1, \beta_2)   [1.28]

where S_v(β₁, β₂) is calculated after the translation [1.25]. The computer algorithm employed is similar to that of the original exploratory projection pursuit algorithm described in Section 1.1.2. The algorithm is outlined below and then comments are made on the specific steps.
0. Sphere the data [1.2].
1. Conduct a coarse search to find a starting point to solve the original problem [1.1] (or [1.28] with λ = 0).
2. Use an accurate derivative based optimization procedure to find the most structured plane.
3. Spin the solution vectors in the optimal plane to the most interpretable representation. Call this solution P₀.
4. Decide on a sequence (λ₁, λ₂, …, λ_I) of interpretability parameter values.
5. Use a derivative based optimization procedure to solve [1.28] with λ = λᵢ and starting plane Pᵢ₋₁. Call the new solution plane Pᵢ.
6. If i = I, EXIT. Otherwise, set i = i + 1 and GOTO 5.
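Steps 3 through 6 amount to a continuation scheme in λ, warm-starting each weighted problem [1.28] at the previous solution. The sketch below is schematic: optimize, project_index, and simplicity are hypothetical stand-ins for the NPSOL-based optimizer and the indices G_F and S_v, not routines from the thesis.

```python
def interpretable_epp(data, lambdas, project_index, simplicity, optimize):
    """Skeleton of Steps 3-6: solve the lambda = 0 problem, then warm-start
    each weighted problem [1.28] at the previous solution plane."""
    plane = optimize(data, objective=project_index)   # most structured plane
    g_max = project_index(*plane)                     # max G_F, norms [1.28]
    solutions = [plane]
    for lam in lambdas:                               # e.g. (0.0, 0.1, ..., 1.0)
        def objective(b1, b2, lam=lam):
            return ((1 - lam) * project_index(b1, b2) / g_max
                    + lam * simplicity(b1, b2))
        plane = optimize(data, objective=objective, start=plane)
        solutions.append(plane)
    return solutions
```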
The search for the best plane is performed in the sphered data space, as discussed in Section 1.1.2. However, an important note is that the interpretability of the plane must always be calculated in terms of the original variables. The βᵢ combinations, not the αᵢ combinations, are the ones the statistician sees. In fact, she is unaware of the sphering, which is just a computational shortcut behind the scenes.
The modification does require one important difference in the sphering. In Friedman (1987), the suggestion is to consider only the first q sphered variables Z, where q < p and a considerable amount of the variance is explained. The dropping of the unimportant sphered variables is the same as the dropping of unimportant components in principal components analysis and reduces the computational work involved.
In Step 5, the interpretability gradients are calculated for (β₁, β₂) and then translated via the inverse of [1.5],

  \alpha_1 = D^{1/2} U^T \beta_1 , \qquad \alpha_2 = D^{1/2} U^T \beta_2 ,   [1.29]

to the sphered space. If the gradient components are nonzero only in the p - q dropped dimensions, they become zero after translation. The derivative-based optimization procedure then assumes it is at the maximum and stops, even though the maximum has not been reached. Thus, no reduction in dimension during sphering should be made.
The coarse search in Step 1 is done to ensure that the algorithm starts in the vicinity of a large local maximum. The procedure which Friedman (1987) employs is based on the axes in the sphered data space. He finds the most structured pair of axes and then takes large steps through the sphered space. Since the interpretability measure S_v is calculated for the original variable space, a feasible alternative might be to coarse step through the original rather than sphered data space. For example, the starting point could be the most structured pair of original axes. This pair of combinations is in fact the simplest possible. On the other hand, stepping evenly through the unsphered data space might not cover the data adequately as the points could be concentrated in some subspace due to covariance structure. Sphering solves this problem. In all data examples tried so far, the starting point did not have an effect on the final solution.
In Steps 2 and 5, the accurate optimization procedure used is the program NPSOL (Gill et al. 1986). This package solves nonlinear constrained optimization problems. The search direction at any step is the solution to a quadratic programming problem. The package is employed to solve for the sphered combinations (α₁, α₂) subject to the length and orthogonality constraints [1.4]. The gradients for the projection index G_F are straightforward. The interpretability index S_v derivatives are more difficult, as they involve translations from the uncorrelated to orthogonal combinations [1.25] and from the sphered to unsphered space [1.5]. These gradients are given in Appendix A. At present, work continues to design a steepest descent algorithm which maintains the constraints. However, given the complicated translations between the sphered and unsphered spaces, this problem is a difficult one. The package NPSOL is extremely powerful.
Step 3 can be performed in two ways. The initial pair of vectors can be spun discretely in the plane to the most interpretable representation, or Steps 5 and 6 can be run with λ equal to a very small value, say 0.01. This slight weight on simplicity does not overpower the desire for structure. The result is the most interpretable representation of the most structured plane. The plane is not permitted to rock but spinning is allowed. As noted previously, this spinning is similar to Friedman's (1987) simplification except that a two vector varimax interpretability index is used instead of a single vector one.

The initial value of the projection index G_F is used as max G in the denominator of the first term in [1.28]. However, the algorithm may be caught in a local maximum initially, and as the weight on simplicity is increased, the procedure may move to a larger maximum. Thus, the contribution of the projection index term may at some time be greater than one. This is an unexpected benefit of the interpretable projection pursuit approach: both structure and interpretability have been increased.
In the examples tried, the algorithm is not very sensitive to the λ sequence choice as long as the values are not too far apart. For example, the sequence (0.0, 0.1, …, 1.0) produces the same solutions as (0.0, 0.05, 0.1, …, 1.0), but the sequence (0.0, 0.25, 0.5, 1.0) does not.
Throughout the loop in Steps 5 and 6, the previous solution at λᵢ₋₁ is used as the starting point for the application of the algorithm with λᵢ. This approach is in the spirit of rocking the solution away from the original solution. Examples have shown that the objective is fairly smooth and a large gain in simplicity is made initially in turn for a small loss in structure.

As remarked in Section 1.1.4 when the example was considered, the data Y is usually standardized before analysis. The reported combinations are [1.26]. Thus the coefficients represent the relative importance of the variables in each combination. The next section consists of the analysis of an easy example followed by a return to the automobile example.
1.5 Examples

In this section, two examples of interpretable exploratory projection pursuit are examined. The first is an example of the one dimensional algorithm, while the second is the two dimensional algorithm applied to the automobile data analyzed in Section 1.1.3. Several implementation issues are discussed at the end of the section.
1.5.1 An Easy Example

The simulated data in this example consists of 200 (n) points in two (p) dimensions. The horizontal and vertical coordinates are independent and normally distributed with means of zero and variances of one and nine respectively. The data is spun about the origin through an angle of thirty degrees. Three is subtracted from each coordinate of the first fifty points and three is added to each coordinate of the remaining points. The data appear in Fig. 1.5.

Since the data is only in two dimensions and can be viewed completely in a scatterplot, using interpretable exploratory projection pursuit in this instance is quite contrived. However, this exercise is useful in helping understand the way the procedure works and its outcome.
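For concreteness, the simulated data can be generated as follows (a sketch assuming NumPy; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Independent normal coordinates with variances 1 and 9.
data = rng.standard_normal((n, 2)) * np.array([1.0, 3.0])
# Spin about the origin through thirty degrees.
t = np.pi / 6
rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
data = data @ rot.T
# Subtract 3 from each coordinate of the first fifty points and add 3
# to each coordinate of the remaining points.
data[:50] -= 3.0
data[50:] += 3.0
```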
Fig. 1.5  Simulated data with n = 200 and p = 2.

The algorithm is run on the data using the varimax index S₁ for a single vector and the Legendre index G_L. Rotational invariance and interpretability index modifications are irrelevant in this situation since a one dimensional solution is sought. The values of the simplicity parameter λ for which the solutions are found are (0, 0.1, 0.2, …, 1.0).
The most structured line through the data should be about thirty degrees, or the first principal component of the data. When projected onto this line, the observations are split into two groups. The most interpretable lines in the entire space, ℝ² in this case, are the horizontal and vertical axes. The algorithm should move the solution line toward the more structured of these two axes. From Fig. 1.5, the horizontal axis exhibits the most structure when the data is projected onto it. If the data is projected onto the vertical axis, the two clusters merge into one. In the horizontal projection, the two groups only overlap slightly.
Fig. 1.6  Projection and interpretability indices versus λ for the simulated data. The projection index values are normalized by the λ = 0 value and are joined by the solid line. The simplicity index values are joined by the dashed line.
Fig. 1.6 shows the values of the projection and interpretability indices versus the values of λ. The projection index begins at 1.0, as it is normed, and moves downward to about 0.3 as the weight on simplicity increases. The simplicity index begins at about 0.2 and increases to 1.0.

In addition to this graph, the statistician should view the various projections. Fig. 1.7 shows the projection histograms, values of the indices, and the linear combinations for four chosen values of λ. The histogram binwidths are calculated as twice the interquartile range divided by n^{1/3}, and the histograms are normed to have area one.
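The binwidth rule is mechanical; a one-function sketch:

```python
import numpy as np

def binwidth(x):
    """Histogram binwidth used for Fig. 1.7: twice the interquartile
    range divided by the cube root of the sample size."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) / len(x) ** (1.0 / 3.0)
```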
Fig. 1.7  Projected simulated data histograms for various values of λ. The values of the indices and combinations are

  λ = 0.0,  G_L = 1.00,  S₁ = 0.21,  β = (0.85, 0.52)ᵀ
  λ = 0.3,  G_L = 0.93,  S₁ = 0.47,  β = (0.92, 0.47)ᵀ
  λ = 0.5,  G_L = 0.78,  S₁ = 0.71,  β = (0.97, 0.24)ᵀ
  λ = 1.0,  G_L = 0.27,  S₁ = 1.00,  β = (1.00, 0.00)ᵀ
As predicted, the algorithm moves from an angle of about thirty degrees to the horizontal axis. As the weight on simplicity is increased, the loss of structure is evident in the merging of the two groups. In the final histogram, the two groups are overlapping. Also important to note is the comparison between the first and second histograms. A loss of 7% in the value of the projection index is traded for a gain in simplicity of 0.26. However, the two histograms are virtually indistinguishable to the eye.

1.5.2 The Automobile Example

The automobile data discussed in Section 1.1.3 is re-examined using two dimensional interpretable exploratory projection pursuit. Recall that this data consists of 392 (n) observations of ten (p) variables. The interpretable exploratory projection pursuit analysis uses the Fourier G_F projection index instead of the Legendre G_L index due to the desire for rotational invariance, as discussed in Section 1.4.1. Naturally, this choice results in a different starting point for the algorithm, the λ = 0 or P₀ plane in the notation of Section 1.4.5. Comparison of these two solutions demonstrates the lack of rotational invariance of the Legendre projection index.

Fig. 1.8 shows the P₀ solution using the Fourier index. The dashed axes are (β₁, β₂), which are orthogonal in the original variable space and are the simplest representation of the plane. The corresponding P₀ solution for the Legendre index is shown in Fig. 1.9. This plot is the same as Fig. 1.1 but with the same limits as Fig. 1.8 for comparison purposes. In Fig. 1.9, the dashed axes are (β₁, β₂), which maintain the correlation constraint [1.6] and are orthogonal in the covariance metric. However, rigidly spinning these axes through different angles produces lower index values; the Legendre index changes as shown in Table 1.1. With imagination, the new marginals after a rotation are less clustered and therefore G_L is reduced.
The value of the Fourier index G_F for the projection in Fig. 1.8 is 0.21 while the value for the projection in Fig. 1.9 is 0.19. The Legendre index G_L value for the second is 0.35 as previously reported. For the projection in Fig. 1.8, the Legendre index varies from 0.19 to 0.21, depending on the orientation of the axes.
Fig. 1.8  Most structured projection scatterplot of the automobile data according to the Fourier index. The dashed axes are the solution combinations which define the plane.
Both find projections which exhibit clustering into two groups. If the Legendre index combinations are translated to an orthogonal pair via [1.25], they become

  β₁' = (-0.21, -0.03, -0.91, 0.16, 0.30, -0.05, -0.01, 0.03, 0.00, -0.02)ᵀ
  β₂' = (-0.80, -0.14,  0.27, 0.49, -0.02, 0.03, -0.16, -0.02, 0.02, -0.01)ᵀ

and have interpretability measure S_v(β₁', β₂') = 0.50. Of course, this may not be the simplest representation of the plane, though the orthogonalizing transformation may help.
Fig. 1.9  Most structured projection scatterplot of the automobile data according to the Legendre index. The dashed axes are the solution combinations which define the plane, maximize G_L, and are orthogonal in the covariance metric.

The Fourier index combinations originally are

  β₁ = (-0.06, -0.22, -0.79, -0.44,  0.34, -0.05,  0.02, -0.14, -0.05,  0.04)ᵀ
  β₂ = ( 0.00,  0.11, -0.53,  0.82, -0.03,  0.05, -0.01,  0.16,  0.04, -0.06)ᵀ

and have interpretability measure 0.17. These axes are spun to the simplest representation as discussed in Section 1.4.5 and orthogonalized. The resulting axes are shown in Fig. 1.8 and the combinations are

  β₁ = (-0.03, -0.12, -0.95, -0.01,  0.29, -0.02,  0.00, -0.03, -0.02,  0.00)ᵀ
  β₂ = (-0.02,  0.16, -0.09,  0.94, -0.19,  0.09, -0.01,  0.18,  0.07, -0.06)ᵀ

with interpretability measure 0.80.
Fig. 1.10 shows the values of the interpretability and Fourier indices for simplicity values λ = (0.0, 0.1, …, 1.0), analogous to Fig. 1.6 for the simulated data. This example demonstrates that the projection index may increase as the interpretability index does, possibly because the algorithm gets bumped out of a projection local maximum as it moves toward a simpler solution.

Fig. 1.10  Projection and interpretability indices versus λ for the automobile data. The projection index values are normalized by the λ = 0 value and are joined by the solid line. The simplicity index values are joined by the dashed line.
Given this plot, the statistician may then choose to view several of the solution planes for specific λ values. Six solutions are shown in Fig. 1.11. The actual values of the combinations are given in Table 1.2. If coefficients which are less than 0.05 in absolute value are replaced by -, Table 1.3 results. In some sense, this action of discarding 'small' coefficients is contrary to the continuous nature of the interpretability measure. However, the second table shows the rate at which coefficients decrease. Due to the squaring in the index S_v, the coefficients move quickly to one but slowly to zero. This convergence problem suggests investigating the use of a power less than two in the index in the future, as discussed in Section 3.4.3. The table also demonstrates the global simplicity of the two combinations: if one variable is in a combination, it usually is absent in the other.
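The display rule of Table 1.3 is easily mechanized; a sketch, using the λ = 0.8 first combination from Table 1.2 as input:

```python
def abbreviate(beta, cutoff=0.05):
    """Table 1.3 rule: replace any coefficient smaller than `cutoff`
    in absolute value by '-'."""
    return ["-" if abs(b) < cutoff else f"{b:.2f}" for b in beta]

beta1 = [0.02, -0.04, -0.99, 0.01, 0.09, 0.04, 0.02, 0.02, -0.01, 0.00]
print(abbreviate(beta1))
```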
Fig. 1.11  Projected automobile data scatterplots for various values of λ. The values of the indices can be seen in Fig. 1.10 and the combinations are given in Table 1.2.
Table 1.2
Linear combinations for the automobile data. The linear combinations are given for the range of λ values. The first row for each λ value is β₁ᵀ and the second is β₂ᵀ.

  λ = 0.4    0.00  -0.09  -0.97   0.06   0.22  -0.01   0.02  -0.02  -0.02   0.00
            -0.06   0.15   0.00   0.95  -0.15   0.08  -0.03   0.19   0.07  -0.06
  λ = 0.5    0.00  -0.08  -0.98   0.04   0.16  -0.02   0.03  -0.02  -0.02  -0.01
            -0.03   0.12   0.01   0.97  -0.11   0.12  -0.03   0.14   0.06  -0.04
  λ = 0.6    0.01  -0.05  -0.99   0.01   0.12  -0.01   0.04   0.00  -0.01  -0.01
            -0.03   0.06   0.00   0.98  -0.07   0.11  -0.03   0.11   0.05  -0.03
  λ = 0.7    0.01  -0.05  -0.99   0.02   0.09  -0.01   0.05  -0.01  -0.02  -0.01
            -0.02   0.06   0.01   0.98  -0.04   0.10  -0.04   0.11   0.05  -0.03
  λ = 0.8    0.02  -0.04  -0.99   0.01   0.09   0.04   0.02   0.02  -0.01   0.00
            -0.02  -0.02   0.01   0.99  -0.02  -0.02   0.03   0.01   0.04  -0.09
  λ = 0.9    0.01  -0.03  -1.00   0.00   0.03   0.03   0.02   0.01   0.00  -0.01
             0.00  -0.01   0.00   1.00   0.00   0.01   0.03   0.02   0.03  -0.07
  λ = 1.0    0.00   0.00  -1.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
             0.00   0.00   0.00   1.00   0.00   0.00   0.00   0.00   0.00   0.00
Table 1.3
Abbreviated linear combinations for the automobile data. The linear combinations are given for the range of λ values as in Table 1.2. A - replaces any coefficient less than 0.05 in absolute value.

  λ = 0.4     -    -0.09  -0.97   0.06   0.22    -      -      -      -      -
            -0.06   0.15    -     0.95  -0.15   0.08    -     0.19   0.07  -0.06
  λ = 0.5     -    -0.08  -0.98    -     0.16    -      -      -      -      -
              -     0.12    -     0.97  -0.11   0.12    -     0.14   0.06    -
  λ = 0.6     -    -0.05  -0.99    -     0.12    -      -      -      -      -
              -     0.06    -     0.98  -0.07   0.11    -     0.11   0.05    -
  λ = 0.7     -    -0.05  -0.99    -     0.09    -     0.05    -      -      -
              -     0.06    -     0.98    -     0.10    -     0.11   0.05    -
  λ = 0.8     -      -    -0.99    -     0.09    -      -      -      -      -
              -      -      -     0.99    -      -      -      -      -    -0.09
  λ = 0.9     -      -    -1.00    -      -      -      -      -      -      -
              -      -      -     1.00    -      -      -      -      -    -0.07
  λ = 1.0     -      -    -1.00    -      -      -      -      -      -      -
              -      -      -     1.00    -      -      -      -      -      -
Fig. 1.12  Parameter trace plots for the automobile data. The values of the parameters are given for those variables whose coefficients are large enough. The parameter values (β₁ᵢ solid, β₂ᵢ dashed) are plotted versus λ.
Another useful diagnostic tool, fashioned after ridge regression trace graphs, is shown in Fig. 1.12. For variables whose coefficients are ≥ 0.10 in absolute value for any value of λ, the values of the coefficients in each combination are plotted versus λ.

One of the most interesting solution planes is the one for λ = 0.8. The combinations involve four variables which are split into two pairs, one in each combination. The first combination involves the negative of the third variable, engine size, and the fifth variable, automobile weight. The second combination involves engine power and the negative of the gaussianized tenth variable. Fig. 1.13 shows the solution plane with the type of car (Japanese or Non-Japanese) delineated by the plotting symbol; with imagination, the structure resembles the Japanese flag.

The statistician must decide when the tradeoff between simplicity and accuracy should be halted. In this example, the λ = 0.8 model is a good stopping place, especially since the projection index actually increased from the previous plane. However, the decision may be difficult. Given that this is an exploratory technique, no external measurement such as prediction error can be used to judge a plane's usefulness. Rather, the decision rests with the statistician.

No doubt this data has a wealth of planes which exhibit structure; two are evident in Figs. 1.8 and 1.9. The object is to find all such views. The statistician can employ the same procedure to find several interesting planes. Friedman (1987) presents a method for structure removal. He recursively applies a transformation which normalizes a structured plane, yet leaves other orthogonal planes undisturbed. After a plane is found, its structure is removed and the next plane is found and simplified as desired. Once the λ value and hence the particular plane have been chosen, the structure should be removed without disturbing other structured planes.
Fig. 1.13  Country of origin projection scatterplot of the automobile data for λ = 0.8. American and European cars are plotted as points, Japanese cars as asterisks.

The interpretability of a collection of solution planes could be considered. For example, two planes might be simple in relation to each other if they are orthogonal. However, since the structure removal procedure does not affect structure in orthogonal planes, in practice solution sets of planes tend to be orthogonal anyway.
Originally, exploratory projection pursuit was interactive, as mentioned in Section 1.1.1. After choosing three variables, the statistician used a joystick to rotate the point cloud in real time. A fourth dimension could be shown in color. The statistician could then pick out structure by eye. However, this task was time-consuming and only allowed combinations of four variables at best. The solution was to automate the structure identification by defining an index which measures structure mathematically and to use a computer optimizer. The statistician loses some of her individual choice in structure identification, but she can define her own index if desired.

Interpretable exploratory projection pursuit is an analogous automation of variable selection. The interpretability index is a mathematical measure which, coupled with a numerical optimization routine, takes the place of interactively choosing which three or four variables to view.
Chapter 2

Interpretable Projection Pursuit Regression

This chapter describes interpretable projection pursuit regression. It is similar in organization to the previous chapter, though condensed, as several issues common to both have been addressed previously. Section 2.1 deals with the original algorithm, reviewing the notation and strategy. In the second section, the modification is considered and the new algorithm is detailed. Due to the differing goals of exploratory projection pursuit and projection pursuit regression, the interpretability index must be changed slightly from that of Chapter 1. The final section consists of an example.
2.1 The Original Projection Pursuit Regression Technique

The original projection pursuit regression technique is presented in Friedman and Stuetzle (1981). Friedman (1984a, 1985) improves several algorithmic features and extends the approach to include classification and multiple response regression in addition to single response regression. In this chapter, only single response regression is considered.
2.1.1 Introduction

The easiest way to understand projection pursuit regression is to consider it as a generalization of ordinary linear regression. Many authors motivate projection pursuit regression in this manner, among them Efron (1988) and McDonald (1982). For ease of notation, suppose the means are removed from each of the predictors X = (X₁, X₂, …, X_p)ᵀ. The goal is to model the response Y as a linear function of the centered X. With the usual assumptions and slightly unfamiliar notation, the single response linear regression model may be written

  Y - E[Y] = w^T X + \epsilon   [2.1]

where ε is the random error term with zero mean. The vector w = (w₁, w₂, …, w_p)ᵀ consists of the regression coefficients.
In general, this linear function is estimated by the conditional expectation of Y given particular values of the predictors x = (x₁, x₂, …, x_p). The fitted value of Y is

  \hat{y}(x) = \hat{E}[Y] + \hat{w}^T x .   [2.2]

The expected value of Y is estimated by the sample mean. The expected L₂ distance between the true and fitted random variables is

  L_2(w, X, Y) \equiv E[Y - \hat{Y}]^2 .   [2.3]

The parameters w of the model are estimated by

  \min_{w} L_2(w, X, Y) .

In practice, the sample mean over the n data points replaces the population mean.
The model [2.1] may be written

  Y - E[Y] = \beta (\alpha^T X) + \epsilon   [2.4]

where β = (wᵀw)^{1/2} and α = (α₁, α₂, …, α_p)ᵀ is w normed. The rewritten model is analogous to [2.2]. The parameters β and α may be estimated as in [2.3] subject to the constraint that αᵀα = 1.

The model [2.4] shows that the fitted value of the response variable Y depends only on the projection of the predictor variables X onto the direction α. The relationship between the fitted values Ŷ and the projection αᵀX is a straight line. A natural generalization is to allow this relationship to vary. Projection pursuit regression does just that, allowing the fitted value to be a sum of univariate functions of projections which are smooth but otherwise unrestricted parametrically. The resulting projection pursuit regression model with m terms is

  Y - E[Y] = \sum_{j=1}^{m} \beta_j f_j(\alpha_j^T X) + \epsilon .   [2.5]

The linear combinations α_j which define the directions are restricted to have length one. In addition, the functions f_j are smooth and have zero mean and unit variance. The parameters β_j capture the variation of the smooths, the terms. In the usual way, the conditional mean is used to estimate the sum as

  \hat{y}(x) = \hat{E}[Y] + \sum_{j=1}^{m} \hat{\beta}_j \hat{f}_j(\hat{\alpha}_j^T x)

with the parameters estimated as in [2.3].

Analogous to exploratory projection pursuit, projection pursuit regression may be more successful than other nonlinear methods by working in a lower dimensional space (Huber 1985). The model [2.5] is useful when the underlying relationship between the response and predictors is nonlinear, versus ordinary linear regression [2.4], and when the relationship is smooth, as opposed to other nonlinear methods such as recursive partitioning. Diaconis and Shahshahani (1984) show that any function can be approximated by the model [2.5] for a large enough number of terms m. Substantial theoretical work remains to be done with respect to the properties of the method. In addition, the numerical aspects of the algorithm are difficult, as discussed in Section 2.2.3.
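Evaluating the fitted model [2.5] is mechanical once the parameters and smooths are in hand. A minimal sketch follows; passing the estimated curves as Python callables is an illustrative assumption, not the thesis's implementation.

```python
import numpy as np

def ppr_predict(X, y_mean, betas, alphas, smooths):
    """Fitted value for model [2.5]:
    y_hat(x) = E[Y] + sum_j beta_j f_j(alpha_j^T x).
    `smooths` holds the estimated univariate functions f_j."""
    yhat = np.full(len(X), y_mean)
    for beta, alpha, f in zip(betas, alphas, smooths):
        yhat += beta * f(X @ alpha)
    return yhat
```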
2.1.2 The Algorithm

The parameters β_j, α_j and the functions f_j are estimated by minimizing

  \min_{\beta_j, \alpha_j, f_j : j = 1, \ldots, m} L_2(\beta, \alpha, f, X, Y)   [2.6]

subject to α_jᵀα_j = 1, E[f_j] = 0 and Var[f_j] = 1 for j = 1, …, m. The criterion [2.6] cannot be minimized simultaneously for all the parameters. However, if certain ones are fixed, the optimal values of others are easily solved for. Friedman (1985) employs such an 'alternating' optimization strategy. His results are discussed in this section as they are pertinent when the modified algorithm is considered in Section 2.2.3.

First, he considers a specific term k. The problem [2.6] may be written

  \min_{\beta_k, \alpha_k, f_k} E[R_k - \beta_k f_k(\alpha_k^T X)]^2   [2.7]

where

  R_k \equiv Y - E[Y] - \sum_{j \ne k} \beta_j f_j(\alpha_j^T X) .

For the kth term, the three sets of parameters β_k, α_k, and f_k are estimated in turn while all others are held constant. After all elements of the kth term have been found, the next term is considered. The algorithm cycles through the terms in the model until the objective in [2.6] does not decrease sufficiently. The alternating strategy is discussed in more detail in Section 2.2.3.

The minimizing β_k is

  \hat{\beta}_k = \frac{E[R_k f_k(\alpha_k^T X)]}{E[f_k^2(\alpha_k^T X)]} .   [2.8]

The minimizing function f_k for any particular direction α_k is found using the nonparametric smoother discussed in Friedman (1984b). The resulting estimate is not expressed as a mathematical form, but rather as an estimated value for each observation point α_kᵀx. Usually the curve is standardized to satisfy the mean and variance constraints.

Minimizing the criterion [2.6] as a function of α_k is a least-squares problem, as seen in [2.7]. The minimizing α_k cannot be determined directly and requires an iterative search procedure. In applying optimization methods, the goal is to use as much information about the function to be minimized as possible. Thus, rather than use a method which only employs first derivatives, a method which also uses the actual or approximate Hessian is preferable. Usually the difficulty of accurately estimating the Hessian outweighs its additional properties. However, if the function to be minimized is of least-squares form, the Hessian simplifies and is easily approximated (Gill et al. 1981). Friedman (1985) capitalizes on this fact and uses the Gauss-Newton iterative procedure, designed specifically for least-squares problems, to find the optimal α_k.
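The cycling strategy can be sketched compactly. The following simplified backfitting-style loop holds the directions α_j fixed and alternates only f_k and β_k against the residuals R_k; smoother(z, r) is a stand-in for Friedman's (1984b) smoother, and the Gauss-Newton update of each α_k is deliberately omitted.

```python
import numpy as np

def alternating_fit(X, Y, alphas, smoother, passes=3):
    """Cycle through the m terms, refitting each against the residuals
    R_k of the other terms as in [2.7]."""
    m = len(alphas)
    Yc = Y - Y.mean()
    betas = np.zeros(m)
    fits = [np.zeros(len(Y)) for _ in range(m)]
    for _ in range(passes):
        for k in range(m):
            r = Yc - sum(betas[j] * fits[j] for j in range(m) if j != k)
            f = smoother(X @ alphas[k], r)      # estimate f_k on the projection
            f = (f - f.mean()) / f.std()        # mean zero, variance one
            fits[k] = f
            betas[k] = (r * f).mean()           # minimizing beta_k [2.8]
    return betas, fits
```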
2.1.3 Model Selection Strategy

Model selection in the original projection pursuit regression consists of choosing the number of terms m in the model. Interpretable projection pursuit regression considers not only the number of terms but also the interpretability of those terms.

Friedman (1985) suggests starting the algorithm with a large number of terms M. The procedure for finding this model is discussed in Section 2.2.3. A model with M - 1 terms is then determined using the M - 1 most important terms in the previous model as a starting point. The importance of a term k in a model of m terms is defined as

  I_k \equiv \frac{|\beta_k|}{\max_j |\beta_j|}   [2.10]

where max_j |β_j| is the maximum absolute parameter. Since the functions f_j are constrained to have variance one, |β_k| measures the contribution of the term to the model.
The statistician may then plot the value of the objective in [2.6], which is the model's residual sum of squares, versus the number of terms for each model. In most cases, the plot has an elbow shape. The usual advice is to choose that model closest to the tip of the elbow, where the increase in accuracy due to the additional term is not worth the increased complexity.

2.2 The Interpretable Projection Pursuit Regression Approach

A projection pursuit regression analysis produces a series of models due to the number of terms m = 1, …, M. The models' nonlinear components are the functions f_j. Since these functions are smooths and not functionally expressed, they are visually assessed by the statistician. Each is considered along with its associated combination α_j in order to understand what aspect of the data it represents. The parameters β_j measure the relative importance of the terms.
Each direction α_j must be considered in the context of the original variables. The collection of combinations is not subject to any global restriction such as the correlation constraint in exploratory projection pursuit. However, a variable selection method which causes the same subgroup of variables to appear in all the combinations is desirable for parsimony. As with any variable selection technique, a combinatorial approach to variable selection in projection pursuit regression is not feasible. Instead, as before, a weighted penalty criterion balancing the goodness-of-fit and interpretability is employed. The minimization criterion becomes

  \min_{\beta_j, \alpha_j, f_j : j = 1, \ldots, m} \; (1 - \lambda) \, \frac{L_2(\beta, \alpha, f, X, Y)}{L_2^{(0)}} - \lambda \, S(\alpha_1, \alpha_2, \ldots, \alpha_m)   [2.11]

for λ ∈ [0, 1], where L₂⁽⁰⁾ denotes the value of [2.6] at the original λ = 0 solution. Arguments similar to those in Section 1.2 show that the denominator in the first term causes both contributions to enter the index on comparable scales; the interpretability index S is subtracted since the objective is minimized. The simplicity index S measures the interpretability of the collection of combinations. Interpretability index choices are discussed in the next two subsections.

2.2.1 The Interpretability Index

Consider for the moment that the number of terms m is fixed. This assumption is discussed in Section 2.2.2. The goal is to define an interpretability index S for a group of m vectors (α₁, α₂, …, α_m). At first glance, the situation is a generalization of interpretable exploratory projection pursuit, where the simplicity of two vectors (β₁, β₂) is measured. However, in the latter case, the vectors define a plane. In relation to each other, the simplest pairs of combinations are ones which, when normed and squared, are orthogonal. This type of orthogonality is named squared orthogonality in Section 2.3.5. Measures of this squared orthogonality and each vector's individual interpretability enter the index S_v [1.18]. Within each combination, variables are selected via the single vector varimax index S₁ [1.12] by encouraging the vector coefficients to vary. Different variables are selected for either vector by forcing the combinations to exclude the same variables.
Projection pursuit regression has a different goal. First, each vector should be simple within itself. That is, homogeneity of the coefficients within a vector is penalized. The varimax index S₁ for one vector achieves this. However, in the interest of decreasing the total number of variables in the model, the same small subset of variables should be nonzero in each vector. Summing the simplicities of each direction as in [1.17] does not force the vectors to include the same variables necessarily. On the other hand, if the exploratory projection pursuit varimax measure [1.16] for q = m vectors is used, the vectors would be forced to squared orthogonality and would contain different variables. Though this old index may be used for projection pursuit regression if variable selection is the object, a new index must be developed to achieve the outcome desired here.

Consider the m × p matrix Ω of combinations whose rows are the directions α_jᵀ. Though the directions α_j are constrained to have length one in [2.6], the index is defined for all sets of combinations in general. The goal is that all the combinations are nonzero, or load, in the same columns so that the same variables are in all combinations. An index which forces this outcome is based on a summary vector γ of the matrix Ω whose components are

  \gamma_i \equiv \sum_{j=1}^{m} \frac{\alpha_{ji}^2}{\alpha_j^T \alpha_j} , \qquad i = 1, \ldots, p .   [2.12]

Each component γ_i is positive and contains the sum of each term's relative contribution to variable i, or the total column weight. The components of the new vector sum to m. The object is to force these components to be as varied as possible. For a single combination, this is achieved via the varimax interpretability index S₁ [1.12]. For γ appropriately normed, this measure is

  S_3(\alpha_1, \alpha_2, \ldots, \alpha_m) = \frac{p}{p-1} \sum_{i=1}^{p} \left( \frac{\gamma_i}{m} - \frac{1}{p} \right)^2 .   [2.13]

The index [2.13] forces the overall weights of the columns of Ω to be dissimilar; that is, it forces the dispersion of the total column weights taken over the rows of the matrix. This function is in contrast to S_v for interpretable exploratory projection pursuit [1.18], in which the cross-product over the terms is subtracted in order to force the squared orthogonality of the combinations, while S₁ measures the individual interpretability of each combination. The number of combinations, or terms, of the model depends on the goodness-of-fit criterion. An example using this index is discussed in Section 2.3.
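A numerical sketch of [2.12] and [2.13] follows; the normalizing constant in s3 is the one reconstructed above, chosen so that the index equals one when every term loads on a single common variable.

```python
import numpy as np

def gamma(alphas):
    """Summary vector [2.12]: total relative column weight of the
    m x p matrix of combinations (rows are the alpha_j)."""
    A = np.asarray(alphas, dtype=float)
    return ((A**2) / (A**2).sum(axis=1, keepdims=True)).sum(axis=0)

def s3(alphas):
    """Interpretability index [2.13]: varimax-style dispersion of the
    column weights gamma_i about their average m/p."""
    g = gamma(alphas)
    m, p = np.asarray(alphas).shape
    return p / (p - 1) * ((g / m - 1.0 / p) ** 2).sum()

# Both terms loading on the same single variable is simplest: s3 = 1.
print(s3(np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])))
```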
2.2.2 Attempts to Include the Number of Terms

In the discussion so far, the number of terms m in the model is fixed. The argument that a model with fewer terms is more interpretable than one with more is persuasive. However, on further reflection, this conclusion does not appear so certain. Each term involves its combination α_j and the smooth function f_j. The interpretability of the combination can be measured, but the function must be visually assessed.
Consider the following example with two variables X₁ and X₂ (p = 2). Without knowing the data context, ranking the two fitted models

  (1) m = 2 : f₁(X₁), f₂(X₂)
  (2) m = 1 : f(X₁ + X₂)

in order of interpretability is impossible. The first model involves two functions of one variable each, the simplest combinations. The second involves only one function, but of the most complex combination. Situations easily can be imagined in which one or the other model is clearly more interpretable. For example, if the two variables are distinct measurements of widely varying traits (apples vs. oranges), then a combination of the two would be difficult to understand and Model (1) would be preferable. However, if the two were aggregate measures of similar characteristics (reading and spelling scores) which could be combined easily into a single variable (verbal ability), the second model might be easier to interpret.

As with interpretable exploratory projection pursuit, interpretable projection pursuit regression can be envisioned as an interactive process in which the statistician is equipped with a dial. As the dial is turned, the weight on simplicity (λ) in [2.11] is increased and the model becomes more simple. If the number of terms is included in the measure of simplicity, the algorithm should drop terms as it simplifies the model. In the following discussion, three methods for including the number of terms in the interpretability index are considered. The number of terms to begin with is m, and this parameter is included in the index. Factors required to put the index values ∈ [0, 1] are ignored. Unfortunately, none of the resulting indices works in practice, for the reasons given below. However, considering ways to include the number of terms in the interpretability measure of a model is enlightening even if present attempts are unsuccessful.
Since the interpretability of a model decreases with the number of terms, the first attempt is to multiply the interpretability index [2.13] by 1/m, obtaining

  S_a(\alpha_1, \alpha_2, \ldots, \alpha_m) \equiv \frac{1}{m} \left[ \frac{p}{p-1} \sum_{i=1}^{p} \left( \frac{\gamma_i}{m} - \frac{1}{p} \right)^2 \right] .

The resulting index S_a decreases as the number of terms increases. However, this measure does not work in practice as each term's contribution is reweighted when the number of terms changes. Instead, the index should be such that submodels of the current model contribute the same to the index, regardless of the size of the complete model.
Each model can be thought of as a set of points in ℝᵖ; as terms are added, the number of points increases. Consider a distance interpretability index, as in Section 2.3.4, based on the distance from a set of points to a particular point. Since the object is that all the terms contain the same variables, a plausible interpretability index is

  S_b(\alpha_1, \alpha_2, \ldots, \alpha_m) \equiv - \min_{v \in V} \sum_{j=1}^{m} \| v_{\alpha_j} - v \|^2 .

As in Chapter 1, the set V is composed of simple vectors [1.14] and v_{α_j} is the squared and normed one dimensional version of the α_j term [1.13]. This index is similar to the one dimensional index [1.15]. The minimum is not taken of each individual term j but of the sum of distances, in order to ensure that all the terms simplify to the same interpretable vector v. If the set V consists of the e_l's (l = 1, …, p), S_b reduces to

  S_b = - \left( \sum_{i=1}^{p} \sum_{j=1}^{m} v_{\alpha_j i}^2 - 2\gamma_k + m \right)

where γ_k is the largest component of γ as defined in [2.12]. Unfortunately, even with the removal of the minimization, the derivatives of the given index are discontinuous.
Averaging the distances to all the possible vectors v_l produces a third index which is continuous. The power r must be chosen to be less than one so that the distance toward the closest v_l overpowers the others. Unfortunately, though continuous, and as opposed to S_b, this index does not force all the terms to collapse toward the same simple vector v_l.

In conclusion, none of the three attempts at incorporating the number of terms into the index works. A method for simultaneously measuring both the number of terms and the interpretability of the terms themselves has not been found yet. The outlook for an analogy with interpretable linear regression, in which the parameter λ completely controls the model selection process, is poor. Instead, λ controls the tradeoff between the interpretability of the terms and accuracy, and the strategy should be to explore the model space, rather than follow a strict one parameter selection procedure. How the statistician should compare these models is discussed in Section 2.3. The optimization procedure is described in the next subsection.

2.2.3 The Optimization Procedure

A method to find a sequence of interpretable models for different values of the interpretability parameter λ is to begin with a large number of models with number of terms m = 1, …, M. For each submodel, the following algorithm is employed. Steps 1 and 2 find the original projection pursuit regression model with m terms which minimizes [2.6]. Steps 3 and 4 find the sequence of models which minimize [2.11] for various λ values. Throughout the description, updating a parameter means noting the new value and basing subsequent dependent calculations on it. Moving a model means permanently changing its parameter values.

1. The objective function is [2.6]. Use the stagewise modeling procedure outlined in Friedman and Stuetzle (1981) to build the M term model.

2. Use a backwards stepwise approach to fit down to the m term model.
   For i = 1, …, M - m begin
     Rank the terms from most (term 1) to least important (term M - i + 1) as measured by [2.10]. Discard the least important term.
     Use an alternating procedure to minimize the remaining M - i terms.
     a. For k = 1, …, M - i begin
        Update the term's parameters β_k, α_k and curve f_k assuming the other terms are fixed. Choosing from among several steplengths, do a single Gauss-Newton step to find the new direction α_k. Complete the iteration by updating β_k and f_k using [2.8] and [2.9] respectively. Only one Gauss-Newton step is taken for each iteration due to the expense of a step. Continue iterating until the objective stops decreasing sufficiently.
     end
     b. If the objective decreased on the last complete loop through the terms (a), move the model. If the objective decreased sufficiently, perform another pass (GOTO a). Otherwise, the optimization of the M - i term model is complete.
   end

3. Let λ₀ = 0 and call the m term model resulting from Steps 1 and 2 the λ₀ model. Choose a sequence of interpretability parameters (λ₁, λ₂, …, λ_I). Let i = 1.

4. The objective function is [2.11]. Set λ = λᵢ and solve for the m term model using a forecasting alternating procedure and the λᵢ₋₁ model as the starting point.
   Reorder the terms in reverse order of importance [2.10], from least (term 1) to most important (term m).
   Make a move in the best direction possible.
   a. For k = 1, …, m begin
      Choosing from among several steplengths, update the α_k resulting from the best step in the steepest descent direction. Complete the iteration by updating β_k and f_k using [2.8] and [2.9] respectively. Only one steepest descent step is taken due to the expense of a step. Always perform at least one iteration and then continue iterating until the objective stops decreasing sufficiently.
   end
   b. If only one loop through the terms (a) has been completed, move the model regardless of whether the objective decreases or not and perform another pass (GOTO a). If more than one loop has been completed and the objective decreased, move the model. If more than one loop has been completed and the objective decreased sufficiently, perform another pass (GOTO a). Otherwise, the optimization of the m term model with interpretability parameter λᵢ is complete.
   c. If i = I, EXIT. Otherwise, let i = i + 1 and GOTO 4.
The forecasting alternating procedure in Step 4 differs from the alternating procedure in Step 2. In the latter case, determining the optimal α_k for a specific term reduced to a least-squares problem and had to be iterated; this problem could not be solved analytically, as noted in Section 2.1.2. The least-squares form lent itself to a special procedure (Gauss-Newton) which utilized the estimated Hessian. However, with the addition of the interpretability term, the objective [2.11] no longer has this form. Thus, a cruder optimization method must be used. Steepest descent is employed, and the step direction is the negative of the gradient of the objective. The gradients are given in Appendix A. This method is not as accurate as Gauss-Newton.

The Step 4 approach is also different in that an attempt is made to forecast the effect a simplification of one term will have on the others. In the original algorithm as described in Step 2, each term is considered separately, and the interaction between the terms is ignored for ease in calculation. However, when sacrificing accuracy for interpretability, scenarios exist in which a change in term one does not produce a decrease in the objective, but the change in term one and the resulting shift in term two together produce a decrease. Such a combination of moves is not considered in the original algorithm. Treating the terms separately does not work in interpretable projection pursuit regression because of the strong interaction among the terms' contributions to the simplicity index S₃ and the equality of those contributions in the solution. The increase in interpretability for an individual term must be very large before it changes on its own. However, once it changes, the other terms quickly follow suit, like a row of dominoes. The result is that if the Step 2 approach is used, the algorithm produces either very complex or very simple models but none in between.
Thus, a compromise is reached based on empirical evidence. The first change is due to the fact that all terms contribute equally to the interpretability measure, irrespective of their relative importance [2.10]. Thus, when considering a simplification of the model, that term which least affects the model fit should be considered first, as it does as well as any other at increasing interpretability. As a result, the terms are looked at in reverse order of importance (Step 4).

The second change is that the algorithm moves the entire set of m terms at once as opposed to one term at a time. A move is not evaluated until it is formed from a sequence of submoves, each of which is a shift of an individual term (Step 4a). These submoves are made in reverse order of goodness-of-fit importance and are made in the best direction possible, the steepest descent one.

The third change is that the algorithm always moves the model, even if the move appears to be a poor one. In some respects this jiggling of the model is a form of annealing (Lundy 1985). The minimum steplength considered in Step 4a is positive yet quite small, so large unwelcome increases in the objective are impossible.

Both the second and third changes are attempts at forecasting the effect of the simplification of one term on subsequent terms. The given algorithm works well in practice, as is demonstrated in the next section. However, the requirement that the algorithm always move means that occasionally the algorithm produces a clearly worse model. The algorithm quickly adjusts for the next value of λ. This behavior is seen in the following example.
2.3 The Air
Pollution
2.3 The Air
Pollution
The example
using additive
Page 73
Example
Example
in this section concerns air pollution.
models in Hastie and Tibshirani
and using alternating
conditional
expectations
(1985). It consists of 330 (n) o b servations
(p) independent
variables
each.
The data
(1984) and Buja et al. (1989),
(ACE)
in Breiman
and Friedman
of one response variable
The daily
is analyzed
observations
(Y) and nine
were recorded
in Los
Angeles in 1976. The variables are
Y : ozone concentration
x1 : Vandenburg 500 m illibar height
x2 : windspeed
x3 : humidity
x4 : Sandburg Air Force Base temperature
xg : inversion base height
x6 : Daggott pressure gradient
x7 : inversion base temperature
X8 : visibility
xg : day of the year
As in exploratory
projection
pursuit,
all of the variables are standardized
mean zero and variance one before projection
As suggested by Friedman
chosen to be nine (p). The algorithm
M by backstepping
model is measured
the introduction,
as the fraction
inaccuracy
respect to the data rather
fraction
regression is applied.
(1984a), the original
Section 2.2.3) is run for a large value of M
terms m = l,...,
pursuit
to have
algorithm
initially.
(Steps 1 and 2 in
For this example,
M
is
produces all the submodels with number of
from the largest.
of variance
it cannot
The inaccuracy
explain.
of each
As noted in
in this thesis denotes lack of fit as measured with
than to the population
in general.
From [2.6], this
is defined as
u
-
J52MdJJ)
k(Y)
*
The plot of the number of terms and fraction of unexplained variance of each
model is shown in Fig. 2.1.
2. Interpretable
Projection
Page 74
Pursuit Regression
m
Fig. 2.1
Fraction of unexplained variance U versus number of terms m
for the air pollution data. Slight ‘elbows’ are seen at m = 2,5,7.
Using the original
by weighing
approach,
the statistician
chooses the model from this plot
the increase in accuracy against the additional
chooses a model
complexity
of adding
at an ‘elbow’ in the plot,
where the
a term.
She generally
marginal
increase in accuracy due to the next term levels off. In some situations,
such an elbow may not exist or it may not be a good model choice in actuality.
Only one model for each number
model space is one dimensional
Interpretable
particular.
.
measure.
projection
number
of terms m is found.
pursuit
regression
expands
point for each simplicity
the model
the algorithm
space for a
by adding an interpretability
search for a given m is the model
shown in Fig. 2.1. Then for a sequence of X values which signify
weight on simplicity,
m, the
in U.
of terms m to two dimensions
The starting
For a particular
an increasing
cuts a path through the model plane [V X S,].
2.3 The Air Pollution
Lubinsky
Page 75
Example
(1988) d iscuss searching
and Pregibon
model space in general.
vides the structure
through
Their premise is that a formalization
a description
of this action pro-
by which such a search can be automated.
which
space characterization,
prehensive
than
agree that
two important
In their work, the latter
Their
is based on Mallows’ (1983) work,
the two dimensional
U and Sg summary
description
dimensions
concept is an extension
eters measure and is in the spirit
includes both ‘the conciseness of the description
descriptive
is-more
given above.
are accuracy
comThey
and parsimony.
of the usual number
of interpretability
or
of param-
as defined in this thesis.
It
and its usefulness in conveying
information.’
Initially
for this and other
quence is (O.O,O.l,.
. . , 1.0).
examples,
However,
the interpretability
the usual result
parameter
is a path
X se-
through
the
model space which consists of a few clumps of models separated by large breaks
in the path.
Even the forecasting
2.2.3 cannot completely
der to produce
with
curve.
eliminate
a smoother
additional
nature
described
in Section
these large hops between model groups.
path, the statistician
values of X specifically
For example,
of the algorithm
In or-
is advised to run the algorithm
chosen to produce
a more continuous
if on the first pass the path has a large hole between the
X = 0.3 and X = 0.4 models, the algorithm
should be run with
ues of (0.33,0.36,0.39).
twenty models are produced for each
valueof
m = l,...,
Using this strategy,
9. The actual
Various
diagnostic
through
X val-
X values used are not shown in the following
figures as their values are not important.
a guide for the algorithm
additional
The interpretability
parameter
is solely
the model space.
plots can be made of the collection
of models which
are
distinguished
by their m, U and Sg values. Chambers et al. (1983) provide several
possibilities.
Given that the number of terms variable m is discrete, a partitioning
approach is used. Partitioning
in a plot represents
S, and inaccuracy
plots are shown in Figs. 2.2 and
a model with
U.
Ideally,
the given number
of terms,
for the best comparison,
2.3. Each point
interpretability
these plots should
be
2. Interpretable
Projection
L,,,,
1.0 F
Pursuit
I,,,
,(I,
Page 76
Regression
I,,,
(,,It
+
ql,,,,,,,,,,,,
f ,,(,,,,
1
0.1
0.15
0.2
0.25
U,
‘III
1.0
‘III
0.3
0.35
0.1
0.15
0.2
m=l
III1
U,
III,
-
0.25
0.3
0.35
m=2
‘III9
45
4M4
0.6
-
0.6
r
0.4
-
0.2
F
4
4
cc
4
4
%
44
4bF
0.0
0.1
““I
0.15
““I
0.2
““I
U,
L,,,,
1.0 -
0.3
”
”
0.35
0.1
0.15
,4,:1,,
0.2
U,
, , I,,
, , I,,
0.25
0.3
0.35
0.3
0.35
, ,I
m=4
m
I,,,
0.6
I I I I 1,
m=3
4
vi”
0.25
““I
4
44
0.4
4
4
$
0.0
0.1
14 +
” ” “I’
0.15
0.2
U,
Fig. 2.2
-I
” ” I ““I
0.25
m=5
0.3
” ”
0.35
0.1
0.15
0.2
U,
0.25
m=6
Model paths for the air pollution data for models with number
oftermsm=l,...
,6. Each point indicates the interpretability
S, and fraction of unexplained variance U for a model with the
given number of terms.
2.3 The Air Pollution Example
L,,,,
1.0
,,I[
Page 77
,,,I
,,I,
y--
,,,,-I
I,,,
I,,,
,,I,
II,,
4’
44
II,,4
’
4
4-
4 88
0.8
-
0.6
-
$
4
ml
0.4
0.2
4
a0
44
:
:4jt
0.0
“I’
0.1
-
4
0.15
44
” ”
L,,,,
$4
4
;to
14
”
0.2
” ”
0.25
U,
1.0
4
$
I,,,
”
’ “I”
0.3
0.35
p8
0.1
0.15
m=7
,I,,
’
0.2
U,
‘I,,
II
III’
0.25
‘III-
0.3
0.35
m=8
,,,I-
‘4
-
7
+#
0.8
-
r/-Y 0.6
7
0.4
0.2
%
4
&
4
-
4
4
14
0.0
4
.“l’b”“““l”‘l”‘I”
0.1
0.15
0.2
0.25
U.
Fig. 2.8
0.35
Model paths for the air pollution data for models with number
Each point indicates the interpretability
of terms m = 7,8,9.
Ss and fraction of unexplained variance U for a model with the
given number of terms.
lined up side by side.
superficially
0.3
m=9
However,
note that though
Ss scales
appear to be the same for all the graphs, they are not as implicit
each plot is the number
of terms m. A symbolic
are graphed
in one plot of unexplained
a particular
graphing
the simplicity
the interpretability
variance
symbol for the number
scales are dependent
scatterplot
in
in which
all models
U versus interpretability
S, with
of terms, also obscures the fact that
on the number
of terms in the model.
2. Interpretable
Projection
Pursuit
As interpretability
increases, so does the inaccuracy
values of m , the path through
tially
Page 78
Regression
of a model.
the model is ‘elbow-shaped’,
a large gain in interpretability
of all possible models.
have the same inaccuracy.
that ini-
is made for a small increase in-inaccuracy.
The curves shift to the left as m increases, as the additional
overall inaccuracy
indicating
For most
terms decrease the
For all values of m , the X = 1 models
These models have all directions
parallel and equal to
an ej, so in effect they only have one term of one variable.
Due to the forecasting
is not always monotonic.
non-monotonic
is found.
m inimum.
in smaller interpretability
for m = 3, interpretability
value of X, rather
the number
The draftsman’s
the inaccuracy
out of a local
for a particular
are possible for a given
linear regression variable selection methods find
inaccuracy
for a fixed interpretability
display in Fig. 2.4 shows how the number
U and simplicity
simpler
versus the same interpretability
Ss vary.
Again,
than one with
note that
more, plotting
scale is m isleading.
value, which is
of terms m and
if a model
all the models
the models from which to choose if certain requirements
can define equivalence
with
This set of plots is useful
met, such. as U 5 0.20, Sg 2 0.50, or some combination
the statistician
The
of parameters.
fewer terms is considered
in determining
This type
values Ss E [0.2,0.4].
it helps describe the models which
the model which m inimizes
partitioning
readjusts.
does not find the global m inimum
number of terms m . In contrast,
usually
and larger inaccuracy,
poor move may be needed to force the algorithm
The algorithm
the model space
a clearly worse model as evidenced by a
usually on the next step, the algorithm
is evident
intermediate
employed, the path through
Occasionally,
move resulting
However,
of behavior
algorithm
thereof.
must be
More formally,
classes of models from the draftsman
and
plots consisting of sets of models with different number of terms
which satisfy certain
inaccuracy
and interpretability
criteria.
2.3 The Air Pollution
Page 79
Example
m
0.6
0.4
0.2
0.0
0
2
4
6
8
10
m
Fig.
2.4
Draftsman’s
display for the air pollution data All possible pairwise scatterplots of number of terms m, fraction of unexplained
variance U and interpretability
S, are shown.
The statistician
of the variance,
with
must now choose a model.
approximately
even moderate
inaccuracy
crossing
model which explains
model with
The first term explains
75% (U = 0.25). For models with
interpretability
(S,
the 0.20 threshold
2 0.40), cannot
(U
5 6, models
be achieved
without
2 0.20) as seen in Fig. 2.2.
more than 80% of the variance is required,
S, = 0.60 is possible.
m
the bulk
a seven term
However, its advantages over the original
term model (Fig. 2.1) which explains
slightly
less is debatable.
If a
two
If close to 0.25
2. Interpretable
inaccuracy
Projection
Page 80
Pursuit Regression
U is acceptable,
simple two or three terms .models are possible.
.example, a three term model exists with
a two term model exists with
Ss = 0.85, U = 0.23 and combinations-
Q!l = ( 0.0,
0.0, 0.0,
1.0, 0.0, 0.0, 0.0,
0.0,
p2 = 1.10,
Q2 = ( 0.0,
0.0, 0.3,
0.9,
0.0,-0.2)r
The fourth
variable,
0.0, 0.0, 0.1,
Sandburg
Air Force Base temperature,
as is cited in other analyses of this data (Hastie
Breiman
and Friedman
alternating
is
is the most in-
and Tibshirani
1984,
1985). The last variable day of the year also has an effect.
pursuit
conditional
.
index with a power less than two may be warranted.
fluential
The projection
o.oy
1.5.2 and 3.4.3, the convergence of the coefficients
slow and an interpretability
regression model is different
in form to the additive
and
expectation
models as it includes smooths of several vari-
ables rather
than of one variable.
Thus, the three models cannot be compared
functiondly.
However, the three inaccuracy
As remarked
of the data.
interpretability
In addition,
among the models.
Similar
tative
assessment is a further
automated
to the weighing
does
one model may be more understandable
and
a function
of the model’s
may have a clear quadratic
number
of terms
m,
this quali-
subjective
notion
of interpretability
which
to exploratory
projection
pursuit,
projection
regression is
procedure.
As such, an objective
applied to choose between models.
is not
Methods
pursuit
measure of predictive
error may be
such as cross-validation
may be used
to produce an unbiased estimate of the prediction
strategy
fj when choosing
by the index Ss.
In contrast
a modeling
views the functions
and therefore
For example,
form.
of terms in the
and depends on the context
each of the curves is a smooth
description,
than another.
of the number
is subjective
the statistician
Though
not have a functional
explainable
measures are comparable.
in Section 2.2.2, the inclusion
gauging of a model’s
-.
S, = 0.40 and U = 0.21. Alternatively,
p1 = 0.21,
As discussed in Sections
For
accuracy in the models.
A good
is to choose a small number of models based on the above procedure
and
2.3
I
The Air
then distinguish
1
Page 81
Example
Pollution
between them using a resampling
to a lack of identifiability
method.
which results from the fact that the number
/
/
cannot be included
/
is not a one parameter
(X) minimization
problem.
I
’
the model’s
increases with
an additional
interpretable
I
modeling
!
I
,
/
.,!
I
1
I
I
I
!
in the interpretability
complexity
projection
technique.
Unfortunately,
pursuit
index,
the cross-validation
A subjective
of terms
procedure
measure of how
term must be made.
regression is an exploratory
due
Thus,
rather than a strict
Chapter
Connections
A comparison
scribed
of the accuracy
in the previous
warranted
method
and established
demonstrates
the generality
pursuit,
Linear
This setting
whose properties
First,
than
use random
observed
vector
of the trading
de-
selection
techniques
is
between the proposed
This discussion
is preliminary
and
an example
accuracy for interpretability
which
approach.
Regression
procedures
have not yet been proposed
modification
provides
for pro-
is considered for linear regression
various other model space search methods
for the linear
variable
is stated
and fitted
Y consists
approach
are known for comparison.
the notation
The problem
tradeoff
The last section includes
the interpretable
in this section.
connections
Conclusions
other data analysis methods to which this method
Since other model selection
jection
other model
ideas are considered.
are identified.
3.1 Interpretable
with
In this chapter,
topics of future work, including
could be extended,
and interpretability
two chapters
and interesting.
and
3
notation
regression
as in Chapter
as a minimization
values rather
problem
than
is described.
2, matrix
X has entries xij,
between
the
The
(yl, ~2,. . . , yn),
the value of the jth predictor
82
is used.
value minimization.
of the response values for the n observations
and the n X p matrix
notation
of the squared distance
an expected
Rather
for the jth
3.1 Interpretable
observation.
Linear
Page 83
Regression
If an intercept
term is required,
a column of ones is included.
error vector is E and the model may be written
as
Y=XO+c
The parameters
distance
.
p = (/31, ,&, . . . , ,f$,) are estimated
between the n fitted
The
by m inimizing
and actual observations.
the squared
The problem
in matrix
form is
(Y - XP)‘(Y
m jn
The least squares estimates
- XP)
solve the normal equations
BL* E (xTx)-lxTy
The modification
“F (I - x) (Y
in [1.12] for example.
tion with
-
XPts>‘(Y
The denominator
the problem
rather
is used instead
index Sr defined
of the first term is the m inimum
squared
linear regression solution
Xp
that
linear regression
of squared distance
becomes a maximization
P.21
- XBLS)
of the predictors
the response is the ordinary
X E [O, 11, is
_ x S(P)
index S is the single vector varimax
the combination
correlation
parameter
(Y - -v>*v - -w
distance possible, which would be at the ordinary
Recall that
WI
.
for values of the interpretability
where the interpretability
.
has maximum
solution.
[3.1].
correla-
Thus,
if the
as a measure of the model’s fit,
and the simplicity
term should be added
than subtracted.
As the interpretability
parameter
moves away from the ordinary
X increases,
the fitted
vector
f
= X,8
least squares fit E’LS in the space spanned by the
3.
Connections
Fig. 3.1
p predictors
and Conclusions
Page 84
Interpretable
linear regression.
As the interpretability
parameter X increases, the fit moves away from the least squares fit
~LS to the interpretable
fit ~ILR in the space spanned by the
p predictors.
as shown in Fig. 3.1. The interpretable
fits ?ILR
are not necessarily
the same length as the least squares one.
As a variable’s
coefficient
decreases toward zero , the fit moves into the space
spanned by a subset of p - 1 predictors.
The most interpretable
p - 1 variables
may not be the best p - 1 in a least squares sense. Even if the variable
is the same, the interpretable
.
fitting
least squares model.
coefficients
search may not guide the statistician
The interpretability
index
S attempts
subset
to the best
to pull the
/?i apart since a diverse group is considered more simple, whereas the
least squares method
considers only squared distance when choosing a model.
3.1 Interpretable
Page 85
Linear Regression
The definition
of the interpretability
index
that it can be used as a ‘cognostic’ (Tukey
for a more interpretable
model.
S as a smooth
function
means
1983) to guide the automatic
search
However, these smooth and differentiable
erties are the root of the reason the equation
The problem
coefficients
similar
is that the interpretability
which make the objective
[3.2] cannot be solved explicitly.
index contains the squared and normed
a nonquadratic
function.
A linear solution
to [3.1] is not possible.
In comparison,
tion criteria
Mallows’ (1973) CP and Akaike’s
involve
a count function
As [3.2] d oes not admit
[l.ll].
prop-
criterion
as an interpretability
a linear solution,
estimate.
variable
index as defined in
For both the CP and AIC
cannot be determined
criteria,
the complexity
model is equal to the number of variables in the model, say k. With
norming,
the associated interpretability
selec-
the more general Mallows’ CL
which may be used for any linear estimator
the interpretable
(1974) AIC
of a
appropriate
index is
sM(p)sl-
'c
WI
.
P
The Mallows
model prediction
variable
selection
technique
uses an unbiased
error to choose the best model.
The resulting
estimate
of the
search through
the model space includes only one model for a given number of variables.
natively,
which
the interpretability
guides the statistician
Another
search procedure
in a nonlinear
result of this search is variable
interpretability.
Using this terminology,
means that the CP and AIC
for
techniques
is a model estimation
manner
selection
through
as variables
the discreteness
Alter-
procedure
the model
are discarded
of the criterion
are solely variable,
space.
rather
for
[3.3]
than model,
selection procedures.
As is discussed in Section 1.3.4, an alternate
interpretability
defined as the negative distance from a set of simple points.
is whether
the modification
[3.2] can be thought
index sd can be
The natural
question
of as a type of shrinkage.
3. Connections
Page 86
and Conclusions
3.2 Comparison
W ith
Ridge
Regression
In Section 1.3.4, a distance interpretability
index Sd is defined in [1.15] which
involves the distances to a set of simple points V = {VI,. . . , YJ}. Before restricting its values to be E [O,l], th is simplicity
the negative of the m inimum
The coefficient
not the absolute
lead to a nonlinear
properties
vector
This
standardization
others in the coefficient
the coefficient
vji
P’P
.
so that
the relative
Though
mass,
these actions
are possible.
Suppose for the moment that the
X are standardized
to have mean zero and variance
ensures that
vector though
one variable
does not overwhelm
the length of the coefficient
one. This action partially
vector in the interpretability
In order to remove the need for squaring,
be considered.
2
>
vectors of any length may be compared equally and
such as Schur-convexity
is not necessarily
#
size or sign, of the elements matters.
response Y and predictors
one.
p
i=l
CC
,L?is squared and normed
solution,
vector ,f3 is
distance to V or
m in
j=l,...,J
-
measure of the coefficient
the
vector itself
removes the reason for normalizing
index.
all possible sign combinations
must
The simple set vectors vj are no longer the squares of vectors on
the unit RP sphere but just vectors on it. For example, they could be fej,
1 ,‘“7
p. The distance to each simple point is written
(P-vj)T(P-vj)
If the optimization
parameter
problem
in matrix
j=l,...,p
is reparameterized
j =
form as
,
with
a new interpretability
K, [3.2] becomes
rnjn (Y - Xp)‘(Y
- XP) + K j-ynp
(P - vj>‘(P
- ,...,
- Vj)
K>O
.
3.2 Comparison
The second m inimum
may be placed outside of the first since the first term does
not involve Uj resulting
m in
j=l
The solution
,...,P
[
rnp
in
- Xp)
p s (XTX
+ KI)-‘(XTY
vector is
portion
This estimated
Draper
matrix
m inimizes
the
of [3.4].
vector
is similar
to that
of ridge regression
and Smith 1981) with ridge parameter
Ridge regression may be examined
The method
P-51
+ Kzq)
and I is the index which
pR f (XTX
point.
P4
+ K (P - Uj)'(fi
(Y - X/?)‘(Y
where I is the p X p identity
bracketed
Page 87
With Ridge Regression
K. The ridge estimate
+ KI)-lXTY
1976,
is
.
WI
from either a frequentist
is advised in situations
(Thisted
or Bayesian
where the matrix
which can occur when the variables are collinear.
XTX
In addition,
view-
is unstable,
it does better than
linear regression in terms of mean square error as it allows biased estimates.
If a Bayesian analysis is used, the distributional
error term E has independent
Given this assumption,
.
components
assumption
is made that the
each with mean zero and variance 02.
the prior belief
produces the Bayes rule [3.6].
The ridge estimate
[3.1] toward
the origin
@ R [3.6] sh rm’ k s away from the least squares solution
as the ridge parameter
K increases.
~,QS
The interpretability
3. Connections
Page 88
p^ [3.5] sh rm
’ k s away from the least squares solution
estimate
point
and Conclusions
ul as the interpretability
during
the procedure
If the varimax
parameter
K increases.
toward
the simple
The index I may change
however.
index [1.12] is used instead of the distance
index [1.15], the
result is a solution
similar to [3.5] except that the shrinkage is away from the set
of 2p points
4~3,.
(&*,
regression terminology,
. . f &).
Shrinkage
toward
points
is the usual ridge
so the distance index is used in the explanation
above.
As Draper and Smith (1981) point out, ridge regression places a restriction
the size of the coefficients
,0, whether or not that restriction
interpretability
[3.2] d oes not require any restriction
approach
squares the coefficients before examining
mathematical
problems.
on the coefficients
Clearly,
the approach
and identically
-logL(P;Y)
.
terized
a prior
is discussed in the next section.
is also the maximum
likelihood
the posterior
distribution
Thus, as Good and Gaskins
m inimizes
solution
givenr
N(0, a2). , Then the squared
or
= (Y - XP)‘(Y
estimate
likelihood
are that the elements of the error
distributed
distance is the negative of the log likelihood
prior.
can be viewed as placing
The necessary assumptions
term E are independent
tribution,
as it norms and
as a Prior
The least squares solution
The maximum
The
As noted, however, this produces
and this Bayesian viewpoint
3.3 Interpretability
certain conditions.
them.
is called a prior.
on
- XP)
*
this expression.
is equal to the likelihood
Given a prior dismultiplied
(1971) point out, m inimizing
[3.2]
(Y - XP)'(Y - XP) - fqP)
by the
the reparame-
3.3 Interpretability
as
a Prior
Page 89
1.0
angle of /3 in radians
Fig.
3.2
is equivalent
Interpretability
prior density for p = 2.
The prior fn( ,0) is
plotted versus the angle of the coefficient vector in radians for
various values of K.
to minimizing
the negative
do so is to put a prior density
interpretability
for a given interpretability
K>O
index, the prior
parameter
To
to
.
distribution
of the coefficients
value K > 0 is
where CK is the normalizing constant.
density
minus the log prior.
on p where the prior is proportional
exp(KS(P))
For the varimax
log likelihood
belongs to the general exponential
The coefficients are dependent.
family
defined by Watson
(1983).
The
Page 90
3. Connections and Conclusions
The prior
calculated
function,
for the p = 2 case is plotted
using numerical
The constant
Since the exponential
integration.
the basic shape of the curve resembles the index
As K increases, the prior
increases.
3.4 Future
are pushed toward
Similar
C, is
is a monotonic
S1 as in Fig.
1.2.
on interpretability
an angle of zero (p = (1,O)) or an
results would be seen for p = 3.
Work
The previous
interpretable
sections may serve as a foundation
method
connecting
with
this approach
both extending
algorithm
becomes more steep as the weight
The coefficients
angle of $ (p = (0,l)).
3.4.1
in Fig. 3.2.
other
with
the method
variable
selection
for the comparison
Further
techniques.
others is proposed below.
In addition,
to other data analysis methods
of the
work
ideas for
and improving
the
are given.
Further
Connections
Schwarz (1978) and C. J. Stone (1981, 1982) suggest other model selection
techniques
based on Bayesian assumptions.
pares the Schwarz and Akaike
M. Stone (1979) asymptotically
(1974) criteria,
noting
that the comparison
fected by the type of analysis used. In the Bayesian framework,
technique
Another
could be compared
technique
with
these others utilizing
which may provide
sanen (1987), who appl’les the minimum
theory
to measure the complexity
asymptotic
comparison
description
length principle
In addition,
is af-
the interpretable
an interesting
of a model.
com-
analysis.
is that of Risfrom coding
the work by Copas
(1983) on shrinkage in stepwise regression may provide ways to extend the ridge
regression discussion of Section 3.2. Interestingly
rules noted involve
the number
of parameters,
enough, all the model selection
k in [3.3], rather
than a smooth
measure of complexity.
In Chapter
computational
2, the varimax
and entropy
indices have similar
appeal and both have historical
motivation.
intuitive
and
Interpretability
as
9.4 Future
Work
measured
by these indices could be compared
philosophical
Though
Page 91
terminology
the framing
the coefficients
with
simplicity
as defined using
in Good (1968), Sober (1975) and Rosenkrantz
of the interpretable
attempts
approach
as the placing
to do this, a less mathematical
(1977).
of a prior
on
and more philosophical
discussion may prove beneficial.
3.4.2
Extensions
The interpretable
search into
a numerical
whose resulting
extended
approach
one.
description
It can be applied
to measure the simplicity
of interpretability
complicated
descriptions.
At present,
of other description
and accuracy
for methods
feasible va.riable.selection
that
produce
any subset of predictors
bound search methods
is known
analytically
linear
can be employed
regression
variable
linear regression may help in collinear
that interpretable
to include from a group of collinear
A general
.
class of models
regression is an example.
optimization
a model sebut for which
the least squares solution
Second, branch
for
and
search the model space,
procedures
situations.
exist,
interpretable
The least squares solution
suggested.
is
At present, examples
linear regression clearly chooses the variables
ones and produces a stable solution.
for which
Whenever
must
all sub-
need not be considered.
selection
the interpretable
useful is generalized linear models (McCullagh
separate
it provides
combinations
which smartly
and in fact, ridge regression is usually
seem to indicate
types such as functions,
to be [3.1].
areas so that all possible contenders
Though
unstable
First,
method
prove useful for even more
is that
linear
space
If the index is
approaches do not exist. For linear regression,
sets regression is possible for two reasons.
eliminating
might
however, its main improvement
procedure
model
to any data analysis
or model involves linear combinations.
the tradeoff
lection
changes the usual combinatorial
be done.
method
may prove
and Nelder 1983), of which logistic
a generalized
At present,
linear model is considered,
stepwise
methods
a
are used
9.
to choose models.
approach
3.4.3
Page 92
and Conclusions
Connections
These methods
on the basis of prediction
Algorithmic
As described
algorithm
could be compared
with
the interpretable
error.
Improvements
in Chapter
would benefit
variant
Fourier
others.
Other rotationally
1, the interpretable
from further
projection
exploratory
improvements.
First,
index GF needs further
invariant
projection
projection
testing
pursuit
the rotationally
in-
and comparison
with
indices are possible as suggested
in Section 1.4.2.
Second, present work involves designing a procedure
to solve the constrained
optimization
instead of using the general routine as outlined
improvement
should decrease the computational
projection
pursuit
needs further
regression
forecasting
time involved.
procedure
described
in Chapters
as the weight on interpretability
of the squaring
in Section
1 and 2, the convergence of the coefficients
increases is slow. This property
used in the varimax
zero as the combinations
the
2.2.3
moves to a maximizing
3.5 A General
to have the same derivatives
which is the varimax
index
The sections of the latter
at their joined
points.
Framework
The main result of this thesis is the computer
tion of projection
results because
ei (Figs. 1.2 and 1.3). Solutions
except for a range close to the e;‘s where it is linear.
index could be matched
to zero
index and the result that the slope goes to
might be to use a lower power or a piecewise function
pursuit
implementation
of a modifica-
which makes the results more understandable.
the approach provides
Beyond this specific application,
accuracy
Analogously,
investigation.
As mentioned
dition,
in Section 1.4.5. The
a general framework
for tackling
the search for interpretability
is made often in statistics,
sometimes
implicitly.
similar
at
In adproblems.
the expense of
The identification
and
3.5 A General
formalization
of this action is useful since choices which previously
tive become objective.
histogram
is examined.
simplifying
In the next subsection,
the rounding
were subjec-
of the binwidth
for a
This common example shows that the consequences of a
action in terms of accuracy loss can be approximated.
of interpretability
outcomes
Page 99
Framework
must be broadened
than linear combinations.
to deal with
Measuring
The definition
a much more elusive set of
the interpretability
increase is
difficult.
3.5.1
The Histogram
Example
Consider the problem of drawing a histogram
Two quantities
must be determined.
x0, which usually
bin.
is chosen so that
The second is the binwidth
of n observations
~1, x2, . . . , zn.
The first is the left endpoint
of the first bin
no observations
fall on the boundaries
h, which generally
a rule of thumb
and then rounded
usually multiples
of powers of ten. Based on the mathematical
determine
the rules widely
the binwidth
employed,
the resulting
intervals
to
are simple,
approach used to
the loss in accuracy incurred
is to estimate
Scott (1979) d e t ermines a binwidth
the integrated
by rounding
the shape of the true underlying
rule which asymptotically
mean square error of the histogram
rule and further
theoretical
results.
employs is useful in approximating
A certain
the further
m inimizes
density from the true density.
Diaconis and Freedman (1981) use the same criterion
rounded.
according
can be calculated.
The purpose of a histogram
density.
so that
is calculated
of a
to lead to a slightly
approximation
accuracy
different
step which
Scott
lost if the binwidth
is
The discussion below follows his approach.
The integrated
the true density
mean square error of the estimated
histogram
density f” from
f is
IMSE
G
J [
O” E f^( x) - f(x)] 2dx
--oo
1
=O” f’(~)~dx + O(;
+ $h2
nh
-CO
J
WI
+ h”)
Page 94
3 Connections and Conclusions
where h is the histogram
binwidth.
Minimizing
the first two terms of [3.7] pro-
duces the estimate
P-81
Scott also shows that if the binwidth
in IMSE
is multiplied
by a factor c > 0, the increase
is
= ~IMSE(IZ)
IMSE(cf)
WI
.
Via Monte
Carlo studies, Scott shows [3.8] and [3.9] to be good approximations
for normal
data.
In reality,
derivative
density
[3.8] is useless since the underlying
Scott,
f’ are unknown.
as a reference distribution.
where s is the estimated
suggest the similar
standard
and Diaconis
density
f and therefore
and Freedman
its
use the normal
Scott suggests the approximation
deviation
of the data. Diaconis
and Freedman
approximation
where IQ is the interquartile
range of the data. Monte Carlo studies have shown
these approximations
Given either
approximate
to be robust.
,.
approximation
hs or AD, the statistician
the binwidth
discussed in a moment.
estimate
by rounding.
At present, consider rounding
of such simplification
the estimate
k* is E (fis - u, hs + u) where u is some rounding
u would be i if the binwidth
-
The benefits
may elect to further
also be written
is rounded
to an integer.
as
A* E (1 + e)iis
unit.
is.
are
The new
For example,
The new binwidth
may
Page 95
3.5 A General Framework
where e is a positive
or negative factor depending
rounded up or down. This multiplying
on whether
the old estimate
is
factor must be 2 1 as a negative binwidth
is impossible.
The estimation
procedure
may be drawn schematically:
binwidth
h
binwidth
u
i
binwidth
u
fis
rounded binwidth
u
h*
minimize
estimated
first two terms
use normal
approximate
density as reference
round
If [3.9] is used as an approximation
rounding,
for the resulting loss in IMSE
*
between the IMSE
of hs and h* is written
the relationship
IMSE(fi*)
The percent
factor
change in IMSE
e3
= (13Tl ; e; 21MSE(
can be plotted
due to
.
is)
as a function
of the multiplying
e as shown in Fig. 3.3.
This exercise demonstrates
different
repercussions
difficult
to measure explicitly.
-are all that
it to another.
are simpler.
can be retained
that
Finally,
might
a binwidth
the class delineations
(1979) sh ows that
result
the accuracy
is
is easier to draw and describe
Certainly
in short-term
up or down results in
The increase in interpretability
The histogram
In fact, Ehrenberg
removes confusion
rounding
in terms of accuracy.
since the class boundaries
to remember.
that
memory.
in explaining
are easier
two digits other than zero
In addition,
the rounding
the histogram
or comparing
in the actual observations
xi may prompt
3. Connections
Page 96
and Conclusions
e
Fig. 3.3
rounding.
Percent change in IMSE
versus multiplying
fraction e in the
binwidth example. The rounded binwidth A* - (1 + e$s, where
ks is the estimated binwidth due to Scott (1979).
Extra
digits beyond the number
of significant
ones in the data leads
to a false sense of accuracy.
How to measure the interpretability
Diaconis
determine.
interpretability
gain.
of a histogram
directly
is difficult
(1987) suggests conducting
experiments
A typical
be to divide a statistics
experiment
m ight
to quantify
into two matched groups and to present to either group respectively
histogram
and its simplified
version.
Measurements
to
the
class
a unrounded
of interpretability
could be
made on the basis of the correctness of answers to questions such as ‘How would
-
adding the following
observations
change the shape of the histogram?’
is the lower quartile ?‘. As interpretability
is increased, some information
or ‘Where
may be
lost. For example, the question ‘Where is the mode?’ may become unanswerable.
Page 97
5.5 A General Framework
This loss of information,
a more general measure of inaccuracy
could be measured along with interpretability.
be included
3.5.2
by asking data analysts their reaction
have changed statistics
hand, the resulting
and applications.
flexibility
expert opinion
could
-
to rounding.
principle
technique,
to monitor
pursuit.
modes of information
these computer
these abilities
of exploratory
undreamed
of possibilities
losing the ability
amount
of flexi-
computer-intensive
the statistician
and communicating
retains
her new
to achieve her basic
those results to others.
tools has come freedom.
provide
of abilities
The aim of this thesis is to use the
results in a particular
without
On the one
In the initial
the power with which to follow
stages of an
the basic tenets
data analysis (Tukey 1977), to let the data drive the analysis rather
than subjecting
it to preconceived
stages, the wealth
expanded
previously
In this manner,
translation
goal of clearly understanding
analysis,
and irreversibly.
the sacrifice of an acceptable
for more interpretable
projection
With
has cultivated
and unmanageable.
of parsimony
in return
substantially
On the other, the sheer number and complexity
can be both bewildering
assumptions
of models or descriptions
enormously.
using the bootstrap
Confirmation
(Efron
noted, the statistician
‘What
IMSE,
Conclusion
Computers
bility
In addition,
than
which may not be true.
In later
which can be fit and evaluated
of these models
can be readily
answered
procedures.
As Tukey
can be confirmed?’
but rather
1979) or other resampling
no longer answers ‘What
has
can be done?‘.
Computing
putational
statistician’s
power is extending
and confirmational
imagination
and eliminating
boundaries.
are alleviated
Even unconscious
(McDonald
results of such an analysis can be complex,
important,
hard to explain.
dora’s box has been opened.
Though
previous mathematical,
restrictions
and Pedersen 1985).
hard to understand,
this progress is exciting,
Just as grappling
with
comon the
The
and even more
in a sense a Pan-
the theoretical
demons of
9. Connections and Conclusions
new methods
Page 98
such as projection
task, so too is considering
In order to understand
effectively,
a controlled
computing
power
which
pursuit
(Huber
the parsimonious
and communicate
1985) is a necessary and difficult
aspects.
the results of a statistical
use of these new methods
has produced
these
novel
to balance the search for an accurate and truthful
equally
important
such a balance.
desire for simplicity.
is helpful.
techniques
description
Interpretable
Fortunately,
provides
analysis
the very
a means
of the data with an
projection
pursuit
strikes
Appendix
A
Gradients
A.1
Interpretable
In this
pursuit
Exploratory
section,
objective
are calculated.
desired gradients
the gradients
Projection
Gradients
for the interpretable
exploratory
projection
function
Since the search procedure
is conducted
in the sphered space, the
are
dF
aajplT
dF
aF aF
-daj = (aajl’G’***’
From [A.l],
Pursuit
the gradients
may be written
aF _ (I- A> aGF
dcvj
max GF
hi
in vector notation
+A%
99
j=1,2
as,
j=1,2
.
as
.
Page 100
Appendix A. Gradients
The Fourier projection index gradients are calculated from [1.23], yielding
From the definition
of the Laguerre functions
z:(R) = CSe-iR
with recursive equations
[1.20],
- ie-!jR
i=O,l,...
derived from the definition
)
of the Laguerre polynomials
p.211
=o
-=
o
a”L”, -1
-=
ddLRz
-==ii:
-=
f3R
2
2i - 1
--fR)%&
( i
The gradients
definition
i - 1
fLi-l-
of the radius
squared
R
(7)x
= (2Xj)Z
2=
2
3 )...
and angle 0 are calculated
[1.20]. If X1 z QTZ and X2 G ~$2,
g
.
dLi-2
the gradients
.
using the
are
j=1,2
P-31
A.1 Interpretable
Exploratory
Projection Pursuit
Page 101
Gradients
In [A.3], note that 2 is a vector E RP, while X1 and X2 are scalars. As with the
calculation
of the value of the index, the expected
values are approximated
by
sample means over the data.
The gradients
of the simplicity
tion [1.27], th e orthogonally
mapping
written
translated
from the unsphered
in matrix
notation
index are calculated
component
&
using the index definidefinition
11.251, and the
to the sphered space [1.5]. The gradients
may be
as
[A*41
ml as,-a~;ap,
d012=bpiap2acv2 *
In [A.4], note that the partial
a p X p matrix.
derivative
of one vector with respect to another
is
For example,
ah = aplr
(-1
aQ, ts aals
r,s
= l,,..,
.
p
Using [1.5] yields
aPjr
-=aaj,
Urs
a
where U and D are the eigenvector
r,s = l,...,p
and eigenvalue matrices
defined in [1.3]. Using
[1.25] yields
a&
-
= - Plr (
P2aTPl>
ah,
MA
w;, _
Plr
---pTB,(BI,)
ap2,
The diagonal
-
elements are
WlsW2)
I2
r,s=l,...,
pandr#s
.
Page 102
Appendix A. Gradients
The two vector varimax
bination
of the individual
%hPd
I
Taking
simplicities
= $P
partial
simplicity
- Wl(Pl>
derivatives
index S, [I.121 may be written
with
+
21 -
.
qp,,p,j
yields
6
gr
-plr [ --I-(p-l)sl(pl)
p
P
-ah = 2PTB,
P;&
The partial
denoted as C yielding
and a cross-term
+ (P - l)Sl(PZ)
as a. com-
1
to ,8& is identical
respect
r=
-~+CtP1,82)]
l,...,p
.
1
with
&
components
replacing
,&
components .
A.2
Interpretable
Projection
Pursuit
In this section, the gradients
sion objective
Regression
for the interpretable
projection
pursuit
The gradients
Y)
-
WI
wJ(W,~2,...,4
used in the steepest descent search for directions
Cyj are
dF
-=
daj
aF
aF dF
-y
(---dajl ' aaj2" ' * ' aajp
From [A.5], the gradients
aF
Friedman
may be written
(i-x)
(1985) calculates
aL2
L
= -2E[Rj
&j
The partials
regres-
function
F(P, a, f, x, Y) z (1 - X) L2(P;;;fi;Y
are calculated.
Gradients
j = l,...,m
in vector notation
aL2
the
L2 distance
j=
gradients
- @jfj((YTX)]Pjfj(cYTX)X
of the curves fj are estimated
l,...,m
.
as
.
from [2.7], yielding
j = I.,. . . ,m
using interpolation.
.
A.2 Interpretable
The gradients of the simplicity
[2.13]. Taking partial
derivatives
index are calculated
The partials
using the index definition
yields
a&J
-=mC~l,~(~-;)$
doji
Page 103
Projection Pursuit Regression Gradients
j=l,...,
mandi=l,.:..,p
k=l
of the overall vector y [2.12] are
-2cViic4
ark=
(“jTclj)2
dcrj;
2crjl,(Crroj
-
ark=
aajk
forj=l,...,
mandi=l,...,
(“TCyj)2
p.
Ic
“3k)
+
2
.
References
Akaike,
H. (1974).
Transactions
Asimov,
“A new look at the statistical
on Automatic
D. (1985).
data,” SIAM
Bellman,
“The
Journal
Control AC-19,
Grand
R. E. (1961).
Adaptive
and Statistical
Control
IEEE
716-723.
A tool for viewing
Tour:
of Scientific
model identification,”
Computing
multidimensional
6, 128-143.
Processes, Princeton
University
Press,
Princeton.
Breiman,
L. and Friedman,
for multiple
J. H. (1985).
regression and correlation
ican Statistical
Association
Chambers,
discussion),”
J. M., Cleveland,
Annals
Journal
-
of Statistics
Wadsworth,
“Regression,
of the Royal Statistical
Dawes, R. M. (1979).
discussion),”
R. (1989).
W. S., Kleiner,
ical Methods GOT Data Analysis,
Copas, J. B. (1983).
(with
optimal
transformations
Journal
of the Amer-
smoothers
and additive
80, 580-619.
Buja, A., Hastie, T. and Tibshirani,
models (with
“Estimating
prediction
“Linear
17, 453-555.
B. and Tukey, P. A. (1983).
Graph-
Boston.
and shrinkage
(with
discussion),”
Society, Series B 45, 311-354.
“Th e robust beauty of improper
making,”
American
Psychologist
34, 571-582.
Diaconis,
P. (1987).
Personal communication.
linear models in decision
I
I
Page 105
References
Diaconis,
I
P. and Freedman, D. (1981).
Lz theory,”
Zeitschrift
fuer
“On the histogram
as a density estimator:
WahrscheidichkeitstheoTie
und Verwandte
Gebiete
57, 453-476.
Diaconis,
P. and Shahshahani,
combinations,”
/
SIAM
Journal
D. L. and Johnstone,
Donoho,
and a duality
“On
M . (1984).
of Scientific
and Statistical
Annals
functions
Computing
“Projection-based
I. M . (1989).
with kernel methods,”
nonlinear
of Statistics
of linear
5, 175-191.
approximation
17, 58-106.
!
Draper,
N. and Smith, H. (1981).
Applied Regression Analysis,
Wiley, New York
;:
1
/
,
Efron,
B. (1982).
The Jackknife,
CMBS
38, SIAM-NSF,
Efron, B. (1988).
the Bootstrap,
and Other Resampling
Plans,
Philadelphia.
“Computer-intensive
methods in statistical
regression,”
SIAM
Review 30,421-449.
Ehrenberg,
A. S. C. (1981).
Statistical
Association
Friedman,
J. fi.
Department
Friedman,
Department
Friedman,
jection
Journal
of the American
35, 67-71.
(1984a).
“SMART
of Statistics,
Stanford
J. H. (1984b).
user’s guide,”
Technical
Stanford
J. H. (1985).
LCSOOl,
Technical Report LCSO05,
University.
“Classification
Technical
Report
University.
“A variable span smoother,”
of Statistics,
pursuit,”
“The problem of numeracy,”
Report
and multiple
LCS012,
regression
Department
through
of Statistics,
pro-
Stanford
University.
Friedman,
J. H. (1987).
ican Statistical
Friedman,
exploratory
projection
pursuit,”
Journal
of the Amer-
Association 82, 249-266.
J. H. and Stuetzle,
Aal of the.American
Friedman,
“Exploratory
Statistical
W. (1981).
“Projection
regression,”
JOUF
Association 76, 817-823.
J. H. and Tukey, J. W. (1974).
data analysis,”
pursuit
IEEE
Transactions
“A projection
on
Computers
pursuit
algorithm
C-23,
881-889.
for
References
Page 106
Gill, P., Murray,
W. and Wright,
M. H. (1981). Practical
Optimization,
Academic
Press, London.
Gill, P. E., Murray,
for NPSOL,”
Stanford
W., Saunders, M. A. and Wright,
Technical
Report
M. A. (1986).
SOL 86-2, Department
“:User’s guide
of Operations
Research,
University.
Good, I. 3. (1968).
and a sharpened
“C orroboration,
British
razor,”
explanation,
evolving
probability,
for the Philosophy
Journal
simplicity
of Science 19, 123-
143.
Good, I. J. and Gaskins, R. A. (1971).
probability
densities,”
Gorsuch,
R. L. (1983).
“N on p arametric
roughness penalties
for
58, 255-277.
Biometrika
Factor
Analysis,
Lawrence
Erlbaum
Associates,
New
Jersey.
Hall, P. (1987).
tion pursuit,”
“On polynomial-based
Annals
of Statistics
projection
projec-
17, 589-605.
H. H. (1976). M o d em Factor Analysis,
Harman,
indices for exploratory
The University
of Chicago Press,
Chicago.
Hastie,
T. and Tibshirani,
Department
Huber,
of Statistics,
P. (1985).
R. (1984).
Stanford
“Projection
“Generalized
additive
models,”
LSCO02,
University.
pursuit
(with
discussion),”
Annals
of Statistics
13, 435-525.
Jones, M. C. (1983).
analysis,”
Ph.D.
“The
Dissertation,
projection
University
Jones, M. C. and Sibson, R. (1987).
sion) ,” Journal
Krzanowski,
structure,
pursuit
W. J. (1987).
using principal
is projection
data
pursuit?
(with
discus-
Society, Series A 150, l-36.
“S e 1ec t’ion of variables
components,”
for exploratory
of Bath.
“What
of the Royal Statistical
algorithm
Applied
to preserve multivariate
Statistics
36, 22-33.
data
References
Page 107
Lubinsky,
D. and Pregibon,
“Data
M. (1985).
“Applications
problems
in statistics,”
Marshall,
A. W. and Olkin,
Its Applications,
of the annealing
Biometrika
Academic
I. (1979).
Journal
of
to combinatorial
Theory
of Majorization
and
Press, New York.
“Some comments
Mallows,
C. L. (1983).
“Data
and Robustness,
algorithm
Inequalities:
C. L. (1973).
McCabe,
as search,”
72, 191-198.
Mallows,
New York,
analysis
38, 247-268.
Econometrics
Lundy,
D. (1988).
on C,,”
description,”
in Scientific
eds. G. E. P. Box, T. Leonard
15, 661-676.
Technometrics
Inference
Data Analysis,
and C.-F. Wu, Academic
Press,
135-151.
G. P. (1984).
McCullagh,
“Principal
P. and Nelder,
variables,”
J. A. (1983).
Technometrica
G eneralized
26, 137-144.
Linear
Models,
Chapman
and Hall, New York.
McDonald,
tation,
J. A. (1982).
Department
McDonald,
analysis
puting
“I n t eractive
of Statistics,
Stanford
J. A. and Pedersen,
part
I: introduction,”
graphics
Ph.D.
Disser-
University.
J. (1985).
SIAM
for data analysis,”
“Computing
Journal
environments
of Scientific
for data
and Statistical
Com-
6, 1004-1012.
Reinsch, C. H. (1967).
“S moothing
by spline functions,”
Numeriache
Mathematik
10, 177-183.
RCnyi,
A. (1961).
of the Fourth
“On
Berkeley
Symposium
Vol. 1, ed. J. Neyman,
Rissanen,
J. (1987).
Royal Statistical
Rosenkrantz,
measures
of entropy
on Mathematical
547-561, University
“Stochastic
Society,
and information,”
Statistics
of California
complexity
(with
in Proceedings
and Probability,
Press, Berkeley.
discussion)”
Journal
of the
Series B 49, 223-239, 252-265.
R. D. (1977).
I n f erence, Method
and Decision,
Reidel,
Boston.
Page 108
References
Schwarz, G. (1978).
,
I
“Estimating
the dimension
of a model,” Annuls of Statistics
6, 461-464.
I
Scott, D. (1979).
I
I
“On optimal
and data-based
histograms,”
B&net&z
66, 605-
610.
Silverman,
B. W. (1984).
cyclopedia
of Statistical
“Penalized
maximum
Sciences, eds. S. Kotz
likelihood
estimation,”
and N. L. Johnson,
in En-
Wiley,
New
selection of an accurate and parsimonious
nor-
York, 664-66’7.
Sober, E. (1975).
Simplicity,
Stone, C. J. (1981).
“Ad missible
mal linear regression model,”
Stone,
Annals
“Local
C. J. (1982).
Akaike’s
Clarendon
Press, Oxford.
of Statistics
asymptotic
model selection rule,” Annals
9, 475-485.
admissibility
of the Institute
of a generalization
of Statistical
of
Mathematics
34, 123-133.
Stone, M. (1979).
Journal
of the Royal Statistical
Sun, J. (1989).
of Statistics,
Thisted,
“C omments on model selection criteria of Akaike and Schwarz,”
“P-values in projection
Stanford
pursuit,”
Ph.D. Dissertation,
Department
University.
R. A. (1976).
Bayes methods,”
Society, Series B 41, 276-278.
“Ridge
regression,
Ph.D. Dissertation,
minimax
Department
estimation
of Statistics,
and empirical
Stanford
Univer-
sity.
Thurstone
L. L. (1935).
The V ec t OTS of the Mind,
University
of Chicago Press,
Chicago.
Tukey, J. W. (1961).
_
“D iscussion, emphasizing
of variance and spectrum
analysis,”
Tukey, J. W. (1977). Exploratory
the connection
Technometrics
Data Analysis,
between analysis
3, 201-202.
Addison-Welsey,
Reading,
MA.
References
Tukey,
statistics:
Page 109
J. W. (1983).
PTOCeedingS
“Another
of
R. Sacher and J. Wilkinson,
Tukey,
views,”
look at the future,”
the ldth symposiu m on the Interface,
Springer-Verlag,
P. W. and Tukey, J. W. (1981).
in Interpreting
in Computer
Mu&ivariate
Data,
G. S. (1983).
Statistics
and
eds. K. Heiner,
New York, 2-8.
“Preparation;
ed.
prechosen sequences of
V. Barnett,
189-213.
Watson,
Science
on Spheres, Wiley,
New York.
Wiley,
New York,
Download