Projection Pursuit Regression/Classification
• Motivation: Consider the following example:
Xi ∼ Unif([−1, 1] × [−1, 1]); Yi = Xi1 Xi2; i = 1, 2, . . . , n.

[Figure: two perspective scatterplots of y against (x1, x2) for the simulated toy data.]
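The toy data above can be reproduced in R along the following lines (a minimal sketch; the sample size, the seed, and the object name toy are illustrative, not from the slides):

    set.seed(1)                              # illustrative seed
    n  <- 200
    x1 <- runif(n, -1, 1)                    # Xi uniform on [-1,1] x [-1,1]
    x2 <- runif(n, -1, 1)
    y  <- x1 * x2                            # Yi = Xi1 * Xi2, no noise
    toy <- data.frame(x1 = x1, x2 = x2, y = y)
    lattice::cloud(y ~ x1 * x2, data = toy)  # a 3-D scatterplot similar to the one shown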
Using CART
[Figure: CART fit to the toy data: the fitted tree with splits such as x1 < −0.833681, the CART partition of the response (times 0.1) over the (x1, x2) plane, and deviance plotted against tree size.]
• Partitioning is difficult because interaction effects are not addressed by CART: to detect the interaction we have to look at marginals along directions in predictor space other than the coordinate axes.
• The marginal variables alone are not adequate to split the data informatively.
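For comparison, a regression tree can be grown on the simulated toy data, e.g. with rpart (a sketch; the plots on the slide appear to come from the tree package, which behaves similarly, and the cp value here is only illustrative):

    library(rpart)
    fit.cart <- rpart(y ~ x1 + x2, data = toy,
                      control = rpart.control(cp = 0.001))  # small cp so the tree is allowed to grow
    print(fit.cart)    # only axis-parallel splits on x1 or x2; the X1*X2 interaction never enters directly
    printcp(fit.cart)  # complexity table: cross-validated error against tree size (cf. the deviance-vs-size plot)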
Toy Example Continued

[Figure: scatterplots of y against (x1 + x2) and against (x1 − x2) for the toy data.]
• Plotting Y against X1 + X2, X1 − X2 reveals a dependence.
• Building an additive model seems better than splitting: while Y cannot be written as a weighted sum of X1 and X2, it can be written as a sum of functions of linear combinations of the predictors: Y = (1/4)(X1 + X2)^2 − (1/4)(X1 − X2)^2.
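Assuming the simulated toy data frame from the earlier sketch, the decomposition can be checked and the two projections plotted directly:

    all.equal(toy$y, 0.25 * (toy$x1 + toy$x2)^2 - 0.25 * (toy$x1 - toy$x2)^2)  # TRUE
    plot(toy$x1 + toy$x2, toy$y)   # quadratic ridge along the (1, 1) direction
    plot(toy$x1 - toy$x2, toy$y)   # quadratic ridge along the (1, -1) direction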
Projection Pursuit Classification/Regression
• Our goal, therefore, is to fit the response as a linear combination of smooth functions of projections of the predictors.
• Our objective will be to find a direction (unit vector) a and a linear function l(z) = b0 + b1z such that s(x) = b0 + b1a′x minimizes RSS = Σ_{i=1}^n (yi − b0 − b1a′xi)^2 (regression) or the misclassification rate (as before); a sketch for the toy data follows below.
• Note: a function of the form s(x) = b0 + b1a′x is a ridge function: it varies only along the direction a and is constant along all orthogonal directions; i.e., if a′x = a′x∗, then s(x) = s(x∗).
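With two predictors, a direction can be parameterized by a single angle, so the stated objective (the best linear ridge fit over all unit vectors a) can be sketched in R as follows (function and variable names are illustrative; assumes the toy data frame from before):

    rss.direction <- function(theta, data) {
      a <- c(cos(theta), sin(theta))                        # unit vector a in R^2
      z <- drop(as.matrix(data[, c("x1", "x2")]) %*% a)     # projections a'x_i
      sum(resid(lm(data$y ~ z))^2)                          # RSS of the linear ridge fit b0 + b1 a'x
    }
    opt <- optimize(rss.direction, interval = c(0, pi), data = toy)
    opt$objective   # for Y = X1*X2 no single *linear* ridge term fits well, whatever the direction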
Projection Pursuit Classification/Regression
• Generalizations: Replace the linear function by more general ridge functions, i.e., allow for more than one term, to get
s(x) = Σ_{i=1}^M si(a′ix)
(see the sketch below for the toy example written in this form).
• Note that if ai = ei, i.e., the ai are restricted to be the coordinate directions, then s(x) = Σ_{i=1}^M si(xi) is the same as an additive model.
• We now address the issue of fitting the model for known M .
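Before turning to fitting, note that the toy example's exact decomposition is itself an M = 2 expansion of this ridge form, which can be checked in R (illustrative helper names; assumes the toy data frame from the earlier sketch):

    ridge.eval <- function(X, dirs, fns) {           # s(x) = sum_i s_i(a_i' x)
      Z <- sapply(dirs, function(a) drop(X %*% a))   # columns are the projections a_i' x
      Reduce(`+`, lapply(seq_along(fns), function(i) fns[[i]](Z[, i])), init = 0)
    }
    dirs <- list(c(1, 1) / sqrt(2), c(1, -1) / sqrt(2))
    fns  <- list(function(z) z^2 / 2, function(z) -z^2 / 2)   # since (a'x)^2 = (x1 +/- x2)^2 / 2
    all.equal(unname(ridge.eval(as.matrix(toy[, c("x1", "x2")]), dirs, fns)), toy$y)  # TRUE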
Fitting Projection Pursuit
Classification/Regression Models
• The general model is set as: Y ∼ Σ_{m=1}^M sm(a′mx)
• As written, the sm(·) are not uniquely determined because we can add and subtract constants. We get around this by centering the responses so that Ȳ = 0 (a one-line sketch follows below).
• Additionally, we require the sm's to have mean zero, i.e., we specify Σ_{i=1}^n sm(a′mxi) = 0 for all m.
• So, we need to find a1, a2, . . . , aM , s1(·), s2(·), . . . , sM (·) minimizing the RSS or the misclassification error rate.
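A one-line sketch of the centering step mentioned above (assuming the toy data frame from the earlier sketch):

    y.c <- toy$y - mean(toy$y)   # centered responses; mean(y.c) is 0, and fits below would use y.c in place of toy$y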
Fitting PPR Models
• Suppose we know a1, a2, . . . , aM ; s2(·), s3(·), . . . , sM (·).
• Then the only unknown is s1(·).
• Draw a scatterplot of the residuals ri = Yi − Σ_{m=2}^M sm(a′mxi) against a′1xi, and apply a scatterplot smoother (supersmoother, moving average, etc.) to ri against a′1xi (sketched below).
• We thus get an iterative procedure using backfitting.
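A sketch of this smoothing step on the toy data, taking the two true directions as known and an initial supersmoother fit as the current guess for s2 (all object names are illustrative):

    X  <- as.matrix(toy[, c("x1", "x2")])
    a1 <- c(1, 1) / sqrt(2);  a2 <- c(1, -1) / sqrt(2)
    z1 <- drop(X %*% a1);     z2 <- drop(X %*% a2)          # projections a1'x_i and a2'x_i
    s2.cur <- supsmu(z2, toy$y)                             # a current guess for s2 (illustrative)
    r <- toy$y - approx(s2.cur$x, s2.cur$y, xout = z2)$y    # residuals r_i = Y_i - s2(a2'x_i)
    s1.new <- supsmu(z1, r)                                 # supersmoother of r_i against a1'x_i
    plot(z1, r); lines(s1.new)                              # the updated estimate of s1(.)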
Fitting PPR Models
• Suppose we know a1, a2, . . . , aM and that at any time, we
have current guesses for s1(·), s2(·), . . . , sM (·).
• Initially, all si(·) ≡ 0.
– Cycle over m = 1, 2, . . . , M in the following manner:
∗ Update each sm(·) in the same manner as described above for s1(·); keep cycling until there is no improvement.
– Thus, we get an algorithm for finding s1(·), s2(·), . . . , sM (·)
given a1, a2, . . . , aM .
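The full cycle, for fixed directions, can be sketched as a small helper function (illustrative, not from the slides; a fixed number of cycles stands in for "until there is no improvement", and X, a1, a2 are reused from the previous sketch):

    backfit <- function(y, X, dirs, n.cycles = 20) {
      M <- length(dirs)
      Z <- sapply(dirs, function(a) drop(X %*% a))   # n x M matrix of projections a_m' x_i
      fits <- vector("list", M)                      # current smooths; NULL stands for s_m = 0
      s.eval <- function(m)                          # evaluate the current s_m at its own projections
        if (is.null(fits[[m]])) rep(0, length(y)) else approx(fits[[m]]$x, fits[[m]]$y, xout = Z[, m])$y
      for (cycle in seq_len(n.cycles)) {
        for (m in seq_len(M)) {
          r <- y - Reduce(`+`, lapply(setdiff(seq_len(M), m), s.eval), init = 0)  # leave-one-term-out residuals
          fits[[m]] <- supsmu(Z[, m], r)             # update s_m with the supersmoother
        }
      }
      fits                                           # one supsmu fit per term
    }
    bf <- backfit(toy$y, X, list(a1, a2))
    plot(bf[[1]], type = "l")                        # estimated s1(.) after backfitting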
Iterative Backfitting Algorithm
• The question now arises: how do we find a1, a2, . . . , aM ?
• Note that for any choice of a1, a2, . . . , aM , we can find the optimal s1(·), s2(·), . . . , sM (·) and compute the corresponding RSS.
• We therefore optimize over a1, a2, . . . , aM numerically, a problem with M (p − 1) unknowns (each am is a unit vector in p dimensions, so it has p − 1 free parameters).
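For the toy data p = 2, so each direction is a single angle and there are M(p − 1) = 2 unknowns. A crude sketch of this outer optimization, reusing the backfit helper and X from the previous sketches (illustrative only; stats::ppr does all of this far more carefully):

    ppr.rss <- function(theta, y, X) {                       # RSS as a function of the two angles
      dirs <- lapply(theta, function(t) c(cos(t), sin(t)))   # angles -> unit vectors
      fits <- backfit(y, X, dirs)                            # inner backfitting loop
      Z <- sapply(dirs, function(a) drop(X %*% a))
      pred <- Reduce(`+`, lapply(seq_along(fits),
                     function(m) approx(fits[[m]]$x, fits[[m]]$y, xout = Z[, m])$y), init = 0)
      sum((y - pred)^2)
    }
    opt <- optim(c(0.3, 2.0), ppr.rss, y = toy$y, X = X)     # arbitrary starting angles
    opt$par   # ideally close to pi/4 and 3*pi/4, i.e. the (1,1) and (1,-1) directions (the surface is rough)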
Projection Pursuit Examples
[Figure: PPR fit to the toy data: the two estimated ridge terms (term 1 and term 2) and pp.toy$fitted.values plotted against y.]
• Fits for the toy data and the iris data (see the sketch below).
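In R the whole procedure is available as stats::ppr. A sketch of fits along these lines (the toy data frame comes from the earlier simulation; coding Species numerically for iris is only a crude regression-style illustration, not necessarily what the slides did):

    pp.toy <- ppr(y ~ x1 + x2, data = toy, nterms = 2)   # two ridge terms, matching the true decomposition
    summary(pp.toy)            # the estimated directions should be roughly proportional to (1,1) and (1,-1)
    plot(pp.toy)               # the fitted ridge functions ("term 1", "term 2"), both roughly quadratic
    plot(pp.toy$fitted.values, toy$y)                    # fitted values against y, as in the figure above

    pp.iris <- ppr(as.numeric(Species) ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                   data = iris, nterms = 2)
    plot(pp.iris)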