Lecture 4

Outline
• Regression on a large number of correlated inputs
• A few comments about shrinkage methods, such as ridge regression
• Methods using derived input directions
  – Principal components regression
  – Partial least squares regression (PLS)
Partitioning of the expected squared prediction error


$$E\bigl(y_j - \hat{y}_j\bigr)^2 = \underbrace{\bigl(E(y_j) - E(\hat{y}_j)\bigr)^2}_{\text{bias}^2} + \mathrm{Var}\bigl(y_j - \hat{y}_j\bigr)$$
Shrinkage decreases the variance but increases the
bias
Shrinkage methods are more robust to structural
changes in the analysed data
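The partition can be checked numerically. The sketch below is my own illustration in Python (not an example from the lecture): it uses an artificial biased predictor and verifies that the sample analogues of the two sides agree.

import numpy as np

# Monte Carlo check of the partition (illustration only, artificial data):
# mean((y - yhat)^2) = (mean(y) - mean(yhat))^2 + var(y - yhat)
rng = np.random.default_rng(0)
n = 100_000
y = 2.0 + rng.normal(0.0, 1.0, n)        # observations with true mean 2
y_hat = 1.5 + rng.normal(0.0, 0.5, n)    # a biased but less variable predictor

mspe = np.mean((y - y_hat) ** 2)                  # expected squared prediction error
bias_sq = (np.mean(y) - np.mean(y_hat)) ** 2      # squared bias
var_term = np.var(y - y_hat)                      # variance of the prediction error
print(mspe, bias_sq + var_term)                   # the two values coincide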
Advantages of ridge regression over OLS
The models are easier to comprehend because strongly
correlated inputs tend to get similar regression
coefficients
Generalization to new data sets is facilitated by greater
robustness to structural changes in the analysed data set
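A hypothetical illustration of the first point (my own sketch with simulated data, using scikit-learn, not the lecture's example): when two inputs are almost copies of each other, the OLS coefficients are unstable, whereas ridge regression gives the correlated inputs similar coefficients.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)     # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)  # true coefficients are (1, 1)

print(LinearRegression().fit(X, y).coef_)    # typically far from (1, 1), often opposite signs
print(Ridge(alpha=1.0).fit(X, y).coef_)      # similar coefficients, both close to 1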
Ridge regression
- a note on standardization
The principal components and the shrinkage in ridge
regression are scale-dependent.
Inputs are normally standardized to mean zero and variance
one prior to the regression
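A minimal sketch of that preprocessing step, assuming a scikit-learn workflow rather than the course software: the scaling and the ridge fit are combined in one pipeline so that new data are standardized with the same means and variances.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize each input to mean zero and variance one, then fit the ridge model.
# X (n-by-p input matrix) and y (response vector) are placeholders.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# model.fit(X, y)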
Regression methods using derived input directions
Extract linear combinations of
the inputs as derived features,
and then model the target
(response) as a linear function
of these features
$$z_m = \alpha_{0m} + \alpha_m^T x, \quad m = 1, \ldots, M$$

$$y = \beta_0 + \beta^T z$$

[Diagram: the inputs $x_1, x_2, \ldots, x_p$ are combined into derived features $z_1, z_2, \ldots, z_M$, which in turn are used to predict the response $y$]
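In code, the two equations amount to forming derived features as linear combinations of the inputs and regressing the response on them. The sketch below is my own schematic illustration with arbitrary directions; principal components regression and PLS differ only in how the directions are chosen.

import numpy as np

# Schematic regression on derived input directions (illustration with random directions).
rng = np.random.default_rng(2)
n, p, M = 100, 10, 2
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=n)

A = rng.normal(size=(p, M))                 # columns alpha_1, ..., alpha_M (here arbitrary)
Z = X @ A                                   # derived features z_1, ..., z_M
Z1 = np.column_stack([np.ones(n), Z])       # add an intercept column
beta = np.linalg.lstsq(Z1, y, rcond=None)[0]
y_hat = Z1 @ beta                           # fitted values based on the derived features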
Absorbance records for ten samples of chopped meat

[Figure: absorbance (0.0–5.0) versus channel (1–100) for Sample_1 through Sample_10]

• 1 response variable (fat)
• 100 predictors (absorbance at 100 wavelengths or channels)
• The predictors are strongly correlated to each other
Absorbance records for ten samples of chopped meat

[Figure: absorbance (0.0–6.0) versus channel (1–100) for Sample_12, Sample_133, Sample_48, Sample_145, Sample_176, Sample_186, Sample_215, Sample_43, Sample_44 and Sample_45; high-fat and low-fat samples are indicated]
3-D plots of absorbance records for samples of meat
- channels 1, 50 and 100

[Figure: 3D scatterplot of Channel1 vs Channel50 vs Channel100]
3-D plots of absorbance records for samples of meat
- channels 40, 50 and 60

[Figure: 3D scatterplot of Channel60 vs Channel50 vs Channel40]
3-D plot of absorbance records for samples of meat
- channels 49, 50 and 51

[Figure: 3D scatterplot of Channel49 vs Channel50 vs Channel51]
Matrix plot of absorbance records for samples of meat
- channels 1, 50 and 100

[Figure: matrix plot of Channel1, Channel50 and Channel100]
Principal Component Analysis (PCA)
• PCA is a technique for reducing the complexity of high-dimensional data
• It can be used to approximate high-dimensional data with a few dimensions so that important features can be visually examined
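As an illustration (my own sketch with artificial data, using scikit-learn), the first few principal components can serve as low-dimensional coordinates for plotting:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 100))
X[:, 1:] += 0.9 * X[:, [0]]              # make the columns strongly correlated

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # coordinates along PC1 and PC2
print(pca.explained_variance_ratio_)     # share of the variation captured by each component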
Principal Component Analysis
- two inputs
[Figure: scatterplot of X2 against X1 with the PC1 and PC2 directions drawn through the point cloud]
3-D plot of artificially generated data
- three inputs
[Figure: 3D surface plot of z vs y, x with the PC1 and PC2 directions indicated]
Principal Component Analysis
The first principal component (PC1) is the direction that
maximizes the variance of the projected data
The second principal component (PC2) is the direction
that maximizes the variance of the projected data after
the variation along PC1 has been removed
The third principal component (PC3) is the direction that
maximizes the variance of the projected data after the
variation along PC1 and PC2 has been removed
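Numerically, these directions are the eigenvectors of the sample covariance matrix, ordered by eigenvalue. The following sketch (my own illustration with simulated data) checks that the variances of the projected data equal the eigenvalues and decrease from PC1 to PC3.

import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0, 0], [[3, 2, 1], [2, 2, 1], [1, 1, 1]], size=500)

S = np.cov(X, rowvar=False)              # sample covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]    # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

projections = (X - X.mean(axis=0)) @ eigenvectors
print(projections.var(axis=0, ddof=1))   # decreasing variances, equal to the eigenvalues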
Eigenvector and eigenvalue
[Image: shear transformation of the Mona Lisa]

In this shear transformation of the Mona Lisa, the picture was deformed in such a way that its central vertical axis (red vector) was not modified, but the diagonal vector (blue) changed direction. Hence the red vector is an eigenvector of the transformation and the blue vector is not. Since the red vector was neither stretched nor compressed, its eigenvalue is 1.
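A numerical analogue of the picture (my own illustration): a shear matrix that leaves the vertical axis fixed has that axis as an eigenvector with eigenvalue 1, while a diagonal vector changes direction and is therefore not an eigenvector.

import numpy as np

shear = np.array([[1.0, 0.0],
                  [0.5, 1.0]])
vertical_axis = np.array([0.0, 1.0])
diagonal = np.array([1.0, 1.0])

print(shear @ vertical_axis)   # [0. 1.]  -> unchanged, eigenvalue 1
print(shear @ diagonal)        # [1. 1.5] -> no longer parallel to [1, 1]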
Sample covariance matrix

$$S = \begin{pmatrix} s_{11} & \cdots & s_{1m} \\ \vdots & \ddots & \vdots \\ s_{m1} & \cdots & s_{mm} \end{pmatrix}$$

where

$$s_{ij} = \frac{\sum_{k=1}^{n} (x_{ik} - \bar{x}_{i\cdot})(x_{jk} - \bar{x}_{j\cdot})}{n-1}, \qquad i = 1, \ldots, m, \; j = 1, \ldots, m$$
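The same matrix can be computed directly from the formula or with library routines; the sketch below (my own illustration with artificial data) checks that the element-wise formula agrees with numpy's np.cov, which also divides by n − 1.

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))             # n = 20 observations of m = 3 variables

n = X.shape[0]
centered = X - X.mean(axis=0)
S_manual = centered.T @ centered / (n - 1)   # element-wise version of the formula above
S_numpy = np.cov(X, rowvar=False)
print(np.allclose(S_manual, S_numpy))    # True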
Eigenvectors of covariance and correlation matrices
The eigenvectors of a covariance matrix provide
information about the major orthogonal directions of the
variation in the inputs
The eigenvalues provide information about the strength
of the variation along the different eigenvectors
The eigenvectors and eigenvalues of the correlation
matrix provide scale-independent information about the
variation of the inputs
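A small numerical illustration of the scale dependence (my own sketch): rescaling one input changes the eigenvectors of the covariance matrix but leaves those of the correlation matrix unchanged.

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 2))
X[:, 1] += 0.5 * X[:, 0]                  # make the two inputs correlated
X_rescaled = X * np.array([1.0, 100.0])   # second input measured in different units

print(np.linalg.eigh(np.cov(X, rowvar=False))[1])
print(np.linalg.eigh(np.cov(X_rescaled, rowvar=False))[1])       # eigenvectors change
print(np.linalg.eigh(np.corrcoef(X, rowvar=False))[1])
print(np.linalg.eigh(np.corrcoef(X_rescaled, rowvar=False))[1])  # unchanged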
Principal Component Analysis
[Figure: scatterplot of X2 against X1 with the PC1 and PC2 directions indicated]

Eigenanalysis of the Covariance Matrix

Eigenvalue   2.8162   0.3835
Proportion   0.880    0.120
Cumulative   0.880    1.000

Loadings:

Variable   PC1      PC2
X1         0.523    0.852
X2         0.852   -0.523
Principal Component Analysis
Coordinates in the coordinate system determined by the principal components
Principal Component Analysis
[Figure: 3D surface plot of z vs y, x]

Eigenanalysis of the Covariance Matrix

Eigenvalue   1.6502   0.7456   0.0075
Proportion   0.687    0.310    0.003
Cumulative   0.687    0.997    1.000

Variable   PC1      PC2      PC3
x          0.887    0.218   -0.407
y          0.034   -0.909   -0.414
z          0.460   -0.354    0.814
Scree plot
[Figure: scree plot of the eigenvalues of x, y and z against component number 1–3]
Principal Component Analysis
- absorbance data from samples of chopped meat
Eigenanalysis of the Covariance Matrix

Eigenvalue   26.127   0.239   0.078   0.030   0.002   0.001   0.000   0.000   0.000
Proportion    0.987   0.009   0.003   0.001   0.000   0.000   0.000   0.000   0.000
Cumulative    0.987   0.996   0.999   1.000   1.000   1.000   1.000   1.000   1.000
Scree plot
- absorbance data
[Figure: scree plot of the eigenvalues of Channel1, ..., Channel100 against component number 1–100]

One direction is responsible for most of the variation in the inputs.
Loadings of PC1, PC2 and PC3
- absorbance data
[Figure: loadings of PC1, PC2 and PC3 plotted against channel number 1–100]

The loadings define derived inputs (linear combinations of the inputs).
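In code (a sketch assuming a fitted scikit-learn PCA object, not the Minitab output above), the derived inputs are obtained by multiplying the centred inputs with the loading vectors:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))

pca = PCA(n_components=3).fit(X)
Z = (X - pca.mean_) @ pca.components_.T   # derived inputs z_1, z_2, z_3
print(np.allclose(Z, pca.transform(X)))   # True: same as PCA's own projection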
Software recommendations
Minitab 15: Stat → Multivariate → Principal Components
SAS Enterprise Miner: Princomp/Dmneural
Regression methods using derived input directions
- Partial Least Squares Regression
Extract linear combinations of the
inputs as derived features, and then
model the target (response) as a
linear function of these features
Select the intermediates so that the
covariance with the response
variable is maximized
Normally, the inputs are
standardized to mean zero and
variance one prior to the PLS
analysis
[Diagram: the inputs x_1, x_2, ..., x_p are combined into derived features z_1, z_2, ..., z_M, which in turn are used to predict the response y]
Partial least squares regression (PLS)
Step 1: Standardize inputs to mean zero and variance one
Step 2: Compute the first derived input by setting
$$z_1 = \sum_{j=1}^{p} \varphi_{1j}\, x_j$$
where the $\varphi_{1j}$ are standardized univariate regression
coefficients of the response vs each of the inputs
Repeat:
Remove the variation in the inputs along the directions
determined by existing z-vectors
Compute another derived input
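A sketch of steps 1 and 2 in Python (my own illustration with simulated data; the repeat step that removes the variation along z_1 is omitted):

import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(scale=0.2, size=n)

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # step 1: standardize the inputs
y_c = y - y.mean()

# step 2: univariate regression coefficient of y on each standardized input ...
phi1 = (X_std * y_c[:, None]).sum(axis=0) / (X_std ** 2).sum(axis=0)
z1 = X_std @ phi1                                      # ... weighted sum gives the first derived input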
Methods using derived input directions
Principal components regression (PCR)
The derived directions are determined by the X-matrix alone, and
are orthogonal
Partial least squares regression (PLS)
The derived directions are determined by the covariance
of the output and linear combinations of the inputs, and are
orthogonal
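For comparison, both methods are also available in scikit-learn; the sketch below (my own illustration, with M = 10 derived directions, not the course's Minitab/SAS workflow) shows how they could be set up.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

# PCR: choose the directions from the inputs alone (principal components), then regress.
pcr = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())
# PLS: the directions also use the response; PLSRegression standardizes inputs by default.
pls = PLSRegression(n_components=10)
# pcr.fit(X, y); pls.fit(X, y)   # X: n-by-p inputs, y: response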
PLS in SAS
The following statements are available in PROC PLS. Items within
the brackets < > are optional.
PROC PLS < options > ;
BY variables ;
CLASS variables < / option > ;
MODEL dependent-variables = effects < / options > ;
OUTPUT OUT= SAS-data-set < options > ;
To analyze a data set, you must use the PROC PLS and MODEL
statements. You can use the other statements as needed.
proc PLS in SAS
/* Partial least squares regression with 10 factors */
proc pls data=mining.tecatorscores method=pls nfac=10;
   model fat=channel1-channel100;
   output out=tecatorpls predicted=predpls;
run;

/* Principal components regression with 10 factors */
proc pls data=mining.tecatorscores method=pcr nfac=10;
   model fat=channel1-channel100;
   output out=tecatorpcr predicted=predpcr;
run;