Outline

- Regression on a large number of correlated inputs
- A few comments about shrinkage methods, such as ridge regression
- Methods using derived input directions
  - Principal components regression
  - Partial least squares regression (PLS)

Data mining and statistical learning, lecture 4

Partitioning of the expected squared prediction error

    E(y_j - ŷ_j)^2 = [E(y_j) - E(ŷ_j)]^2 + Var(y_j - ŷ_j)

The first term is the squared bias. Shrinkage decreases the variance but increases the bias, and shrinkage methods are more robust to structural changes in the analysed data.

Advantages of ridge regression over OLS

- The models are easier to comprehend, because strongly correlated inputs tend to get similar regression coefficients
- Generalization to new data sets is facilitated by greater robustness to structural changes in the analysed data set

Ridge regression - a note on standardization

The principal components and the shrinkage in ridge regression are scale-dependent. The inputs are therefore normally standardized to mean zero and variance one prior to the regression.

Regression methods using derived input directions

Extract linear combinations of the inputs x_1, ..., x_p as derived features z_1, ..., z_M, and then model the target (response) y as a linear function of these features:

    z_m = α_m' x,  m = 1, ..., M
    y = θ_0 + β' z

[Figure: Absorbance records for ten samples of chopped meat (Sample_1, ..., Sample_10), absorbance plotted against channel 1-100]

- 1 response variable (fat)
- 100 predictors (absorbance at 100 wavelengths or channels)
- The predictors are strongly correlated to each other

[Figure: Absorbance records for samples of chopped meat - high-fat samples show systematically higher absorbance than low-fat samples across channels 1-100]

[Figure: 3-D scatterplot of absorbance records for samples of meat - channels 1, 50 and 100]

[Figure: 3-D scatterplot of absorbance records for samples of meat - channels 40, 50 and 60]

[Figure: 3-D scatterplot of absorbance records for samples of meat - channels 49, 50 and 51]

[Figure: Matrix plot of absorbance records for samples of meat - channels 1, 50 and 100]

Principal Component Analysis (PCA)

• PCA is a technique for reducing the complexity of high-dimensional data
• It can be used to approximate high-dimensional data with a few dimensions, so that important features can be visually examined
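The coefficient-similarity property of ridge regression noted above is easy to demonstrate numerically. A minimal sketch in Python (NumPy assumed; the two near-collinear inputs are simulated for illustration, not taken from the lecture's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated inputs, roughly standardized (mean ~0, variance ~1).
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Ridge estimate for centred inputs: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)     # ordinary least squares (lambda = 0)
beta_ridge = ridge(X, y, 10.0)  # shrunken coefficients

# With shrinkage, the strongly correlated inputs get similar coefficients,
# while OLS can split the shared signal between them almost arbitrarily.
print(np.round(beta_ols, 2), np.round(beta_ridge, 2))
```

The low-variance direction x1 - x2 is shrunk hardest, which is exactly why the two correlated inputs end up with nearly equal ridge coefficients.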
[Figure: Principal Component Analysis with two inputs - the directions PC1 and PC2 in the (X1, X2) plane]

[Figure: 3-D plot of artificially generated data with three inputs (x, y, z), with the directions PC1 and PC2 indicated]

Principal Component Analysis

- The first principal component (PC1) is the direction that maximizes the variance of the projected data
- The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed
- The third principal component (PC3) is the direction that maximizes the variance of the projected data after the variation along PC1 and PC2 has been removed

Eigenvector and eigenvalue

In this shear transformation of the Mona Lisa, the picture was deformed in such a way that its central vertical axis (red vector) was not modified, but the diagonal vector (blue) changed direction. Hence the red vector is an eigenvector of the transformation and the blue vector is not. Since the red vector was neither stretched nor compressed, its eigenvalue is 1.

Sample covariance matrix

    S = [ s_11 ... s_1m ]
        [ ...       ... ]
        [ s_m1 ... s_mm ]

where

    s_ij = Σ_{k=1}^{n} (x_ik - x̄_i)(x_jk - x̄_j) / (n - 1),  i = 1, ..., m,  j = 1, ..., m

Eigenvectors of covariance and correlation matrices

- The eigenvectors of a covariance matrix provide information about the major orthogonal directions of the variation in the inputs
- The eigenvalues provide information about the strength of the variation along the different eigenvectors
- The eigenvectors and eigenvalues of the correlation matrix provide scale-independent information about the variation of the inputs

Principal Component Analysis - two inputs

    Eigenanalysis of the Covariance Matrix

    Eigenvalue   2.8162   0.3835
    Proportion   0.880    0.120
    Cumulative   0.880    1.000

    Loadings
    Variable    PC1      PC2
    X1          0.523    0.852
    X2          0.852   -0.523

[Figure: Coordinates in the coordinate system determined by the principal components]

Principal Component Analysis - three inputs

    Eigenanalysis of the Covariance Matrix

    Eigenvalue   1.6502   0.7456   0.0075
    Proportion   0.687    0.310    0.003
    Cumulative   0.687    0.997    1.000

    Variable    PC1      PC2      PC3
    x           0.887    0.218   -0.407
    y           0.034   -0.909   -0.414
    z           0.460   -0.354    0.814

[Figure: Scree plot of the eigenvalues of x, ..., z against component number 1-3]

Principal Component Analysis - absorbance data from samples of chopped meat

    Eigenanalysis of the Covariance Matrix (first nine components)

    Eigenvalue   26.127   0.239   0.078   0.030   0.002   0.001   0.000   0.000   0.000
    Proportion    0.987   0.009   0.003   0.001   0.000   0.000   0.000   0.000   0.000
    Cumulative    0.987   0.996   0.999   1.000   1.000   1.000   1.000   1.000   1.000

Scree plot - absorbance data

[Figure: Scree plot of Channel1, ..., Channel100 - one direction is responsible for most of the variation in the inputs]
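The eigenanalysis shown in the output above can be reproduced in a few lines. A minimal sketch in Python (NumPy assumed; the two correlated inputs are simulated, not the lecture's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate two correlated inputs, roughly like the (X1, X2) example above.
n = 200
x1 = rng.normal(5.0, 1.0, size=n)
x2 = 1.5 * x1 + rng.normal(0.0, 0.6, size=n)
X = np.column_stack([x1, x2])

# Sample covariance matrix: s_ij = sum_k (x_ik - x̄_i)(x_jk - x̄_j) / (n - 1)
S = np.cov(X, rowvar=False)

# Eigenanalysis: eigenvalues give the variance along each principal direction,
# and the eigenvector columns are the loadings.
eigenvalues, loadings = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
eigenvalues, loadings = eigenvalues[order], loadings[:, order]

proportion = eigenvalues / eigenvalues.sum()
print("Eigenvalue ", np.round(eigenvalues, 4))
print("Proportion ", np.round(proportion, 3))
print("Cumulative ", np.round(np.cumsum(proportion), 3))
```

As in the absorbance example, one direction carries nearly all of the variation, so the first proportion is close to 1.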
Loadings of PC1, PC2 and PC3 - absorbance data

[Figure: Loadings of PC1, PC2 and PC3 plotted against channel number 1-100]

The loadings define derived inputs (linear combinations of the inputs).

Software recommendations

- Minitab 15: Stat -> Multivariate -> Principal components
- SAS Enterprise Miner: Princomp/Dmneural

Regression methods using derived input directions - Partial least squares regression

Extract linear combinations of the inputs x_1, ..., x_p as derived features z_1, ..., z_M, and then model the target (response) y as a linear function of these features.

- Select the intermediates so that the covariance with the response variable is maximized
- Normally, the inputs are standardized to mean zero and variance one prior to the PLS analysis

Partial least squares regression (PLS)

Step 1: Standardize the inputs to mean zero and variance one
Step 2: Compute the first derived input by setting

    z_1 = Σ_{j=1}^{p} φ_1j x_j

where the φ_1j are standardized univariate regression coefficients of the response vs each of the inputs

Repeat:
- Remove the variation in the inputs along the directions determined by the existing z-vectors
- Compute another derived input

Methods using derived input directions

- Principal components regression (PCR): the derived directions are determined by the X-matrix alone, and are orthogonal
- Partial least squares regression (PLS): the derived directions are determined by the covariance of the output and linear combinations of the inputs, and are orthogonal

PLS in SAS

The following statements are available in PROC PLS. Items within the brackets < > are optional.
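The PLS steps above can be sketched directly. A simplified illustration in Python (NumPy assumed; a real analysis would use PROC PLS or a statistics library):

```python
import numpy as np

def pls_directions(X, y, M):
    """Derive M PLS inputs z_m following the lecture's steps (simplified)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 1: standardize the inputs to mean zero and variance one.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Z = []
    for _ in range(M):
        # Step 2: the weights phi_j are proportional to the univariate
        # regression coefficients of the response on each standardized input.
        phi = X.T @ y
        z = X @ phi
        Z.append(z)
        # Repeat: remove the variation in X along the direction of z.
        X = X - np.outer(z, z @ X) / (z @ z)
    return np.column_stack(Z)

# Tiny usage example with simulated data.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=30)
Z = pls_directions(X, y, M=2)
print(Z.shape)  # (30, 2)
```

Because each deflation step removes the component of X along the current z, the derived inputs come out mutually orthogonal, as stated above.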
    PROC PLS < options > ;
      BY variables ;
      CLASS variables < / option > ;
      MODEL dependent-variables = effects < / options > ;
      OUTPUT OUT= SAS-data-set < options > ;

To analyze a data set, you must use the PROC PLS and MODEL statements. You can use the other statements as needed.

proc PLS in SAS

    proc pls data=mining.tecatorscores method=pls nfac=10;
      model fat=channel1-channel100;
      output out=tecatorpls predicted=predpls;
    run;

    proc pls data=mining.tecatorscores method=pcr nfac=10;
      model fat=channel1-channel100;
      output out=tecatorpcr predicted=predpcr;
    run;