Answers week 4 - PCA
2022-10-06
SVD and Eigendecomposition
An alternative form of the singular value decomposition of centered data matrix C is:
C = \sum_{j=1}^{k} d_j u_j v_j^T,
where u_j is the j-th left-singular vector of C and v_j is the j-th right-singular vector of C. From this alternative form it is clear that the singular vectors are only defined up to sign. If a left-singular vector has its sign changed, changing the sign of the corresponding right-singular vector gives an equivalent decomposition, that is, u_j v_j^T = (-u_j)(-v_j)^T. This information is relevant for the rest of this exercise.
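As a quick numerical illustration of this sign ambiguity (a minimal NumPy sketch with made-up vectors, not part of the original R exercise), flipping the signs of a paired left and right singular vector leaves the rank-one term unchanged:

```python
import numpy as np

# Hypothetical singular vectors, for illustration only
u = np.array([[1.0], [2.0], [3.0]])   # column vector u_j
v = np.array([[0.6], [0.8]])          # column vector v_j

term = u @ v.T            # d_j is omitted; it is unaffected by the sign flip
flipped = (-u) @ (-v.T)   # flip the sign of both vectors

print(np.allclose(term, flipped))  # True: the rank-one terms are identical
```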
In this exercise, the data of Example 1 from the lecture slides are used. These data have been stored in the file
Example1.dat. Use the function read.table() to import the data into R. Use the function as.matrix()
to convert the data frame to a matrix. The two features are not centered. To center the two features the
function scale() with the argument scale=F can be used.
library(magrittr)  # provides the pipe operator %>%

data <- read.table('Example1.dat') %>%
  as.matrix() %>%
  scale(scale = FALSE)  # center the two features without rescaling
Use the function svd() to apply a singular value decomposition to the centered data matrix. Inspect the
three pieces of output, that is, U, D, and V.
singular <- svd(data)
singular
## $d
## [1] 31.219745  5.032642
##
## $u
##             [,1]        [,2]
## [1,] -0.26680000  0.28739123
## [2,] -0.05470446 -0.33737580
## [3,] -0.55896540 -0.13388431
## [4,] -0.30683494 -0.23562992
## [5,] -0.01466951  0.18564553
## [6,]  0.19742603 -0.43912150
## [7,]  0.48959143 -0.01784596
## [8,]  0.02536542  0.70866669
## [9,]  0.48959143 -0.01784596
##
## $v
##           [,1]       [,2]
## [1,] 0.4289437  0.9033312
## [2,] 0.9033312 -0.4289437
Are the right-singular vectors the same as on the slides?
Yes!
Use the information provided at the beginning of this exercise to correct for any possible differences. Then,
use a single matrix product to calculate the principal component scores. Plot the scores on the second
principal component (y-axis) against the scores on the first principal component (x-axis) and let the range of
the y-axis run from -16 to 16 and the range of the x-axis from -18 to 18.
# Principal component scores are
pc_scores <- data %*% singular$v
pc_scores
##            [,1]        [,2]
## [1,]  -8.3294282  1.44633709
## [2,]  -1.7078593 -1.69789153
## [3,] -17.4507576 -0.67379177
## [4,]  -9.5793087 -1.18584098
## [5,]  -0.4579784  0.93428745
## [6,]   6.1635905 -2.20994117
## [7,]  15.2849198 -0.08981231
## [8,]   0.7919021  3.56646553
## [9,]  15.2849198 -0.08981231
# These are the coordinates of the data in the eigenspace
plot(pc_scores[,1], pc_scores[,2], xlim = c(-18, 18), ylim = c(-16, 16))
[Figure: scatter plot of the principal component scores, pc_scores[, 2] (y-axis, -16 to 16) against pc_scores[, 1] (x-axis, -18 to 18).]
Next, use the centered data matrix and the sample size to calculate the sample covariance matrix.
covmat <- t(data) %*% data / (nrow(data) - 1)  # or var(data)
Use the function eigen() to apply an eigendecomposition to the sample covariance matrix. Check whether
the eigenvalues are equal to the variances of the two principal components.
eig_decomp <- eigen(covmat)
eig_decomp
## eigen() decomposition
## $values
## [1] 121.834063   3.165935
##
## $vectors
##           [,1]       [,2]
## [1,] 0.4289437 -0.9033312
## [2,] 0.9033312  0.4289437
var_pc_scores <- var(pc_scores)
var_pc_scores
##              [,1]         [,2]
## [1,] 1.218341e+02 5.729168e-15
## [2,] 5.729168e-15 3.165935e+00
all.equal(eig_decomp$values, diag(var_pc_scores))
## [1] TRUE
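The two decompositions are linked more directly: the eigenvalues of the sample covariance matrix equal the squared singular values of the centered data divided by n - 1. A small NumPy sketch (with made-up data, so the numbers differ from the exercise) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 2))
X = X - X.mean(axis=0)   # centered data matrix, like scale(scale = FALSE)

d = np.linalg.svd(X, compute_uv=False)   # singular values d_j
S = X.T @ X / (X.shape[0] - 1)           # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]    # eigenvalues, largest first

print(np.allclose(d**2 / (X.shape[0] - 1), eigvals))  # True: lambda_j = d_j^2 / (n - 1)
```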
Be aware that the base R function var() uses N - 1 in the denominator, to get an unbiased estimate of the variance. Finally, calculate the percentage of total variance explained by each principal component.
tot_var <- sum(diag(var_pc_scores))
diag(var_pc_scores)/tot_var
## [1] 0.97467252 0.02532748
Principal component analysis
In this exercise, a PCA is used to determine the financial strength of insurance companies. Eight relevant
features have been selected: (1) gross written premium, (2) net mathematical reserves, (3) gross claims paid,
(4) net premium reserves, (5) net claim reserves, (6) net income, (7) share capital, and (8) gross written
premium ceded in reinsurance. To perform a principal component analysis, an eigendecomposition can be
applied to the sample correlation matrix R instead of the sample covariance matrix S. Note that the sample
correlation matrix is the sample covariance matrix of the standardized features. These two ways of doing a
PCA will yield different results. If the features have the same scales (the same units), then the covariance
matrix should be used. If the features have different scales, then it’s better in general to use the correlation
matrix because otherwise the features with high absolute variances will dominate the results.
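The point that the correlation matrix is just the covariance matrix of the standardized features can be checked directly. A small NumPy sketch (synthetic data, not the insurance data) shows the equality:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])  # features on very different scales

corr = np.corrcoef(X, rowvar=False)   # sample correlation matrix

# Standardize the features; their sample covariance matrix is the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(np.cov(Z, rowvar=False), corr))  # True
```

This is why a PCA on standardized features and a PCA on the correlation matrix give the same result.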
The means and standard deviations of the features can be found in the following table.
First, we need to load the sample correlation matrix into a variable:
R<-matrix(c(1.00,0.32,0.95,0.94,0.84,0.22,0.47,0.82,
0.32,1.00,0.06,0.21,0.01,0.30,0.10,0.01,
0.95,0.06,1.00,0.94,0.89,0.14,0.44,0.81,
0.94,0.21,0.94,1.00,0.88,0.19,0.50,0.68,
0.84,0.01,0.89,0.88,1.00,-0.23,0.55,0.63,
0.22,0.30,0.14,0.19,-0.23,1.00,-0.15,0.21,
0.47,0.10,0.44,0.50,0.55,-0.15,1.00,0.14,
0.82,0.01,0.81,0.68,0.63,0.21,0.14,1.00),nrow=8)
R
##      [,1] [,2] [,3] [,4]  [,5]  [,6]  [,7] [,8]
## [1,] 1.00 0.32 0.95 0.94  0.84  0.22  0.47 0.82
## [2,] 0.32 1.00 0.06 0.21  0.01  0.30  0.10 0.01
## [3,] 0.95 0.06 1.00 0.94  0.89  0.14  0.44 0.81
## [4,] 0.94 0.21 0.94 1.00  0.88  0.19  0.50 0.68
## [5,] 0.84 0.01 0.89 0.88  1.00 -0.23  0.55 0.63
## [6,] 0.22 0.30 0.14 0.19 -0.23  1.00 -0.15 0.21
## [7,] 0.47 0.10 0.44 0.50  0.55 -0.15  1.00 0.14
## [8,] 0.82 0.01 0.81 0.68  0.63  0.21  0.14 1.00
Use R to apply a PCA to the sample correlation matrix. An alternative criterion for extracting a smaller number of principal components m than the number of original variables k, when applying a PCA to the sample correlation matrix, is the eigenvalue-greater-than-one rule. This rule says that m (the number of extracted principal components) should equal the number of eigenvalues greater than one. Since each of the standardized variables has a variance of one, the total variance is k. If a principal component has an eigenvalue greater than one, then its variance is greater than the variance of each of the original standardized variables, so this principal component explains more of the total variance than each of the original standardized variables does.
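Once the eigenvalues are available, the rule is a one-liner. Sketched here in NumPy (as a cross-check of the R workflow), using the eigenvalues reported for this exercise:

```python
import numpy as np

# Eigenvalues of the sample correlation matrix reported in this exercise
eigenvalues = np.array([4.654827640, 1.446030987, 1.014432893, 0.571991390,
                        0.252849855, 0.030896265, 0.024937666, 0.004033303])

m = int(np.sum(eigenvalues > 1))  # eigenvalue-greater-than-one rule
print(m)  # 3
```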
We use the function eigen() to apply an eigendecomposition.
EVD<-eigen(R)
EVD
## eigen() decomposition
## $values
## [1] 4.654827640 1.446030987 1.014432893 0.571991390 0.252849855 0.030896265
## [7] 0.024937666 0.004033303
##
## $vectors
##            [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## [1,] 0.45570531  0.12463861  0.02088898  0.09143931  0.08670416  0.15624367
## [2,] 0.08689900  0.52463760  0.64691821  0.49402517  0.09829888  0.03301796
## [3,] 0.45188484 -0.01953857 -0.15139278 -0.02028604 -0.15374784  0.77321029
## [4,] 0.44624499  0.04180405  0.03735239 -0.05121431 -0.45226743 -0.50350707
## [5,] 0.42053982 -0.29072335  0.03373389  0.15985973 -0.30317044 -0.21447390
## [6,] 0.05961452  0.72505439 -0.16605510 -0.57393946 -0.13338165 -0.05338675
## [7,] 0.25148835 -0.29178379  0.57778034 -0.59904953  0.39039893 -0.01771592
## [8,] 0.37120442  0.10830307 -0.44068482  0.17527574  0.70179853 -0.27195800
##              [,7]        [,8]
## [1,]  0.038874157  0.85706447
## [2,] -0.043269517 -0.20497515
## [3,]  0.136731994 -0.36317881
## [4,]  0.566496879 -0.12694808
## [5,] -0.751619348 -0.09534185
## [6,] -0.303156525 -0.03488190
## [7,]  0.009592252 -0.07815482
## [8,]  0.008640717 -0.24289099
(a) How many principal components should be extracted according to the eigenvalue-greater-than-one rule?
According to the eigenvalue-greater-than-one rule, 3 principal components should be extracted (three of the eigenvalues above are greater than one).
(b) How much of the total variance does this number of extracted principal components explain?
a = (4.654827640 + 1.446030987 + 1.014432893)
b = (4.654827640 + 1.446030987 + 1.014432893 + 0.571991390 + 0.252849855 +
     0.030896265 + 0.024937666 + 0.004033303)
# or: sum(EVD$values[1:3]) / sum(EVD$values)
a/b
## [1] 0.8894114
If we extract 3 principal components, then 89% of the total variance is explained.
(c) Make a scree-plot. How many principal components should be extracted according to the scree-plot?
We can make the scree-plot with the following code:
plot(EVD$values)

[Figure: scree plot of the eigenvalues EVD$values plotted against their index (1 to 8).]
plot(EVD$values, type = 'l')  # type = 'l' draws a line; 'line' is not a valid plot type

[Figure: scree plot with the eigenvalues connected by a line.]
According to the scree plot, 2 principal components should be extracted: the curve flattens out after the second eigenvalue, so the remaining components add comparatively little.
(d) How much of the total variance does this number of extracted principal components explain?
a = 4.654827640 + 1.446030987
b = (4.654827640 + 1.446030987 + 1.014432893 + 0.571991390 +
     0.252849855 + 0.030896265 + 0.024937666 + 0.004033303)
a/b
## [1] 0.7626073
If we extract 2 principal components, then 76% of the total variance is explained.
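Both proportions follow from a single cumulative sum over the eigenvalues. A NumPy sketch (cross-checking the R arithmetic, using the eigenvalues from this exercise):

```python
import numpy as np

eigenvalues = np.array([4.654827640, 1.446030987, 1.014432893, 0.571991390,
                        0.252849855, 0.030896265, 0.024937666, 0.004033303])

# Cumulative proportion of total variance explained by the first m components
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
print(cumulative[1])  # first two PCs: about 0.763
print(cumulative[2])  # first three PCs: about 0.889
```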