Answers week 4 - PCA
2022-10-06
SVD and Eigendecomposition
An alternative form of the singular value decomposition of centered data matrix C is:
C = \sum_{j=1}^{k} d_j u_j v_j^T,
where u_j is the j-th left-singular vector of C and v_j is the j-th right-singular vector of C. From this alternative form it is clear that the singular vectors are only defined up to sign. If a left-singular vector has its sign changed, changing the sign of the corresponding right-singular vector gives an equivalent decomposition, that is, u_j v_j^T = (-u_j)(-v_j)^T. This information is relevant for the rest of this exercise.
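As a quick numerical illustration of this sign ambiguity (a minimal NumPy sketch with made-up vectors, not part of the original R exercise), flipping the signs of a paired left and right singular vector leaves the rank-one term unchanged:

```python
import numpy as np

# Hypothetical singular vectors, for illustration only
u = np.array([[1.0], [2.0], [3.0]])   # column vector u_j
v = np.array([[0.6], [0.8]])          # column vector v_j

term = u @ v.T            # d_j is omitted; it is unaffected by the sign flip
flipped = (-u) @ (-v.T)   # flip the sign of both vectors

print(np.allclose(term, flipped))  # True: the rank-one terms are identical
```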
In this exercise, the data of Example 1 from the lecture slides are used. These data have been stored in the file
Example1.dat. Use the function read.table() to import the data into R. Use the function as.matrix()
to convert the data frame to a matrix. The two features are not centered. To center the two features the
function scale() with the argument scale=F can be used.
library(magrittr)  # provides the pipe operator %>%

data <- read.table('Example1.dat') %>%
  as.matrix() %>%
  scale(scale = FALSE)  # center the two features without rescaling
Use the function svd() to apply a singular value decomposition to the centered data matrix. Inspect the
three pieces of output, that is, U, D, and V.
singular <- svd(data)
singular
## $d
## [1] 31.219745  5.032642
##
## $u
##             [,1]        [,2]
## [1,] -0.26680000  0.28739123
## [2,] -0.05470446 -0.33737580
## [3,] -0.55896540 -0.13388431
## [4,] -0.30683494 -0.23562992
## [5,] -0.01466951  0.18564553
## [6,]  0.19742603 -0.43912150
## [7,]  0.48959143 -0.01784596
## [8,]  0.02536542  0.70866669
## [9,]  0.48959143 -0.01784596
##
## $v
##           [,1]       [,2]
## [1,] 0.4289437  0.9033312
## [2,] 0.9033312 -0.4289437
Are the right-singular vectors the same as on the slides?
Yes!
Use the information provided at the beginning of this exercise to correct for any possible differences. Then,
use a single matrix product to calculate the principal component scores. Plot the scores on the second
principal component (y-axis) against the scores on the first principal component (x-axis) and let the range of
the y-axis run from -16 to 16 and the range of the x-axis from -18 to 18.
# Principal component scores are
pc_scores <- data %*% singular$v
pc_scores
##            [,1]        [,2]
## [1,]  -8.3294282  1.44633709
## [2,]  -1.7078593 -1.69789153
## [3,] -17.4507576 -0.67379177
## [4,]  -9.5793087 -1.18584098
## [5,]  -0.4579784  0.93428745
## [6,]   6.1635905 -2.20994117
## [7,]  15.2849198 -0.08981231
## [8,]   0.7919021  3.56646553
## [9,]  15.2849198 -0.08981231
# These are the coordinates of the data in the eigenspace
plot(pc_scores[,1], pc_scores[,2], xlim = c(-18, 18), ylim = c(-16, 16))
[Figure: scatter plot of the principal component scores, pc_scores[, 2] (y-axis, -16 to 16) against pc_scores[, 1] (x-axis, -18 to 18).]
Next, use the centered data matrix and the sample size to calculate the sample covariance matrix.
covmat <- t(data) %*% data / (nrow(data) - 1)  # or var(data)
Use the function eigen() to apply an eigendecomposition to the sample covariance matrix. Check whether
the eigenvalues are equal to the variances of the two principal components.
eig_decomp <- eigen(covmat)
eig_decomp
## eigen() decomposition
## $values
## [1] 121.834063   3.165935
##
## $vectors
##           [,1]       [,2]
## [1,] 0.4289437 -0.9033312
## [2,] 0.9033312  0.4289437
var_pc_scores <- var(pc_scores)
var_pc_scores
##              [,1]         [,2]
## [1,] 1.218341e+02 5.729168e-15
## [2,] 5.729168e-15 3.165935e+00
all.equal(eig_decomp$values, diag(var_pc_scores))
## [1] TRUE
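The two decompositions are linked more directly: the eigenvalues of the sample covariance matrix equal the squared singular values of the centered data divided by n - 1. A small NumPy sketch (with made-up data, so the numbers differ from the exercise) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 2))
X = X - X.mean(axis=0)   # centered data matrix, like scale(scale = FALSE)

d = np.linalg.svd(X, compute_uv=False)   # singular values d_j
S = X.T @ X / (X.shape[0] - 1)           # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]    # eigenvalues, largest first

print(np.allclose(d**2 / (X.shape[0] - 1), eigvals))  # True: lambda_j = d_j^2 / (n - 1)
```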
Be aware that the base R function var() uses N - 1 in the denominator, to get an unbiased estimate of the variance. Finally, calculate the percentage of total variance explained by each principal component.
tot_var <- sum(diag(var_pc_scores))
diag(var_pc_scores)/tot_var
## [1] 0.97467252 0.02532748
Principal component analysis
In this exercise, a PCA is used to determine the financial strength of insurance companies. Eight relevant
features have been selected: (1) gross written premium, (2) net mathematical reserves, (3) gross claims paid,
(4) net premium reserves, (5) net claim reserves, (6) net income, (7) share capital, and (8) gross written
premium ceded in reinsurance. To perform a principal component analysis, an eigendecomposition can be
applied to the sample correlation matrix R instead of the sample covariance matrix S. Note that the sample
correlation matrix is the sample covariance matrix of the standardized features. These two ways of doing a
PCA will yield different results. If the features have the same scales (the same units), then the covariance
matrix should be used. If the features have different scales, then it’s better in general to use the correlation
matrix because otherwise the features with high absolute variances will dominate the results.
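The point that the correlation matrix is just the covariance matrix of the standardized features can be checked directly. A small NumPy sketch (synthetic data, not the insurance data) shows the equality:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])  # features on very different scales

corr = np.corrcoef(X, rowvar=False)   # sample correlation matrix

# Standardize the features; their sample covariance matrix is the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(np.cov(Z, rowvar=False), corr))  # True
```

This is why a PCA on standardized features and a PCA on the correlation matrix give the same result.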
The means and standard deviations of the features can be found in the following table.
First, we need to load the sample correlation matrix into a variable:
R<-matrix(c(1.00,0.32,0.95,0.94,0.84,0.22,0.47,0.82,
0.32,1.00,0.06,0.21,0.01,0.30,0.10,0.01,
0.95,0.06,1.00,0.94,0.89,0.14,0.44,0.81,
0.94,0.21,0.94,1.00,0.88,0.19,0.50,0.68,
0.84,0.01,0.89,0.88,1.00,-0.23,0.55,0.63,
0.22,0.30,0.14,0.19,-0.23,1.00,-0.15,0.21,
0.47,0.10,0.44,0.50,0.55,-0.15,1.00,0.14,
0.82,0.01,0.81,0.68,0.63,0.21,0.14,1.00),nrow=8)
R
##      [,1] [,2] [,3] [,4]  [,5]  [,6]  [,7] [,8]
## [1,] 1.00 0.32 0.95 0.94  0.84  0.22  0.47 0.82
## [2,] 0.32 1.00 0.06 0.21  0.01  0.30  0.10 0.01
## [3,] 0.95 0.06 1.00 0.94  0.89  0.14  0.44 0.81
## [4,] 0.94 0.21 0.94 1.00  0.88  0.19  0.50 0.68
## [5,] 0.84 0.01 0.89 0.88  1.00 -0.23  0.55 0.63
## [6,] 0.22 0.30 0.14 0.19 -0.23  1.00 -0.15 0.21
## [7,] 0.47 0.10 0.44 0.50  0.55 -0.15  1.00 0.14
## [8,] 0.82 0.01 0.81 0.68  0.63  0.21  0.14 1.00
Use R to apply a PCA to the sample correlation matrix. An alternative criterion for extracting a smaller number of principal components m than the number of original variables k, when applying a PCA to the sample correlation matrix, is the eigenvalue-greater-than-one rule. This rule says that m (the number of extracted principal components) should equal the number of eigenvalues greater than one. Since each of the standardized variables has a variance of one, the total variance is k. If a principal component has an eigenvalue greater than one, then its variance is greater than the variance of each of the original standardized variables, so this principal component explains more of the total variance than each of the original standardized variables does.
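Once the eigenvalues are available, the rule is a one-liner. Sketched here in NumPy (as a cross-check of the R workflow), using the eigenvalues reported for this exercise:

```python
import numpy as np

# Eigenvalues of the sample correlation matrix reported in this exercise
eigenvalues = np.array([4.654827640, 1.446030987, 1.014432893, 0.571991390,
                        0.252849855, 0.030896265, 0.024937666, 0.004033303])

m = int(np.sum(eigenvalues > 1))  # eigenvalue-greater-than-one rule
print(m)  # 3
```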
We use the function eigen() to apply an eigendecomposition.
EVD<-eigen(R)
EVD
## eigen() decomposition
## $values
## [1] 4.654827640 1.446030987 1.014432893 0.571991390 0.252849855 0.030896265
## [7] 0.024937666 0.004033303
##
## $vectors
##            [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## [1,] 0.45570531  0.12463861  0.02088898  0.09143931  0.08670416  0.15624367
## [2,] 0.08689900  0.52463760  0.64691821  0.49402517  0.09829888  0.03301796
## [3,] 0.45188484 -0.01953857 -0.15139278 -0.02028604 -0.15374784  0.77321029
## [4,] 0.44624499  0.04180405  0.03735239 -0.05121431 -0.45226743 -0.50350707
## [5,] 0.42053982 -0.29072335  0.03373389  0.15985973 -0.30317044 -0.21447390
## [6,] 0.05961452  0.72505439 -0.16605510 -0.57393946 -0.13338165 -0.05338675
## [7,] 0.25148835 -0.29178379  0.57778034 -0.59904953  0.39039893 -0.01771592
## [8,] 0.37120442  0.10830307 -0.44068482  0.17527574  0.70179853 -0.27195800
##              [,7]        [,8]
## [1,]  0.038874157  0.85706447
## [2,] -0.043269517 -0.20497515
## [3,]  0.136731994 -0.36317881
## [4,]  0.566496879 -0.12694808
## [5,] -0.751619348 -0.09534185
## [6,] -0.303156525 -0.03488190
## [7,]  0.009592252 -0.07815482
## [8,]  0.008640717 -0.24289099
(a) How many principal components should be extracted according to the eigenvalue-greater-than-one rule?
According to the eigenvalue-greater-than-one rule, 3 principal components should be extracted (three of the eigenvalues above are greater than one).
(b) How much of the total variance does this number of extracted principal components explain?
a = (4.654827640 + 1.446030987 + 1.014432893)
b = (4.654827640 + 1.446030987 + 1.014432893 + 0.571991390 + 0.252849855 +
     0.030896265 + 0.024937666 + 0.004033303)
# or: sum(EVD$values[1:3]) / sum(EVD$values)
a/b
## [1] 0.8894114
If we extract 3 principal components, then 89% of the total variance is explained.
(c) Make a scree-plot. How many principal components should be extracted according to the scree-plot?
We can make the scree-plot with the following code:
plot(EVD$values)

[Figure: scree plot of the eigenvalues EVD$values plotted against their index (1 to 8).]
plot(EVD$values, type = 'l')  # type = 'l' draws a line; 'line' is not a valid plot type

[Figure: scree plot with the eigenvalues connected by a line.]
According to the scree plot, 2 principal components should be extracted: the curve flattens out after the second eigenvalue, so the remaining components add comparatively little.
(d) How much of the total variance does this number of extracted principal components explain?
a = 4.654827640 + 1.446030987
b = (4.654827640 + 1.446030987 + 1.014432893 + 0.571991390 +
     0.252849855 + 0.030896265 + 0.024937666 + 0.004033303)
a/b
## [1] 0.7626073
If we extract 2 principal components, then 76% of the total variance is explained.
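Both proportions follow from a single cumulative sum over the eigenvalues. A NumPy sketch (cross-checking the R arithmetic, using the eigenvalues from this exercise):

```python
import numpy as np

eigenvalues = np.array([4.654827640, 1.446030987, 1.014432893, 0.571991390,
                        0.252849855, 0.030896265, 0.024937666, 0.004033303])

# Cumulative proportion of total variance explained by the first m components
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
print(cumulative[1])  # first two PCs: about 0.763
print(cumulative[2])  # first three PCs: about 0.889
```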