Correspondence Analysis - Winona State University

Correspondence Analysis in R and JMP
Correspondence analysis (CA) is a dimension reduction technique for contingency
tables (cross-tabulations) of nominal or ordinal variables. It is similar to principal
components analysis for continuous variables. The data matrix for simple
correspondence analysis is typically a two-way contingency table, but it could be any
other table of non-negative ratio-scale data where relative values are of primary
interest. Singular value decomposition (SVD) of a properly scaled matrix is used to
achieve the dimension reduction, typically to k = 2 or k = 3 dimensions. Multiple
correspondence analysis (MCA) handles higher-dimensional situations, such as a
three-way contingency table. We will not dig too deeply into the theory in this course,
especially for MCA, but we will consider how SVD is used to create the
lower-dimensional representation of the data.
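To make the role of the SVD concrete before turning to contingency tables, here is a minimal sketch of a rank-k approximation in R. The matrix X and the names X.k, share, and err are made up for illustration and are not part of the course data.

```r
# Toy 4 x 3 matrix (made-up numbers, not from the handout).
X <- matrix(c(2, 0, 1,
              4, 1, 2,
              6, 1, 3,
              8, 3, 5), nrow = 4, byrow = TRUE)

s <- svd(X)            # X = U diag(d) V'
k <- 2                 # number of dimensions to keep

# Rank-k reconstruction from the first k singular triplets.
X.k <- s$u[, 1:k] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k])

# Share of the total sum of squares captured by the first k dimensions.
share <- sum(s$d[1:k]^2) / sum(s$d^2)

# Squared approximation error; it equals the sum of the discarded
# squared singular values.
err <- sum((X - X.k)^2)
```

Correspondence analysis applies the same idea to a scaled residual matrix, and "share" plays the role of the percentage of inertia explained by the first k axes.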
Correspondence Analysis in R
The data for this example are taken from a study of suicides in the former West
Germany in the years 1974 to 1977, reported by Van der Heijden and de Leeuw (1985).
Nine methods of suicide were tabulated by sex and age category. The primary interest
here is in the variation of suicide patterns by age and by sex. The data can be regarded
as a two-way contingency table, with the 34 age/sex categories forming the rows and
the nine methods of suicide forming the columns.
> Suicide
      pois cookgas toxgas hang drown gun knife jump other
m1015    4       0      0  247     1  17     1    6     9
m1520  348       7     67  578    22 179    11   74   175
m2025  808      32    229  699    44 316    35  109   289
m2530  789      26    243  648    52 268    38  109   226
m3035  916      17    257  825    74 291    52  123   281
m3540 1118      27    313 1278    89 299    53   78   198
m4045  926      13    250 1273    89 299    53   78   198
m4550  855       9    203 1381    71 347    68  103   190
m5055  684      14    136 1282    87 229    62   63   146
m5560  502       6     77  972    49 151    46   66    77
m6065  516       5     74 1249    83 162    62   92   122
m6570  513       8     31 1360    75 164    56  115    95
m7075  425       5     21 1268    90 121    44  119    82
m7580  266       4      9  866    63  78    30   79    34
m8085  159       2      2  479    39  18    18   46    19
m8590   70       1      0  259    16  10     9   18    10
m90p    18       0      1   76     4   2     4    6     2
w1015   28       0      3   20     0   1     0   10     6
w1520  353       2     11   81     6  15     2   43    47
w2025  540       4     20  111    24   9     9   78    67
w2530  454       6     27  125    33  26     7   86    75
w3035  530       2     29  178    42  14    20   92    78
w3540  688       5     44  272    64  24    14   98   110
w4045  566       4     24  343    76  18    22  103    86
w4550  716       6     24  447    94  13    21   95    88
w5055  942       7     26  691   184  21    37  129   131
w5560  723       3     14  527   163  14    30   92    92
w6065  820       8      8  702   245  11    35  140   114
w6570  740       8      4  785   271   4    38  156    90
w7075  624       6      4  610   244   1    27  129    46
w7580  495       8      1  420   161   2    29  129    35
w8085  292       3      2  223    78   0    10   84    23
w8590  113       4      0   83    14   0     6   34     2
w90p    24       1      0   19     4   0     2    7     0
> dim(Suicide)
[1] 34 9
> suicide.prop = Suicide/sum(Suicide)
> suicide.prop
         pois cookgas  toxgas    hang   drown     gun   knife    jump   other
m1015 7.5e-05 0.0e+00 0.0e+00 0.00465 1.9e-05 3.2e-04 1.9e-05 0.00011 1.7e-04
m1520 6.6e-03 1.3e-04 1.3e-03 0.01088 4.1e-04 3.4e-03 2.1e-04 0.00139 3.3e-03
m2025 1.5e-02 6.0e-04 4.3e-03 0.01316 8.3e-04 6.0e-03 6.6e-04 0.00205 5.4e-03
m2530 1.5e-02 4.9e-04 4.6e-03 0.01220 9.8e-04 5.0e-03 7.2e-04 0.00205 4.3e-03
m3035 1.7e-02 3.2e-04 4.8e-03 0.01553 1.4e-03 5.5e-03 9.8e-04 0.00232 5.3e-03
m3540 2.1e-02 5.1e-04 5.9e-03 0.02406 1.7e-03 5.6e-03 1.0e-03 0.00147 3.7e-03
m4045 1.7e-02 2.4e-04 4.7e-03 0.02397 1.7e-03 5.6e-03 1.0e-03 0.00147 3.7e-03
m4550 1.6e-02 1.7e-04 3.8e-03 0.02600 1.3e-03 6.5e-03 1.3e-03 0.00194 3.6e-03
m5055 1.3e-02 2.6e-04 2.6e-03 0.02414 1.6e-03 4.3e-03 1.2e-03 0.00119 2.7e-03
m5560 9.5e-03 1.1e-04 1.4e-03 0.01830 9.2e-04 2.8e-03 8.7e-04 0.00124 1.4e-03
m6065 9.7e-03 9.4e-05 1.4e-03 0.02352 1.6e-03 3.1e-03 1.2e-03 0.00173 2.3e-03
m6570 9.7e-03 1.5e-04 5.8e-04 0.02561 1.4e-03 3.1e-03 1.1e-03 0.00217 1.8e-03
m7075 8.0e-03 9.4e-05 4.0e-04 0.02388 1.7e-03 2.3e-03 8.3e-04 0.00224 1.5e-03
m7580 5.0e-03 7.5e-05 1.7e-04 0.01631 1.2e-03 1.5e-03 5.6e-04 0.00149 6.4e-04
m8085 3.0e-03 3.8e-05 3.8e-05 0.00902 7.3e-04 3.4e-04 3.4e-04 0.00087 3.6e-04
m8590 1.3e-03 1.9e-05 0.0e+00 0.00488 3.0e-04 1.9e-04 1.7e-04 0.00034 1.9e-04
m90p  3.4e-04 0.0e+00 1.9e-05 0.00143 7.5e-05 3.8e-05 7.5e-05 0.00011 3.8e-05
w1015 5.3e-04 0.0e+00 5.6e-05 0.00038 0.0e+00 1.9e-05 0.0e+00 0.00019 1.1e-04
w1520 6.6e-03 3.8e-05 2.1e-04 0.00153 1.1e-04 2.8e-04 3.8e-05 0.00081 8.9e-04
w2025 1.0e-02 7.5e-05 3.8e-04 0.00209 4.5e-04 1.7e-04 1.7e-04 0.00147 1.3e-03
w2530 8.5e-03 1.1e-04 5.1e-04 0.00235 6.2e-04 4.9e-04 1.3e-04 0.00162 1.4e-03
w3035 1.0e-02 3.8e-05 5.5e-04 0.00335 7.9e-04 2.6e-04 3.8e-04 0.00173 1.5e-03
w3540 1.3e-02 9.4e-05 8.3e-04 0.00512 1.2e-03 4.5e-04 2.6e-04 0.00185 2.1e-03
w4045 1.1e-02 7.5e-05 4.5e-04 0.00646 1.4e-03 3.4e-04 4.1e-04 0.00194 1.6e-03
w4550 1.3e-02 1.1e-04 4.5e-04 0.00842 1.8e-03 2.4e-04 4.0e-04 0.00179 1.7e-03
w5055 1.8e-02 1.3e-04 4.9e-04 0.01301 3.5e-03 4.0e-04 7.0e-04 0.00243 2.5e-03
w5560 1.4e-02 5.6e-05 2.6e-04 0.00992 3.1e-03 2.6e-04 5.6e-04 0.00173 1.7e-03
w6065 1.5e-02 1.5e-04 1.5e-04 0.01322 4.6e-03 2.1e-04 6.6e-04 0.00264 2.1e-03
w6570 1.4e-02 1.5e-04 7.5e-05 0.01478 5.1e-03 7.5e-05 7.2e-04 0.00294 1.7e-03
w7075 1.2e-02 1.1e-04 7.5e-05 0.01149 4.6e-03 1.9e-05 5.1e-04 0.00243 8.7e-04
w7580 9.3e-03 1.5e-04 1.9e-05 0.00791 3.0e-03 3.8e-05 5.5e-04 0.00243 6.6e-04
w8085 5.5e-03 5.6e-05 3.8e-05 0.00420 1.5e-03 0.0e+00 1.9e-04 0.00158 4.3e-04
w8590 2.1e-03 7.5e-05 0.0e+00 0.00156 2.6e-04 0.0e+00 1.1e-04 0.00064 3.8e-05
w90p  4.5e-04 1.9e-05 0.0e+00 0.00036 7.5e-05 0.0e+00 3.8e-05 0.00013 0.0e+00
> suicide.condprop = Suicide/apply(Suicide,1,sum)
> suicide.condprop
       pois cookgas   toxgas  hang   drown      gun   knife   jump   other
m1015 0.014 0.00000 0.000000 0.867 0.00351 0.059649 0.00351 0.0211 0.03158
m1520 0.238 0.00479 0.045859 0.396 0.01506 0.122519 0.00753 0.0507 0.11978
m2025 0.316 0.01250 0.089418 0.273 0.01718 0.123389 0.01367 0.0426 0.11285
m2530 0.329 0.01084 0.101292 0.270 0.02168 0.111713 0.01584 0.0454 0.09421
m3035 0.323 0.00599 0.090621 0.291 0.02609 0.102609 0.01834 0.0434 0.09908
m3540 0.324 0.00782 0.090646 0.370 0.02577 0.086591 0.01535 0.0226 0.05734
m4045 0.291 0.00409 0.078641 0.400 0.02800 0.094055 0.01667 0.0245 0.06228
m4550 0.265 0.00279 0.062907 0.428 0.02200 0.107530 0.02107 0.0319 0.05888
m5055 0.253 0.00518 0.050314 0.474 0.03219 0.084721 0.02294 0.0233 0.05401
m5560 0.258 0.00308 0.039568 0.499 0.02518 0.077595 0.02364 0.0339 0.03957
m6065 0.218 0.00211 0.031290 0.528 0.03510 0.068499 0.02622 0.0389 0.05159
m6570 0.212 0.00331 0.012826 0.563 0.03103 0.067853 0.02317 0.0476 0.03930
m7075 0.195 0.00230 0.009655 0.583 0.04138 0.055632 0.02023 0.0547 0.03770
m7580 0.186 0.00280 0.006298 0.606 0.04409 0.054584 0.02099 0.0553 0.02379
m8085 0.203 0.00256 0.002558 0.613 0.04987 0.023018 0.02302 0.0588 0.02430
m8590 0.178 0.00254 0.000000 0.659 0.04071 0.025445 0.02290 0.0458 0.02545
m90p  0.159 0.00000 0.008850 0.673 0.03540 0.017699 0.03540 0.0531 0.01770
w1015 0.412 0.00000 0.044118 0.294 0.00000 0.014706 0.00000 0.1471 0.08824
w1520 0.630 0.00357 0.019643 0.145 0.01071 0.026786 0.00357 0.0768 0.08393
w2025 0.626 0.00464 0.023202 0.129 0.02784 0.010441 0.01044 0.0905 0.07773
w2530 0.541 0.00715 0.032181 0.149 0.03933 0.030989 0.00834 0.1025 0.08939
w3035 0.538 0.00203 0.029442 0.181 0.04264 0.014213 0.02030 0.0934 0.07919
w3540 0.522 0.00379 0.033359 0.206 0.04852 0.018196 0.01061 0.0743 0.08340
w4045 0.456 0.00322 0.019324 0.276 0.06119 0.014493 0.01771 0.0829 0.06924
w4550 0.476 0.00399 0.015957 0.297 0.06250 0.008644 0.01396 0.0632 0.05851
w5055 0.435 0.00323 0.011993 0.319 0.08487 0.009686 0.01707 0.0595 0.06042
w5560 0.436 0.00181 0.008444 0.318 0.09831 0.008444 0.01809 0.0555 0.05549
w6065 0.394 0.00384 0.003841 0.337 0.11762 0.005281 0.01680 0.0672 0.05473
w6570 0.353 0.00382 0.001908 0.375 0.12929 0.001908 0.01813 0.0744 0.04294
w7075 0.369 0.00355 0.002365 0.361 0.14429 0.000591 0.01597 0.0763 0.02720
w7580 0.387 0.00625 0.000781 0.328 0.12578 0.001563 0.02266 0.1008 0.02734
w8085 0.408 0.00420 0.002797 0.312 0.10909 0.000000 0.01399 0.1175 0.03217
w8590 0.441 0.01562 0.000000 0.324 0.05469 0.000000 0.02344 0.1328 0.00781
w90p  0.421 0.01754 0.000000 0.333 0.07018 0.000000 0.03509 0.1228 0.00000
> suicide.rowprop = apply(suicide.prop,1,sum)
> suicide.rowprop
m1015 m1520 m2025 m2530 m3035 m3540 m4045 m4550 m5055 m5560 m6065 m6570 m7075 m7580 m8085 m8590
0.0054 0.0275 0.0482 0.0452 0.0534 0.0650 0.0599 0.0608 0.0509 0.0366 0.0445 0.0455 0.0410 0.0269 0.0147 0.0074
m90p w1015 w1520 w2025 w2530 w3035 w3540 w4045 w4550 w5055 w5560 w6065 w6570 w7075 w7580 w8085
0.0021 0.0013 0.0105 0.0162 0.0158 0.0185 0.0248 0.0234 0.0283 0.0408 0.0312 0.0392 0.0395 0.0318 0.0241 0.0135
 w8590   w90p
0.0048 0.0011
> barplot(suicide.rowprop,xlab="Gender-Age")
> suicide.colprop = apply(suicide.prop,2,sum)
> suicide.colprop
   pois cookgas  toxgas    hang   drown     gun   knife    jump   other
0.33075 0.00476 0.04056 0.38370 0.04992 0.05882 0.01791 0.05252 0.06107
> barplot(suicide.colprop,xlab="Method Utilized")
> suicide.mat = as.matrix(Suicide)
> mosaicplot(suicide.mat,color=T,main="Suicides in West Germany")
The mosaic plot above gives the breakdown of method chosen within each age/sex
category. As some of the age/sex categories have very few people in them, the results
are a bit hard to read. However, we can see some general trends with age and
differences in methods chosen across gender. We will now see how correspondence
analysis can be used to visualize the relationships between age/sex and method used.
Correspondence Analysis by Direct Computation
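The mathematics behind this method of summarizing a contingency table involves
performing an SVD of the “residuals” from the chi-square test of independence.
\[
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}
       = n\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(\hat{p}_{ij}-\hat{p}_{i\cdot}\,\hat{p}_{\cdot j})^2}{\hat{p}_{i\cdot}\,\hat{p}_{\cdot j}}
       \sim \chi^2_{(r-1)\times(c-1)}
\]
Here n is the total count in the table, so that O_ij = n p̂_ij and E_ij = n p̂_i· p̂_·j.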
> chisq.test(Suicide)
Pearson's Chi-squared test
data: Suicide
X-squared = 10061, df = 264, p-value < 2.2e-16
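The statistic reported by chisq.test() can be reproduced by hand from either counts or proportions. The sketch below does this on a small hypothetical 3 x 3 table; the counts and the names tab, PE, chisq.counts, chisq.props, and chisq.builtin are assumptions made for illustration, not part of the suicide analysis.

```r
# Hypothetical 3 x 3 table of counts (not the suicide data).
tab <- matrix(c(20, 15,  5,
                10, 25, 15,
                 5, 10, 30), nrow = 3, byrow = TRUE,
              dimnames = list(row = c("r1", "r2", "r3"),
                              col = c("c1", "c2", "c3")))

n  <- sum(tab)
P  <- tab / n              # joint proportions p_ij
r  <- rowSums(P)           # row proportions p_i.
cc <- colSums(P)           # column proportions p_.j
PE <- outer(r, cc)         # proportions expected under independence

chisq.counts  <- sum((tab - n * PE)^2 / (n * PE))   # sum of (O - E)^2 / E
chisq.props   <- n * sum((P - PE)^2 / PE)           # same statistic from proportions
chisq.builtin <- unname(chisq.test(tab)$statistic)  # R's own computation
```

All three values agree, which shows that the proportion form of the statistic carries a factor of n.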
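Here is the general idea:
In terms of proportions, the matrix C has elements
\[
c_{ij} = \frac{p_{ij} - p_{i\cdot}\,p_{\cdot j}}{\sqrt{p_{i\cdot}\,p_{\cdot j}}}
\]
so that c_ij is the signed square root of χ²_ij / n, where χ²_ij is the contribution
of cell (i, j) to the chi-square statistic and n is the total count.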
The dimension of C for the suicide data is 34 rows and 9 columns.
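The elementwise definition can also be written in matrix form. The notation here is an assumption introduced for this sketch rather than taken from the handout: P is the r x c matrix of proportions p_ij, the vectors r and c hold the row and column proportions, and D_r and D_c are the diagonal matrices built from them.

```latex
\[
C \;=\; D_r^{-1/2}\,\bigl(P - \mathbf{r}\,\mathbf{c}^{\mathsf T}\bigr)\,D_c^{-1/2},
\qquad
D_r = \operatorname{diag}(p_{1\cdot},\dots,p_{r\cdot}),
\quad
D_c = \operatorname{diag}(p_{\cdot 1},\dots,p_{\cdot c}).
\]
```

Under this reading, rdiag and cdiag in the R code correspond to D_r^{-1/2} and D_c^{-1/2}, and the product of the row and column proportion vectors gives the proportions expected under independence.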
In order to form the elements of C we need to form the following matrices in R:
> suicide.prop = as.matrix(Suicide/sum(Suicide))
> rdiag = diag(1/sqrt(suicide.rowprop))
> cdiag = diag(1/sqrt(suicide.colprop))
> suicide.rowtot = as.matrix(suicide.rowprop)   # row proportions as a 34 x 1 matrix
> suicide.coltot = as.matrix(suicide.colprop)   # column proportions as a 9 x 1 matrix
> E = rdiag%*%(suicide.prop - suicide.rowtot%*%t(suicide.coltot))%*%cdiag   # the matrix C above
> E
          [,1]     [,2]     [,3]    [,4]     [,5]     [,6]     [,7]     [,8]     [,9]
 [1,] -0.04034 -5.1e-03 -0.01475  0.0571 -1.5e-02  0.00025 -0.00788 -0.01006 -0.00874
 [2,] -0.02669  6.6e-05  0.00436  0.0032 -2.6e-02  0.04356 -0.01286 -0.00135  0.03941
 [3,] -0.00582  2.5e-02  0.05327 -0.0393 -3.2e-02  0.05846 -0.00696 -0.00954  0.04602
 [4,] -0.00069  1.9e-02  0.06409 -0.0390 -2.7e-02  0.04635 -0.00328 -0.00657  0.02850
 [5,] -0.00312  4.1e-03  0.05744 -0.0346 -2.5e-02  0.04172  0.00074 -0.00922  0.03555
 [6,] -0.00309  1.1e-02  0.06342 -0.0056 -2.8e-02  0.02919 -0.00487 -0.03330 -0.00384
 [7,] -0.01679 -2.4e-03  0.04626  0.0066 -2.4e-02  0.03554 -0.00226 -0.02987  0.00121
 [8,] -0.02820 -7.1e-03  0.02735  0.0176 -3.1e-02  0.04950  0.00583 -0.02216 -0.00218
 [9,] -0.03048  1.4e-03  0.01093  0.0330 -1.8e-02  0.02409  0.00848 -0.02876 -0.00644
[10,] -0.02423 -4.7e-03 -0.00094  0.0358 -2.1e-02  0.01481  0.00820 -0.01554 -0.01665
[11,] -0.04130 -8.1e-03 -0.00971  0.0492 -1.4e-02  0.00842  0.01310 -0.01254 -0.00810
[12,] -0.04396 -4.5e-03 -0.02938  0.0616 -1.8e-02  0.00794  0.00839 -0.00460 -0.01879
[13,] -0.04763 -7.2e-03 -0.03105  0.0651 -7.7e-03 -0.00266  0.00351  0.00194 -0.01913
[14,] -0.04124 -4.7e-03 -0.02791  0.0589 -4.3e-03 -0.00287  0.00378  0.00198 -0.02474
[15,] -0.02689 -3.9e-03 -0.02290  0.0448 -2.5e-05 -0.01791  0.00463  0.00334 -0.01806
[16,] -0.02283 -2.8e-03 -0.01732  0.0382 -3.5e-03 -0.01184  0.00321 -0.00252 -0.01240
[17,] -0.01375 -3.2e-03 -0.00726  0.0215 -3.0e-03 -0.00782  0.00603  0.00012 -0.00810
[18,]  0.00504 -2.5e-03  0.00063 -0.0052 -8.0e-03 -0.00651 -0.00479  0.01476  0.00393
[19,]  0.05350 -1.8e-03 -0.01067 -0.0396 -1.8e-02 -0.01356 -0.01100  0.01087  0.00950
[20,]  0.06551 -2.3e-04 -0.01098 -0.0524 -1.3e-02 -0.02542 -0.00711  0.02111  0.00859
[21,]  0.04598  4.3e-03 -0.00523 -0.0476 -6.0e-03 -0.01443 -0.00898  0.02742  0.01441
[22,]  0.04910 -5.4e-03 -0.00752 -0.0446 -4.4e-03 -0.02505  0.00244  0.02430  0.00999
[23,]  0.05230 -2.2e-03 -0.00563 -0.0452 -9.9e-04 -0.02640 -0.00859  0.01498  0.01424
[24,]  0.03323 -3.4e-03 -0.01613 -0.0265  7.7e-03 -0.02795 -0.00022  0.02030  0.00506
[25,]  0.04252 -1.9e-03 -0.02056 -0.0235  9.5e-03 -0.03482 -0.00496  0.00782 -0.00174
[26,]  0.03645 -4.5e-03 -0.02866 -0.0212  3.2e-02 -0.04093 -0.00127  0.00616 -0.00052
[27,]  0.03236 -7.6e-03 -0.02818 -0.0188  3.8e-02 -0.03670  0.00025  0.00229 -0.00399
[28,]  0.02167 -2.6e-03 -0.03611 -0.0149  6.0e-02 -0.04372 -0.00163  0.01270 -0.00508
[29,]  0.00771 -2.7e-03 -0.03813 -0.0029  7.1e-02 -0.04662  0.00033  0.01899 -0.01457
[30,]  0.01187 -3.1e-03 -0.03384 -0.0066  7.5e-02 -0.04284 -0.00259  0.01851 -0.02445
[31,]  0.01511  3.3e-03 -0.03066 -0.0139  5.3e-02 -0.03665  0.00551  0.03270 -0.02119
[32,]  0.01567 -9.6e-04 -0.02176 -0.0135  3.1e-02 -0.02814 -0.00340  0.03289 -0.01357
[33,]  0.01336  1.1e-02 -0.01398 -0.0067  1.5e-03 -0.01684  0.00287  0.02433 -0.01496
[34,]  0.00514  6.1e-03 -0.00660 -0.0027  3.0e-03 -0.00795  0.00421  0.01005 -0.00810
> suicide.sva = svd(E)
> suicide.sva
$d
[1] 3.1e-01 2.7e-01 1.0e-01 7.1e-02 5.1e-02 3.1e-02 2.6e-02 2.4e-02 4.9e-17
$u
        [,1]   [,2]    [,3]    [,4]    [,5]     [,6]    [,7]    [,8]   [,9]
 [1,] -0.138 -0.214 -0.1394  0.0413 -0.1164 -0.43758  0.1539  0.2091  0.290
 [2,] -0.163  0.079 -0.0757  0.5724 -0.3123  0.00765  0.1566  0.2196 -0.182
 [3,] -0.179  0.329  0.1278  0.3399  0.0012 -0.08003  0.2709 -0.3765  0.051
 [4,] -0.151  0.313  0.1747  0.0864  0.2368 -0.13647 -0.0063 -0.1096  0.035
 [5,] -0.144  0.283  0.1479  0.1202  0.0014 -0.06274 -0.4603  0.0439  0.064
 [6,] -0.182  0.173  0.1457 -0.4865  0.1409 -0.31954  0.2057  0.0637 -0.247
 [7,] -0.201  0.098  0.1167 -0.2722 -0.0406 -0.02373 -0.0313  0.3168 -0.016
 [8,] -0.235  0.034  0.0105 -0.0738  0.0104  0.52413 -0.0240  0.2999  0.066
 [9,] -0.185 -0.072  0.0240 -0.1595 -0.1209  0.14021  0.0813 -0.2833 -0.134
[10,] -0.141 -0.111 -0.1105 -0.1643  0.0372  0.23640  0.0291 -0.0318  0.357
[11,] -0.152 -0.187 -0.0756  0.0139 -0.0703  0.06487 -0.3393 -0.1901 -0.237
[12,] -0.144 -0.262 -0.2032  0.0911  0.0372  0.17603  0.1047 -0.1379 -0.124
[13,] -0.114 -0.298 -0.1514  0.1382  0.0593 -0.09030 -0.0064  0.0797 -0.282
[14,] -0.095 -0.278 -0.1155  0.0606  0.1386 -0.00925  0.1068  0.0786  0.081
[15,] -0.032 -0.226 -0.1065 -0.0092  0.0921 -0.25406 -0.1270 -0.0768  0.036
[16,] -0.046 -0.178 -0.0979 -0.0266  0.0011 -0.22181 -0.0604 -0.1129  0.031
[17,] -0.028 -0.100 -0.0556 -0.0350  0.0547 -0.09727 -0.2136 -0.0957 -0.042
[18,]  0.026  0.026 -0.1106  0.0907  0.1380 -0.20771 -0.1640  0.2407 -0.040
[19,]  0.147  0.169 -0.3294 -0.0714 -0.1034  0.10104  0.2820  0.1321 -0.019
[20,]  0.216  0.197 -0.3326 -0.0854  0.0294  0.07245  0.1003  0.0221 -0.180
[21,]  0.169  0.181 -0.1998  0.1374  0.1330 -0.02952  0.0893  0.0823 -0.026
[22,]  0.190  0.151 -0.2253 -0.0026  0.0818  0.09283 -0.3897 -0.0101 -0.061
[23,]  0.192  0.166 -0.1858 -0.0445 -0.1028 -0.15594 -0.0633  0.0689 -0.114
[24,]  0.173  0.053 -0.1347  0.0404  0.0177 -0.04242 -0.2173 -0.0313  0.211
[25,]  0.193  0.038 -0.1544 -0.1552 -0.1358 -0.09728  0.1162 -0.0831  0.172
[26,]  0.227 -0.017  0.0018 -0.0915 -0.2722 -0.04405 -0.0122 -0.1679 -0.310
[27,]  0.217 -0.035  0.0771 -0.1144 -0.3073  0.10523 -0.0171 -0.0742  0.086
[28,]  0.254 -0.099  0.2271  0.0607 -0.2209 -0.04252  0.0378 -0.0956  0.410
[29,]  0.247 -0.176  0.3165  0.0995 -0.0545 -0.08752 -0.0331  0.0027 -0.180
[30,]  0.255 -0.166  0.3736  0.0043  0.0413  0.05583  0.1614  0.2438 -0.229
[31,]  0.235 -0.109  0.1958  0.1062  0.3374  0.16690  0.0060 -0.0443 -0.037
[32,]  0.182 -0.056  0.0486  0.1305  0.3165 -0.00084  0.0045  0.3027  0.108
[33,]  0.096 -0.023 -0.1170  0.0262  0.4368  0.01442  0.1944 -0.2209 -0.035
[34,]  0.045 -0.017 -0.0291 -0.0017  0.2091  0.05107  0.0649 -0.2030  0.078
$v
       [,1]   [,2]   [,3]  [,4]   [,5]   [,6]   [,7]   [,8]  [,9]
 [1,]  0.486  0.380 -0.290 -0.37 -0.115  0.154  0.180  0.023 0.575
 [2,] -0.026  0.097  0.119  0.10  0.322 -0.275  0.568 -0.677 0.069
 [3,] -0.346  0.458  0.383 -0.36  0.314 -0.376 -0.295  0.167 0.201
 [4,] -0.359 -0.633 -0.153 -0.10 -0.054 -0.221  0.026  0.031 0.619
 [5,]  0.416 -0.242  0.824  0.10 -0.131  0.076  0.042  0.073 0.223
 [6,] -0.509  0.252  0.144  0.28  0.013  0.620  0.325  0.173 0.243
 [7,] -0.030 -0.075  0.037 -0.11  0.157  0.470 -0.584 -0.613 0.134
 [8,]  0.278 -0.016 -0.175  0.55  0.669 -0.043 -0.185  0.241 0.229
 [9,] -0.060  0.329 -0.032  0.56 -0.541 -0.313 -0.273 -0.207 0.247
> delta = suicide.sva$d     # singular values
> U = suicide.sva$u[,1:2]
> V = suicide.sva$v[,1:2]
> U[,1] = delta[1]*U[,1]/sqrt(suicide.rowtot)
> U[,2] = delta[2]*U[,2]/sqrt(suicide.rowtot)
> V[,1] = delta[1]*V[,1]/sqrt(suicide.coltot)
> V[,2] = delta[2]*V[,2]/sqrt(suicide.coltot)
> CA = rbind(U,V)
> inertia = sum(delta^2)
> inertia
[1] 0.19
> per1 = delta[1]^2/inertia
> per1
[1] 0.52
> per2 = delta[2]^2/inertia
> per2
[1] 0.38
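The total inertia is tied to the chi-square statistic: it equals X-squared divided by the table total n, which is how the 0.19 above relates to X-squared = 10061. A minimal sketch of this check on a hypothetical 3 x 3 table follows; the counts and the names tab, C, pct, and chisq are assumptions made for illustration.

```r
# Hypothetical 3 x 3 table of counts (not the suicide data).
tab <- matrix(c(20, 15,  5,
                10, 25, 15,
                 5, 10, 30), nrow = 3, byrow = TRUE)

n  <- sum(tab)
P  <- tab / n
r  <- rowSums(P)
cc <- colSums(P)

# Scaled residual matrix, built the same way as E in the handout.
C <- diag(1/sqrt(r)) %*% (P - outer(r, cc)) %*% diag(1/sqrt(cc))

delta   <- svd(C)$d                 # singular values
inertia <- sum(delta^2)             # total inertia
pct     <- 100 * delta^2 / inertia  # percent of inertia per axis

chisq <- unname(chisq.test(tab)$statistic)
```

Here inertia equals chisq / n, and the last singular value is zero up to rounding error, which matches the 4.9e-17 seen in the suicide output.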
> options(digits=5)
> plot(CA[,1],CA[,2],type="n",xlab = paste("coord 1 - % inertia =",format(per1*100)),
+      ylab = paste("coord 2 - % inertia =",format(per2*100)))
> text(CA[,1],CA[,2],labels=c(dimnames(Suicide)[[1]],dimnames(Suicide)[[2]]))
> abline(h=0,lty=2)
> abline(v=0,lty=2)
> title(main="Correspondence Analysis for West German Suicides (1974 - 1977)")
General Function for Conducting Correspondence Analysis (corresp)
Here is a function that performs all of the above operations given a two-way contingency table as
input. You can change the title to whatever you would like it to be.
> corresp = function(x,title="Correspondence Analysis") {
x.prop <- as.matrix(x/sum(x))
x.row <- apply(x.prop,1,sum)
x.col <- apply(x.prop,2,sum)
rdiag <- diag(1/sqrt(x.row))
cdiag <- diag(1/sqrt(x.col))
x.row <- as.matrix(x.row)
x.col <- as.matrix(x.col)
E <- rdiag%*%(x.prop - x.row%*%t(x.col))%*%cdiag
x.sva <- svd(E)
delta <- x.sva$d
U <- x.sva$u[,1:2]
V <- x.sva$v[,1:2]
U[,1] <- delta[1]*U[,1]/sqrt(x.row)
U[,2] <- delta[2]*U[,2]/sqrt(x.row)
V[,1] <- delta[1]*V[,1]/sqrt(x.col)
V[,2] <- delta[2]*V[,2]/sqrt(x.col)
U <- rbind(U,V)
inertia <- sum(delta[delta>0]*delta[delta>0])
per1 <- (delta[1]*delta[1]/inertia)*100
per2 <- (delta[2]*delta[2]/inertia)*100
dim1 <- dim(x)[1]
ds <- as.integer(dim1+1)
dim2 <- dim(x)[2]
dt <- dim1 + dim2
plot(U[,1],U[,2],type="n",xlab=paste("coord 1 - % inertia =",format(per1)),
ylab=paste("coord 2 - % inertia =",format(per2)))
text(U[1:dim1,1],U[1:dim1,2],labels=dimnames(x)[[1]],col=2)
text(U[ds:dt,1],U[ds:dt,2],labels=dimnames(x)[[2]],col=4)
abline(h=0,v=0,lty=2)
title(title)
}
> corresp(Suicide,title="Correspondence Analysis of Suicides in West Germany (1974-1977)")
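One way to sanity-check a hand-rolled computation like this is to compare its singular values with an existing implementation. The sketch below assumes the MASS package (distributed with R) and its caith table of eye and hair colour counts; it repeats the E and delta steps inline rather than calling the function above, and the names tab and fit are illustrative. MASS also has a function called corresp, so the call is written as MASS::corresp to avoid confusion with the function defined in this handout.

```r
library(MASS)                      # provides corresp() and the caith data

tab <- as.matrix(caith)            # 4 x 5 table: eye colour by hair colour
P   <- tab / sum(tab)
r   <- rowSums(P)
cc  <- colSums(P)

# Same scaled residual matrix and SVD as in the handout's function.
E     <- diag(1/sqrt(r)) %*% (P - outer(r, cc)) %*% diag(1/sqrt(cc))
delta <- svd(E)$d

# MASS's implementation, keeping two dimensions.
fit <- MASS::corresp(tab, nf = 2)

# MASS reports the singular values as "canonical correlations".
cbind(handout = delta[1:2], MASS = fit$cor)
```

The two columns should agree. The row and column scores from MASS are scaled differently from the coordinates plotted by the function above, so plots from the two need not look identical.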
Correspondence Analysis in JMP
Right-click on the cells in the mosaic plot and select Cell Labeling > Show Percents.
SVD details
3-D Correspondence Analysis - uses JMP 12 (not released yet)
Multiple Correspondence Analysis (MCA) in R (will be available in JMP 12)
Not all tables are two-dimensional. As a simple example, consider the following data
dealing with graduate school admissions at the University of California, Berkeley. The
categorical variables are the gender of the applicant, the program they are applying to, and
whether or not they were admitted to the program. Thus we have three
categorical/nominal variables to consider in our analysis, which can be stored in a data array.
> UCBAdmissions
, , Dept = A

          Gender
Admit      Male Female
  Admitted  512     89
  Rejected  313     19

, , Dept = B

          Gender
Admit      Male Female
  Admitted  353     17
  Rejected  207      8

, , Dept = C

          Gender
Admit      Male Female
  Admitted  120    202
  Rejected  205    391

, , Dept = D

          Gender
Admit      Male Female
  Admitted  138    131
  Rejected  279    244

, , Dept = E

          Gender
Admit      Male Female
  Admitted   53     94
  Rejected  138    299

, , Dept = F

          Gender
Admit      Male Female
  Admitted   22     24
  Rejected  351    317

Sexual discrimination at UCB?
It appears that there is gender discrimination in graduate school admissions at UCB.
However, there is a third dimension that is being ignored here, namely the programs
(departments) students are applying to.
If we take the department applied to into account, this is what we see.
Conditioning on the program applied to, it is not clear there is gender discrimination. A
higher percentage of female applicants are admitted in 4 of the 6 programs! This is an
example of what is commonly referred to as Simpson’s Paradox. Multiple
correspondence analysis will allow us to construct a 2-D display, which represents a
dimension reduction, showing the relationship between the three categorical variables
in these data.
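The reversal can be checked numerically from the UCBAdmissions array that ships with R. This is a minimal sketch; the names overall, overall.rate, by.dept, and n.female.higher are introduced here for illustration.

```r
# UCBAdmissions is a 2 x 2 x 6 table: Admit x Gender x Dept.
overall      <- margin.table(UCBAdmissions, c(1, 2))      # collapse over Dept
overall.rate <- prop.table(overall, 2)["Admitted", ]      # admission rate by gender

# Admission rate within each department, by gender (2 x 6 matrix).
by.dept <- prop.table(UCBAdmissions, c(2, 3))["Admitted", , ]

round(overall.rate, 3)
round(by.dept, 3)

# Number of departments in which women are admitted at a higher rate than men.
n.female.higher <- sum(by.dept["Female", ] > by.dept["Male", ])
```

Collapsed over departments, men have the higher admission rate, yet n.female.higher counts four departments where women do better, which is the pattern described above.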
> library(ca)     # provides mjca()
> ucb.mca = mjca(UCBAdmissions)
> summary(ucb.mca)
Principal inertias (eigenvalues):

 dim    value      %    cum%   scree plot
 1      0.114945  80.5  80.5   ************************
 2      0.005694   4.0  84.5   *
 3      00000000   0.0  84.5
 4      00000000   0.0  84.5
 5      00000000   0.0  84.5
        -------- -----
 Total: 0.142840

Columns:
               name    mass  qlt  inr    k=1 cor ctr    k=2 cor ctr
1  | Admit:Admitted |  129  911   93 |  365 875 150 |   74  36 123 |
2  | Admit:Rejected |  204  911   59 | -231 875  95 |  -47  36  78 |
3  |  Gender:Female |  135  863   95 | -399 845 187 |   59  19  84 |
4  |    Gender:Male |  198  863   65 |  272 845 127 |  -40  19  57 |
5  |         Dept:A |   69  838  117 |  512 837 156 |   13   1   2 |
6  |         Dept:B |   43  829  124 |  573 824 123 |  -45   5  15 |
7  |         Dept:C |   68  731  108 | -270 594  43 |  130 137 199 |
8  |         Dept:D |   58  832  106 | -110 828   6 |    7   3   0 |
9  |         Dept:E |   43  812  117 | -384 787  55 |   69  25  35 |
10 |         Dept:F |   53  737  116 | -355 547  58 | -210 190 406 |
How does this plot help explain the apparent admission bias?
More examples of MCA
These data come from an international survey on attitudes towards science. They consist of
four Likert-scale items, A-D (1 = strongly disagree, ..., 5 = strongly agree), and demographic
information for the respondents (sex, age, and education level).
> head(wg93) # this data set is in the ca library; notice there is one line for each respondent!
  A B C D sex age edu
1 2 3 4 3   2   2   3
2 3 4 2 3   1   3   4
3 2 3 2 4   2   3   2
4 2 2 2 2   1   2   3
5 3 3 3 3   1   5   2
6 3 4 4 5   1   3   2
> ?wg93
> dim(wg93)
[1] 871   7
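MCA is often described as correspondence analysis applied to an indicator matrix (one 0/1 column per category) or to the Burt matrix formed from it. The sketch below only illustrates those two matrices on a made-up data frame with one row per respondent; it is not a description of what mjca() does internally, and the names toy, Z, and B are hypothetical.

```r
# Toy respondent-level data (hypothetical): one row per respondent.
toy <- data.frame(A   = factor(c(1, 2, 2, 3, 1, 3)),
                  sex = factor(c("f", "m", "f", "m", "m", "f")))

# Indicator matrix: one 0/1 column per category of each variable.
Z <- cbind(model.matrix(~ A - 1, toy),
           model.matrix(~ sex - 1, toy))

# Burt matrix: every two-way cross-tabulation arranged in one symmetric table.
B <- t(Z) %*% Z

Z
B
```

Each row of Z has exactly one 1 per variable, the diagonal blocks of B hold the category counts, and the off-diagonal block B[1:3, 4:5] is the ordinary cross-tabulation of A by sex.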
> wg93.mca = mjca(wg93)
> plot(wg93.mca)
> summary(wg93.mca)
Principal inertias (eigenvalues):

 dim    value      %    cum%   scree plot
 1      0.028924  36.5  36.5   ************
 2      0.017038  21.5  58.1   *******
 3      0.005828   7.4  65.4   **
 4      0.004047   5.1  70.6   **
 5      0.002142   2.7  73.3   *
 6      0.000845   1.1  74.3
 7      0.000676   0.9  75.2
 8      0.000383   0.5  75.7
 9      0.000283   0.4  76.0
 10     8.8e-050   0.1  76.1
 11     2.9e-050   0.0  76.2
 12     1e-06000   0.0  76.2
        -------- -----
 Total: 0.079144
Columns:
      name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr
1  |   A:1 |   20  822   33 |  306 554  63 |  213 268  52 |
2  |   A:2 |   53  639   23 |  135 627  33 |  -18  12   1 |
3  |   A:3 |   33  719   29 |  -57  46   4 | -218 673  94 |
4  |   A:4 |   29  658   30 | -243 622  60 |   59  36   6 |
5  |   A:5 |    8  727   37 | -519 540  73 |  306 188  43 |
6  |   B:1 |   12  768   38 |  464 480  87 |  360 288  88 |
7  |   B:2 |   29  683   29 |  201 576  40 |  -87 107  13 |
8  |   B:3 |   34  648   27 |  104 247  13 | -133 401  35 |
9  |   B:4 |   46  564   25 | -144 465  33 |  -66  99  12 |
10 |   B:5 |   23  686   36 | -350 450  97 |  253 236  86 |
11 |   C:1 |   25  754   36 |  358 494 110 |  260 260  99 |
12 |   C:2 |   52  569   23 |   93 281  15 |  -94 289  27 |
13 |   C:3 |   32  566   29 |  -93 111  10 | -189 455  67 |
14 |   C:4 |   25  657   32 | -285 640  71 |   47  17   3 |
15 |   C:5 |    9  686   37 | -414 365  51 |  389 321  76 |
16 |   D:1 |   10  656   34 |  169 132  10 |  337 524  65 |
17 |   D:2 |   38   85   26 |  -21  14   1 |  -48  71   5 |
18 |   D:3 |   33  504   29 |  -15   3   0 | -189 501  70 |
19 |   D:4 |   37   74   26 |  -31  42   1 |  -27  32   2 |
20 |   D:5 |   25  644   30 |   32  12   1 |  234 632  80 |
21 | sex:1 |   70  625   18 |  -85 501  17 |  -42 124   7 |
22 | sex:2 |   73  625   18 |   82 501  17 |   41 124   7 |
23 | age:1 |   15  108   33 | -138 101  10 |   36   7   1 |
24 | age:2 |   34  364   27 | -104 281  13 |   57  84   7 |
25 | age:3 |   26   55   28 |  -43  54   2 |   -5   1   0 |
26 | age:4 |   24  125   28 |  -18  10   0 |  -60 115   5 |
27 | age:5 |   20  166   30 |   69  96   3 |  -58  69   4 |
28 | age:6 |   23  611   31 |  249 610  50 |   11   1   0 |
29 | edu:1 |    6   64   34 |  131  62   4 |   21   2   0 |
30 | edu:2 |   62  586   22 |  148 557  47 |  -34  29   4 |
31 | edu:3 |   40  255   25 |  -54 120   4 |  -57 136   8 |
32 | edu:4 |   15  386   34 | -237 266  30 |  160 121  23 |
33 | edu:5 |    8  471   32 | -204 374  12 |  104  97   5 |
34 | edu:6 |   11  500   32 | -222 441  19 |   81  60   4 |
> attributes(wg93.mca)
$names
 [1] "sv"         "lambda"     "inertia.e"  "inertia.t"  "inertia.et" "levelnames" "factors"
 [8] "levels.n"   "nd"         "nd.max"     "rownames"   "rowmass"    "rowdist"    "rowinertia"
[15] "rowcoord"   "rowpcoord"  "rowctr"     "rowcor"     "colnames"   "colmass"    "coldist"
[22] "colinertia" "colcoord"   "colpcoord"  "colctr"     "colcor"     "colsup"     "subsetcol"
[29] "Burt"       "Burt.upd"   "subinertia" "JCA.iter"   "indmat"     "call"
$class
[1] "mjca"
> library(rgl)     # provides plot3d() and text3d()
> plot3d(wg93.mca$rowcoord[,1:3],type="n",xlab="Dim1",ylab="Dim2",zlab="Dim3")
> text3d(wg93.mca$rowcoord[,1:3],texts=wg93.mca$levelnames,col="blue")