slides - CARME 2011

UCLA Department of Statistics
History and Theory of Nonlinear Principal Component Analysis
Jan de Leeuw
February 11, 2011
Abstract
Relationships between Multiple Correspondence Analysis (MCA) and
Nonlinear Principal Component Analysis (NLPCA), which is defined as
PCA with Optimal Scaling (OS), are discussed. We review the history of
NLPCA.
We discuss forms of NLPCA that have been proposed over the years:
Shepard-Kruskal-Breiman-Friedman-Gifi PCA with optimal scaling,
Aspect Analysis of correlations,
Guttman’s MSA,
Logit and Probit PCA of binary data, and
Logistic Homogeneity Analysis.
Since I am trying to summarize 40+ years of work, the presentation will
be rather dense.
Linear PCA
History
(Linear) Principal Components Analysis (PCA) is sometimes attributed
to Hotelling (1933), but that is surely incorrect.
The equations for the principal axes of quadratic forms and surfaces, in
various forms, were known from classical analytic geometry (notably
from work by Cauchy and Jacobi in the mid 19th century).
There are some modest beginnings in Galton’s Natural Inheritance of
1889, where the principal axes are connected for the first time with the
“correlation ellipsoid”.
There is a full-fledged (although tedious) discussion of the technique in
Pearson (1901), and there is a complete application (7 physical traits of
3000 criminals) in MacDonell (1902), by a Pearson co-worker.
There is proper attribution in: Burt, C., “Alternative Methods of Factor Analysis and their Relations to Pearson’s Method of ‘Principal Axes’”, British Journal of Psychology, Statistical Section, 2 (1949), pp. 98-121.
Linear PCA
How To
Hotelling’s introduction of PCA follows the now familiar route of making successive orthogonal linear combinations with maximum variance. He does this by using power iterations (without reference), discussed in 1929 by von Mises and Pollaczek-Geiringer.
Pearson, following Galton, used the correlation ellipsoid throughout.
This seems to me the more basic approach.
He cast the problem in terms of finding low-dimensional subspaces (lines and planes) of best (least squares) fit to a cloud of points, and connected the solution to the principal axes of the correlation ellipsoid.
In modern notation, this means minimizing SSQ(Y − XB') over n × r matrices X and m × r matrices B. For r = 1 this is the best line, etc.
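To make Pearson's formulation concrete, here is a minimal numpy sketch (mine, not part of the original slides): the best rank-r least squares fit XB' to a centered data matrix Y comes from the truncated singular value decomposition, and B spans the same subspace as the leading eigenvectors of Y'Y, the principal axes of the correlation ellipsoid.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 5))
Y = Y - Y.mean(axis=0)                 # column-center the data

r = 2                                  # dimensionality of the fitted subspace
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
X = U[:, :r] * s[:r]                   # n x r component scores
B = Vt[:r].T                           # m x r loadings (principal axes)
loss = np.sum((Y - X @ B.T) ** 2)      # SSQ(Y - XB'), minimized over X and B

# the same axes appear as eigenvectors of Y'Y, i.e. of the correlation
# ellipsoid when the columns of Y are standardized
evals, evecs = np.linalg.eigh(Y.T @ Y)
```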
Correspondence Analysis
History
Simple Correspondence Analysis (CA) of a bivariate frequency table
was first discussed, in fairly rudimentary form, by Pearson (1905), by
looking at transformations linearizing regressions. See De Leeuw, On
the Prehistory of Correspondence Analysis, Statistica Neerlandica, 37,
1983, 161–164.
This was taken up by Hirshfeld (Hartley) in 1935, where the technique
was presented in a fairly complete form (to maximize correlation and
decompose contingency). This approach was later adopted by
Gebelein, and by Renyi and his students in their study of maximal
correlation.
Correspondence Analysis
History
In the 1938 edition of Statistical Methods for Research Workers Fisher
scores a categorical variable to maximize a ratio of variances (quadratic
forms). This is not quite CA, because it is presented in an (asymmetric)
regression context.
Symmetric CA and the reciprocal averaging algorithm are discussed,
however, in Fisher (1940) and applied by his co-worker Maung
(1941a,b).
In the early sixties the chi-square metric, relating CA to metric
multidimensional scaling (MDS), with an emphasis on geometry and
plotting, was introduced by Benzécri (thesis of Cordier, 1965).
Multiple Correspondence Analysis
History
Different weighting schemes to combine quantitative variables into an index that optimizes some variance-based discrimination or homogeneity criterion were proposed in the late thirties by Horst (1936), by Edgerton and Kolbe (1936), and by Wilks (1938).
The same idea was applied to qualitative (categorical) variables in a seminal paper by Guttman (1941), which presents, for the first time, the equations defining Multiple Correspondence Analysis (MCA).
The equations are presented in the form of a row-eigen (scores), a
column-eigen (weights), and a singular value (joint) problem.
The paper introduces the “codage disjonctif complet” (complete disjunctive coding, i.e. the indicator matrix) as well as the “Tableau de Burt” (the Burt table), and points out the connections with the chi-square metric.
There is no geometry, and the emphasis is on constructing a single
scale. In fact Guttman warns against extracting and using additional
eigen-pairs.
Multiple Correspondence Analysis
Further History
In Guttman (1946) scale or index construction was extended to paired
comparisons and ranks. In Guttman (1950) it was extended to scalable
binary items.
In the fifties and sixties Hayashi introduced the quantification techniques
of Guttman in Japan, where they were widely disseminated through the
work of Nishisato. Various extensions and variations were added by the
Japanese school.
Starting in 1968, MCA was studied as a form of metric MDS by De
Leeuw.
Although the equations defining MCA were the same as those defining
PCA, the relationship between the two remained problematic.
These problems are compounded by “horseshoes” or the “Guttman effect” (effet Guttman), i.e. artificial curvilinear relationships between successive dimensions (eigenvectors).
Nonlinear PCA
What ?
PCA can be made non-linear in various ways.
1. First, we could seek indices which discriminate maximally and are non-linear combinations of variables. This generalizes the weighting approach (Hotelling).
2. Second, we could find nonlinear combinations of components that are close to the observed variables. This generalizes the reduced rank approach (Pearson).
3. Third, we could look for transformations of the variables that optimize the linear PCA fit. This is known (a term due to Darrell Bock) as the optimal scaling (OS) approach.
Nonlinear PCA
Forms
The first approach has not been studied much, although there are some
relations with Item Response Theory.
The second approach is currently popular in Computer Science, as “nonlinear dimension reduction”. I am currently working on a polynomial version, but there is no unified theory, and the papers are usually of the “well, we could also do this” type familiar from cluster analysis.
The third approach preserves many of the properties of linear PCA and
can be connected with MCA as well. We shall follow its history and
discuss the main results.
Nonlinear PCA
PCA with OS
Guttman observed in 1959 that if we require the regressions between monotonically transformed variables to be linear, then the transformations are uniquely defined. In general, however, we need approximations.
The loss function for PCA-OS is SSQ(Y − XB'), as before, but now we minimize over components X, loadings B, and transformations Y.
Transformations are defined column-wise (over variables) and belong to
some restricted class (monotone, step, polynomial, spline).
Algorithms often are of the alternating least squares type, where optimal
transformation and low-rank matrix approximation are alternated until
convergence.
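A minimal sketch of such an alternating least squares scheme, assuming ordinal (monotone) transformations and using scikit-learn's IsotonicRegression for the transformation step. The function name pca_os and the normalization (centered transformations with unit sum of squares) are my own choices; production programs such as PRINCIPALS or PRINCALS handle normalization and convergence far more carefully.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def standardize(y):
    # center a transformation and scale it to unit sum of squares
    y = y - y.mean()
    return y / np.linalg.norm(y)

def pca_os(Z, r=2, n_iter=100):
    # Z: n x m raw data matrix; monotone (ordinal) transformations, rank r
    n, m = Z.shape
    Y = np.column_stack([standardize(Z[:, j].astype(float)) for j in range(m)])
    for _ in range(n_iter):
        # model step: best rank-r approximation XB' of the current Y
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        F = (U[:, :r] * s[:r]) @ Vt[:r]
        # scaling step: monotone regression of each fitted column on Z_j,
        # followed by renormalization of the transformation
        for j in range(m):
            t = IsotonicRegression().fit_transform(Z[:, j], F[:, j])
            if t.std() > 0:
                Y[:, j] = standardize(t)
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return Y, U[:, :r] * s[:r], Vt[:r].T   # transformations, scores, loadings
```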
PCA-OS
History of programs
Shepard and Kruskal used the monotone regression machinery of
non-metric MDS to construct the first PCA-OS programs around 1962.
The paper describing the technique was not published until 1975.
Around 1970 versions of PCA-OS (sometimes based on Guttman’s rank
image principle) were developed by Lingoes and Roskam.
In 1973 De Leeuw, Young, and Takane started the ALSOS project, which resulted in PRINCIPALS (published in 1978) and in PRINQUAL in SAS.
In 1980 De Leeuw (with Heiser, Meulman, Van Rijckevorsel, and many
others) started the Gifi project, which resulted in PRINCALS, in SPSS
CATPCA, and in the R package homals by De Leeuw and Mair (2009).
In 1983 Winsberg and Ramsay published a PCA-OS version using
monotone spline transformations.
In 1987 Koyak, using the ACE smoothing methodology of Breiman and
Friedman (1985), introduced mdrace.
PCA/MCA
The Gifi Project
The Gifi project followed the ALSOS project. It has or had as its explicit goals:
1. Unify a large class of multivariate analysis methods by combining a single loss function, parameter constraints (as in MDS), and ALS algorithms.
2. Give a very general definition of component analysis (to be called homogeneity analysis) that would cover CA, MCA, linear PCA, nonlinear PCA, regression, discriminant analysis, and canonical analysis.
3. Write code and analyze examples for homogeneity analysis.
Gifi
Loss of Homogeneity
The basic Gifi loss function is

\sigma(X, Y) = \sum_{j=1}^{m} \mathrm{SSQ}(X - G_j Y_j).

The n × k_j matrices G_j are the data, coded as indicator matrices (or dummies). Alternatively, G_j can be a B-spline basis. Also, G_j can have zero rows for missing data.
X is an n × p matrix of object scores, satisfying the normalization conditions X'X = I.
The Y_j are k_j × p matrices of category quantifications. There can be rank, level, and additivity constraints on the Y_j.
Gifi
ALS
The basic Gifi algorithm alternates

1. X^{(k)} = \mathrm{ORTH}\bigl(\sum_{j=1}^{m} G_j Y_j^{(k)}\bigr),
2. Y_j^{(k+1)} = \mathrm{argmin}_{Y_j \in \mathcal{Y}_j}\ \mathrm{tr}\,\bigl(\hat{Y}_j^{(k+1)} - Y_j\bigr)' D_j \bigl(\hat{Y}_j^{(k+1)} - Y_j\bigr).

We use the following notation.
Superscript (k) is used for iterations.
ORTH() is any orthogonalization method, such as QR, Gram-Schmidt, or the SVD.
D_j = G_j'G_j are the marginals.
Ŷ_j^(k+1) = D_j^{-1} G_j' X^(k) are the category centroids.
The constraints on Y_j are written as Y_j ∈ 𝒴_j.
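A minimal numpy sketch of these two updates for the unconstrained (multiple nominal) case, i.e. MCA. The centering inside the orthogonalization, which removes the trivial constant solution, and the convention X'X = I are choices of this sketch, not necessarily those of PRINCALS or homals.

```python
import numpy as np

def indicator(col):
    # n x k_j indicator ("dummy") matrix G_j for one categorical variable
    cats = np.unique(col)
    return (col[:, None] == cats[None, :]).astype(float)

def orth(Z):
    # center, then column-orthonormalize via QR; centering removes the
    # trivial constant solution
    Z = Z - Z.mean(axis=0)
    Q, _ = np.linalg.qr(Z)
    return Q

def gifi_mca(data, p=2, n_iter=500, seed=1):
    # data: n x m matrix of category codes; no constraints on Y_j, i.e. MCA
    n, m = data.shape
    G = [indicator(data[:, j]) for j in range(m)]
    D = [Gj.T @ Gj for Gj in G]                     # marginals D_j = G_j'G_j
    X = orth(np.random.default_rng(seed).standard_normal((n, p)))
    for _ in range(n_iter):
        # category centroids Yhat_j = D_j^{-1} G_j' X, optimal unconstrained Y_j
        Y = [np.linalg.solve(Dj, Gj.T @ X) for Gj, Dj in zip(G, D)]
        # object scores: orthonormalized sum of the G_j Y_j
        X = orth(sum(Gj @ Yj for Gj, Yj in zip(G, Y)))
    return X, Y
```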
Gifi
Star Plots
Let’s look at some movies.
GALO: 1290 students, 4 variables. We show both MCA and NLPCA.
Senate: 100 senators, 20 votes. Since the variables are binary, MCA =
NLPCA.
[Figure: objplot galo. Object score plot for the 1290 GALO students, points labeled by student number.]
[Figure: objplot senate. Object score plot for the 100 senators, points labeled by name.]
Gifi
Single Variables
If there are no constraints on the Y_j, homogeneity analysis is MCA.
We will not go into additivity constraints, because they take us from PCA and MCA towards regression and canonical analysis. See the homals paper and package.
A single variable has the constraint Y_j = z_j a_j', i.e. its category quantifications are of rank one. In a given analysis some variables can be single while others can be multiple (unconstrained). More generally, there can be rank constraints on the Y_j.
This can be combined with level constraints on the single quantifications z_j, which can be numerical, polynomial, ordinal, or nominal.
If all variables are single, homogeneity analysis is NLPCA (i.e. PCA-OS). This relationship follows from the form of the loss function.
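For a single nominal variable, the inner step that replaces the unconstrained quantification by its best rank-one approximation z_j a_j' in the D_j metric can be sketched as follows, reusing the centroids Ŷ_j and marginals D_j of the ALS sketch above; names and normalization are my own, and PRINCALS/homals use somewhat different conventions.

```python
import numpy as np

def single_nominal_update(Yhat, d):
    # best rank-one fit Y_j = z a' to the centroids Yhat (k_j x p) in the
    # metric D_j = diag(d): dominant singular triple of D^{1/2} Yhat
    s = np.sqrt(d)
    U, sig, Vt = np.linalg.svd(s[:, None] * Yhat, full_matrices=False)
    z = sig[0] * U[:, 0] / s      # single (nominal) category quantifications
    a = Vt[0]                     # loadings
    return z, a                   # the constrained Y_j is np.outer(z, a)
```

For an ordinal or numerical level, z would additionally be restricted (by monotone or linear regression on the category values) before recomputing a.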
Gifi
Multiple Variables
There is another relationship, which is already implicit in Guttman
(1941). If we transform the variables to maximize the dominant
eigenvalue of the correlation matrix, then we find both the first MCA
dimension and the one-dimensional nominal PCA solution.
But there are deeper relations between MCA and NLPCA. These were
developed in a series of papers by De Leeuw and co-workers, starting in
1980. Their analysis also elucidates the “Guttman effect”.
These relationships are most easily illustrated by performing an MCA of
a continuous standardized multivariate normal, say on m variables,
analyzed in the form of a Burt Table (with doubly-infinite subtables).
Suppose the correlation matrix of this distribution is the m × m matrix R = {r_{jℓ}}.
Gifi
Multinormal MCA
Suppose I = R^[0], R = R^[1], R^[2], ... is the infinite sequence of Hadamard (elementwise) powers of R.
Suppose the λ_j^[s] are the m eigenvalues of R^[s] and the y_j^[s] are the corresponding eigenvectors.
The eigenvalues of the MCA solution are the m × ∞ eigenvalues λ_j^[s].
The MCA eigenvector corresponding to λ_j^[s] consists of the m functions y_{jℓ}^[s] H_s, with H_s the s-th normalized Hermite polynomial.
An MCA eigenvector consists of m linear transformations, or m quadratic transformations, and so on. There are m linear eigenvectors, m quadratic eigenvectors, and so on.
Gifi
Multinormal MCA
The same theory applies to what Yule (1900) calls “strained multinormal” variables z_j, for which there exist diffeomorphisms φ_j such that the φ_j(z_j) are jointly multinormal (Gaussian copulas are an example).
And the same theory also applies, except for the polynomial part, when separate transformations of the variables exist that linearize all bivariate regressions (this generalizes a result of Pearson from 1905).
Under all these scenarios, MCA solutions are NLPCA solutions, and
vice versa.
The proviso is that NLPCA solutions are always selected from the same R^[s], while MCA solutions come from all R^[s].
Also, generally, the dominant eigenvalue is λ_1^[1] and the second largest one is either λ_1^[2] or λ_2^[1]. In the first case we have a horseshoe.
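A small numerical illustration of this eigenvalue bookkeeping, using a made-up 3 × 3 correlation matrix: the eigenvalues of the first few Hadamard powers are the candidate MCA eigenvalues, and a horseshoe is expected when λ_1^[2] exceeds λ_2^[1].

```python
import numpy as np

R = np.array([[1.0, 0.8, 0.7],
              [0.8, 1.0, 0.6],
              [0.7, 0.6, 1.0]])       # hypothetical correlation matrix

for s in range(1, 4):
    lam = np.sort(np.linalg.eigvalsh(R ** s))[::-1]   # ** is elementwise here
    print(f"eigenvalues of R^[{s}]:", np.round(lam, 3))

lam1 = np.sort(np.linalg.eigvalsh(R))[::-1]       # spectrum of R^[1]
lam2 = np.sort(np.linalg.eigvalsh(R ** 2))[::-1]  # spectrum of R^[2]
print("horseshoe expected:", lam2[0] > lam1[1])
```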
Gifi
Bilinearizability
“Joint bilinearizability” also occurs (trivially) if m = 2, i.e. in CA, and if k_j = 2 for all j, i.e. for binary variables.
If there is joint bilinearizability, then the joint first-order asymptotic normal distribution of the induced correlation coefficients is not affected by the sampling variability of the computed optimal transformations (no matter whether they come from MCA, NLPCA, or any other OS method).
There is additional horseshoe theory, due mostly to Schriever (1986),
that uses the Krein-Gantmacher-Karlin theory of total positivity. It is not
based on families of orthogonal polynomials, but on (higher-order) order
relations.
This was, once again, anticipated by Guttman (1950), who used finite difference equations to derive the MCA/NLPCA horseshoe for binary items defining a perfect scale.
More
Pavings
If we have a mapping of n objects into R^p, then a categorical variable can be used to label the objects.
The subsets corresponding to the categories of the variables are
supposed to be homogeneous.
This can be formalized either as being small or as being separated by
lines or curves. There are many ways to quantify this in loss functions.
MCA (multiple variables, star plots) tends to think small (within vs
between), NLPCA tends to think separable.
Guttman’s MSA defines outer and inner points of a category. An outer
point is a closest point for any point not in the category.
The closest outer point for an inner point should belong to the same
category as the inner point. This is a nice “topological” way to define
separation, but it is hard to quantify.
More
Aspects
The aspect approach (De Leeuw and Mair, JSS, 2010, using theory
from De Leeuw, 1990) goes back to Guttman’s 1941 original motivation.
An aspect is any function of all correlations (and/or the correlation ratios) between the m transformed variables.
Now choose the transformations/quantifications such that the aspect is maximized. We use majorization to turn this into a sequence of least squares problems.
For MCA the aspect is the largest eigenvalue, for NLPCA it is the sum of
the largest p eigenvalues.
Determinants, multiple correlations, canonical correlations can also be
used as aspects.
Or the sum of the differences between the correlation ratios and the squared correlation coefficients.
Multinormal, strained multinormal, and bilinearizability theory applies to
all aspects.
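A rough sketch of the majorization idea for the largest-eigenvalue (MCA) aspect with nominal transformations: the gradient of the largest eigenvalue at the current correlation matrix is vv', with v the dominant eigenvector, so each variable is updated by projecting a weighted sum of the other transformed variables onto its space of step functions and standardizing. This is my own illustration, not the code of the aspect or homals packages.

```python
import numpy as np

def standardize(y):
    y = y - y.mean()
    return y / np.sqrt((y ** 2).mean())        # mean zero, variance one

def project_nominal(codes, target):
    # projection onto step functions of a categorical variable: replace each
    # observation by the mean target value within its category
    out = np.empty_like(target, dtype=float)
    for c in np.unique(codes):
        mask = codes == c
        out[mask] = target[mask].mean()
    return out

def max_eigenvalue_aspect(data, n_iter=100):
    # data: n x m matrix of category codes; maximize the largest eigenvalue
    # of the induced correlation matrix over nominal transformations
    n, m = data.shape
    Y = np.column_stack([standardize(data[:, j].astype(float)) for j in range(m)])
    for _ in range(n_iter):
        R = (Y.T @ Y) / n
        v = np.linalg.eigh(R)[1][:, -1]        # gradient of lambda_max is vv'
        for j in range(m):
            target = Y @ v - v[j] * Y[:, j]    # weighted sum of the others
            t = project_nominal(data[:, j], np.sign(v[j]) * target)
            if t.std() > 0:
                Y[:, j] = standardize(t)
    return Y, np.linalg.eigvalsh((Y.T @ Y) / n)[-1]
```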
More
Logistic
Instead of using least squares throughout, we can build a similar system
using logit or probit log likelihoods. This is in the development stages.
The basic loss function, corresponding to the Gifi loss function, is

\sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{\ell=1}^{k_j} g_{ij\ell} \log \frac{\exp\{-\phi(x_i, y_{j\ell})\}}{\sum_{\nu=1}^{k_j} \exp\{-\phi(x_i, y_{j\nu})\}},
where the data are indicators, as before. The function φ can be
distance, squared distance, or negative inner product.
This emphasizes separation, because we want x_i to be closest to the y_{jℓ} for which g_{ijℓ} = 1.
We use majorization to turn this into a sequence of reweighted least
squares MDS or PCA problems.
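A minimal sketch that simply evaluates this log-likelihood for given object scores, category quantifications, and indicator matrices; the function name and the choice between squared distance and negative inner product for φ are my own.

```python
import numpy as np
from scipy.special import logsumexp

def logistic_homogeneity_loglik(X, Y, G, phi="sqdist"):
    # X: n x p object scores; Y: list of k_j x p category quantifications;
    # G: list of n x k_j indicator matrices
    total = 0.0
    for Gj, Yj in zip(G, Y):
        if phi == "sqdist":
            # squared Euclidean distances between objects and categories
            d = ((X[:, None, :] - Yj[None, :, :]) ** 2).sum(axis=2)
        else:
            d = -X @ Yj.T                      # negative inner product
        # log of exp{-phi} / sum_nu exp{-phi}, computed stably
        logp = -d - logsumexp(-d, axis=1, keepdims=True)
        total += (Gj * logp).sum()
    return total
```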