History and Theory of Nonlinear Principal Component Analysis
Jan de Leeuw, UCLA Department of Statistics
February 11, 2011

Abstract

Relationships between Multiple Correspondence Analysis (MCA) and Nonlinear Principal Component Analysis (NLPCA), which is defined as PCA with Optimal Scaling (OS), are discussed. We review the history of NLPCA and discuss the forms of NLPCA that have been proposed over the years: Shepard-Kruskal-Breiman-Friedman-Gifi PCA with optimal scaling, Aspect Analysis of correlations, Guttman's MSA, Logit and Probit PCA of binary data, and Logistic Homogeneity Analysis. Since I am trying to summarize 40+ years of work, the presentation will be rather dense.

Linear PCA: History

(Linear) Principal Components Analysis (PCA) is sometimes attributed to Hotelling (1933), but that is surely incorrect. The equations for the principal axes of quadratic forms and surfaces were known, in various forms, from classical analytic geometry (notably from the work of Cauchy and Jacobi in the mid nineteenth century). There are some modest beginnings in Galton's Natural Inheritance of 1889, where the principal axes are connected for the first time with the "correlation ellipsoid". There is a full-fledged (although tedious) discussion of the technique in Pearson (1901), and there is a complete application (7 physical traits of 3000 criminals) in MacDonell (1902), by a Pearson co-worker. There is proper attribution in: Burt, C., Alternative Methods of Factor Analysis and their Relations to Pearson's Method of "Principal Axes", Br. J. Psych., Stat. Sec., 2 (1949), pp. 98-121.

Linear PCA: How To

Hotelling's introduction of PCA follows the now familiar route of making successive orthogonal linear combinations with maximum variance. He does this using power iterations (without reference), discussed in 1929 by Von Mises and Pollaczek-Geiringer. Pearson, following Galton, used the correlation ellipsoid throughout, which seems to me the more basic approach. He cast the problem in terms of finding low-dimensional subspaces (lines and planes) of best (least squares) fit to a cloud of points, and connected the solution to the principal axes of the correlation ellipsoid. In modern notation, this means minimizing SSQ(Y - XB') over n x r matrices X and m x r matrices B. For r = 1 this gives the best fitting line, and so on.
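Neither formulation needs much machinery in modern software. The following is a minimal sketch (my own illustration, not from the slides) of Hotelling's power iteration for the dominant eigenvector of a correlation matrix and of Pearson's least squares rank-r fit via the singular value decomposition; the toy data matrix Y is purely an assumption for the example.

```r
# Minimal sketch (illustration only): Hotelling-style power iteration and
# Pearson's least squares low-rank fit, on a small toy data matrix.
set.seed(1)
Y <- scale(matrix(rnorm(100 * 5), 100, 5))   # toy data, centered and scaled
R <- crossprod(Y) / (nrow(Y) - 1)            # correlation matrix

# Hotelling: power iteration converges to the dominant eigenvector of R
b <- rnorm(ncol(R))
for (i in 1:100) {
  b <- drop(R %*% b)
  b <- b / sqrt(sum(b^2))
}

# Pearson: minimize SSQ(Y - X B') over n x r X and m x r B, solved by the SVD
r <- 2
s <- svd(Y)
X <- s$u[, 1:r] %*% diag(s$d[1:r])           # component scores
B <- s$v[, 1:r]                              # loadings
loss <- sum((Y - X %*% t(B))^2)              # residual sum of squares
```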
Correspondence Analysis: History

Simple Correspondence Analysis (CA) of a bivariate frequency table was first discussed, in fairly rudimentary form, by Pearson (1905), by looking at transformations linearizing regressions. See De Leeuw, On the Prehistory of Correspondence Analysis, Statistica Neerlandica, 37 (1983), 161-164. This was taken up by Hirschfeld (Hartley) in 1935, where the technique was presented in a fairly complete form (to maximize correlation and decompose contingency). This approach was later adopted by Gebelein, and by Renyi and his students in their study of maximal correlation.

In the 1938 edition of Statistical Methods for Research Workers, Fisher scores a categorical variable to maximize a ratio of variances (quadratic forms). This is not quite CA, because it is presented in an (asymmetric) regression context. Symmetric CA and the reciprocal averaging algorithm are discussed, however, in Fisher (1940) and applied by his co-worker Maung (1941a,b). In the early sixties the chi-square metric, relating CA to metric multidimensional scaling (MDS), with an emphasis on geometry and plotting, was introduced by Benzécri (thesis of Cordier, 1965).

Multiple Correspondence Analysis: History

Different weighting schemes to combine quantitative variables into an index that optimizes some variance-based discrimination or homogeneity criterion were proposed in the late thirties by Horst (1936), by Edgerton and Kolbe (1936), and by Wilks (1938). The same idea was applied to qualitative (categorical) variables in a seminal paper by Guttman (1941), which presents, for the first time, the equations defining Multiple Correspondence Analysis (MCA). The equations are given in the form of a row-eigen (scores), a column-eigen (weights), and a singular value (joint) problem. The paper introduces the "codage disjonctif complet" as well as the "Tableau de Burt", and points out the connections with the chi-square metric. There is no geometry, and the emphasis is on constructing a single scale. In fact Guttman warns against extracting and using additional eigen-pairs.

Multiple Correspondence Analysis: Further History

In Guttman (1946) scale or index construction was extended to paired comparisons and ranks; in Guttman (1950) it was extended to scalable binary items. In the fifties and sixties Hayashi introduced the quantification techniques of Guttman in Japan, where they were widely disseminated through the work of Nishisato, and various extensions and variations were added by the Japanese school. Starting in 1968, MCA was studied as a form of metric MDS by De Leeuw. Although the equations defining MCA were the same as those defining PCA, the relationship between the two remained problematic. These problems are compounded by "horseshoes", or the "effect Guttman": artificial curvilinear relationships between successive dimensions (eigenvectors).

Nonlinear PCA: What?

PCA can be made nonlinear in various ways.

1. First, we could seek indices which discriminate maximally and are nonlinear combinations of the variables. This generalizes the weighting approach (Hotelling).
2. Second, we could find nonlinear combinations of components that are close to the observed variables. This generalizes the reduced rank approach (Pearson).
3. Third, we could look for transformations of the variables that optimize the linear PCA fit. This is known (a term of Darrell Bock) as the optimal scaling (OS) approach.

Nonlinear PCA: Forms

The first approach has not been studied much, although there are some relations with Item Response Theory. The second approach is currently popular in Computer Science, as "nonlinear dimension reduction". I am currently working on a polynomial version, but there is no unified theory, and the papers are usually of the "well, we could also do this" type familiar from cluster analysis. The third approach preserves many of the properties of linear PCA and can be connected with MCA as well. We shall follow its history and discuss the main results.
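As a computational point of reference for the MCA side of what follows, here is a minimal sketch (my own illustration, not the homals implementation) of plain MCA in Guttman's 1941 form: complete disjunctive coding of a few categorical variables and an SVD in the chi-square metric. The toy data frame and all object names are assumptions made for the example.

```r
# Minimal sketch (illustration only): MCA via the indicator matrix ("codage
# disjonctif complet") and a singular value decomposition in the chi-square metric.
set.seed(2)
dat <- data.frame(a = factor(sample(1:3, 50, replace = TRUE)),
                  b = factor(sample(1:2, 50, replace = TRUE)),
                  c = factor(sample(1:4, 50, replace = TRUE)))
G <- model.matrix(~ . - 1, dat,                      # complete dummy coding
                  contrasts.arg = lapply(dat, contrasts, contrasts = FALSE))
d  <- colSums(G)                                     # category marginals
Gc <- scale(G, center = TRUE, scale = FALSE)         # removes the trivial solution
s  <- svd(Gc %*% diag(1 / sqrt(d)))                  # chi-square metric on columns
X  <- s$u[, 1:2]                                     # object scores, 2 dimensions
Y  <- diag(1 / sqrt(d)) %*% s$v[, 1:2]               # category quantifications
burt <- crossprod(G)                                 # the "Tableau de Burt"
```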
Nonlinear PCA: PCA with OS

Guttman observed in 1959 that if we require the regressions between monotonically transformed variables to be linear, then the transformations are uniquely defined. In general, however, we need approximations. The loss function for PCA-OS is SSQ(Y - XB'), as before, but now we minimize over components X, loadings B, and transformations Y. Transformations are defined column-wise (over variables) and belong to some restricted class (monotone, step, polynomial, spline). Algorithms are often of the alternating least squares type, in which optimal transformation and low-rank matrix approximation are alternated until convergence.

PCA-OS: History of Programs

Shepard and Kruskal used the monotone regression machinery of nonmetric MDS to construct the first PCA-OS programs around 1962; the paper describing the technique was not published until 1975. Around 1970 versions of PCA-OS (sometimes based on Guttman's rank image principle) were developed by Lingoes and Roskam. In 1973 De Leeuw, Young, and Takane started the ALSOS project, which resulted in PRINCIPALS (published in 1978) and in PRINQUAL in SAS. In 1980 De Leeuw (with Heiser, Meulman, Van Rijckevorsel, and many others) started the Gifi project, which resulted in PRINCALS, in SPSS CATPCA, and in the R package homals by De Leeuw and Mair (2009). In 1983 Winsberg and Ramsay published a PCA-OS version using monotone spline transformations. In 1987 Koyak, using the ACE smoothing methodology of Breiman and Friedman (1985), introduced mdrace.

PCA/MCA: The Gifi Project

The Gifi project followed the ALSOS project. It has or had as its explicit goals:

1. Unify a large class of multivariate analysis methods by combining a single loss function, parameter constraints (as in MDS), and ALS algorithms.
2. Give a very general definition of component analysis (to be called homogeneity analysis) that would cover CA, MCA, linear PCA, nonlinear PCA, regression, discriminant analysis, and canonical analysis.
3. Write code and analyze examples for homogeneity analysis.

Gifi: Loss of Homogeneity

The basic Gifi loss function is

σ(X, Y) = ∑_{j=1}^{m} SSQ(X - G_j Y_j).

The n x k_j matrices G_j are the data, coded as indicator matrices (or dummies). Alternatively, G_j can be a B-spline basis. Also, G_j can have zero rows for missing data. X is an n x p matrix of object scores, satisfying the normalization conditions X'X = I. The Y_j are k_j x p matrices of category quantifications. There can be rank, level, and additivity constraints on the Y_j.

Gifi: ALS

The basic Gifi algorithm alternates

1. X^(k) = ORTH(∑_{j=1}^{m} G_j Y_j^(k)),
2. Y_j^(k+1) = argmin_{Y_j ∈ 𝒴_j} tr (Ŷ_j^(k+1) - Y_j)' D_j (Ŷ_j^(k+1) - Y_j).

We use the following notation. Superscript (k) is used for iterations. ORTH() is any orthogonalization method such as QR, Gram-Schmidt, or the SVD. D_j = G_j'G_j are the marginals. Ŷ_j^(k+1) = D_j^{-1} G_j' X^(k) are the category centroids. The constraints on Y_j are written as Y_j ∈ 𝒴_j.
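The following is a minimal sketch (my own illustration, assuming a list Glist of indicator matrices with one element per variable; it is not the homals or PRINCALS code) of this ALS scheme for unconstrained, i.e. multiple nominal, variables. With no constraints on the Y_j this computes an MCA solution; the rank-one restriction needed for single variables is sketched later, after the discussion of single variables.

```r
# Minimal sketch (illustration only) of the basic Gifi ALS iteration for
# unconstrained variables: X = ORTH(sum_j G_j Y_j), Y_j = D_j^{-1} G_j' X.
gifiALS <- function(Glist, p = 2, itmax = 100, eps = 1e-8) {
  n <- nrow(Glist[[1]])
  X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))          # random orthonormal start
  dlist <- lapply(Glist, colSums)                    # marginals diag(D_j)
  fold <- Inf
  for (k in 1:itmax) {
    Ylist <- lapply(seq_along(Glist), function(j)    # category centroids
      crossprod(Glist[[j]], X) / dlist[[j]])
    Z <- Reduce(`+`, lapply(seq_along(Glist), function(j)
      Glist[[j]] %*% Ylist[[j]]))
    X <- qr.Q(qr(scale(Z, center = TRUE, scale = FALSE)))  # centered ORTH() step
    f <- sum(sapply(seq_along(Glist), function(j)
      sum((X - Glist[[j]] %*% Ylist[[j]])^2)))       # Gifi loss
    if (fold - f < eps) break
    fold <- f
  }
  list(X = X, Y = Ylist, loss = f)
}
```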
Gifi: Star Plots

Let's look at some movies. GALO: 1290 students, 4 variables. We show both MCA and NLPCA. Senate: 100 senators, 20 votes. Since the variables are binary, MCA = NLPCA.

[Object plot for the GALO data; points are labeled with student numbers 1-1290.]
[Object plot for the Senate data; points are labeled with senator names.]

Gifi: Single Variables

If there are no constraints on the Y_j, homogeneity analysis is MCA. We will not go into additivity constraints, because they take us from PCA and MCA towards regression and canonical analysis; see the homals paper and package. A single variable has the constraint Y_j = z_j a_j', i.e. its category quantifications are of rank one. In a given analysis some variables can be single while others can be multiple (unconstrained). More generally, there can be rank constraints on the Y_j. This can be combined with level constraints on the single quantifications z_j, which can be numerical, polynomial, ordinal, or nominal. If all variables are single, homogeneity analysis is NLPCA (i.e. PCA-OS). This relationship follows from the form of the loss function.
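A minimal sketch (my own illustration, with hypothetical argument names; level constraints are only mentioned in a comment) of the single-variable restriction: given the unconstrained centroid Ŷ_j and the marginals D_j, the rank-one quantification z_j a_j' is a weighted rank-one approximation.

```r
# Minimal sketch (illustration only): the rank-one ("single variable") restriction.
# Yhat is the k_j x p unconstrained centroid, d the vector of category marginals;
# we minimize tr (Yhat - z a')' D_j (Yhat - z a') by a weighted rank-one fit.
rankOne <- function(Yhat, d) {
  s <- svd(sqrt(d) * Yhat, nu = 1, nv = 1)   # SVD in the D^(1/2) metric
  z <- (s$u[, 1] * s$d[1]) / sqrt(d)         # single category quantification
  a <- s$v[, 1]                              # loadings for this variable
  list(z = z, a = a)
}
# An ordinal level constraint would additionally replace z by its (weighted)
# monotone regression on the category order; a numerical constraint would fix z
# at the given category values.
```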
Gifi: Multiple Variables

There is another relationship, which is already implicit in Guttman (1941). If we transform the variables to maximize the dominant eigenvalue of the correlation matrix, then we find both the first MCA dimension and the one-dimensional nominal PCA solution. But there are deeper relations between MCA and NLPCA. These were developed in a series of papers by De Leeuw and co-workers, starting in 1980, and their analysis also elucidates the "effect Guttman". These relationships are most easily illustrated by performing an MCA of a continuous standardized multivariate normal distribution, say on m variables, analyzed in the form of a Burt table (with doubly-infinite subtables). Suppose the correlation matrix of this distribution is the m x m matrix R = {r_{jℓ}}.

Gifi: Multinormal MCA

Suppose I = R^[0], R = R^[1], R^[2], ... is the infinite sequence of Hadamard (elementwise) powers of R. Suppose the λ_j^[s] are the m eigenvalues of R^[s], and the y_j^[s] are the corresponding eigenvectors. The eigenvalues of the MCA solution are the m x ∞ eigenvalues λ_j^[s]. The MCA eigenvector corresponding to λ_j^[s] consists of the m functions y_{jℓ}^[s] H_s, with H_s the s-th normalized Hermite polynomial. Thus an MCA eigenvector consists of m linear transformations, or m quadratic transformations, and so on; there are m linear eigenvectors, m quadratic eigenvectors, and so on.

The same theory applies to what Yule (1900) calls "strained multinormal" variables z_j, for which there exist diffeomorphisms φ_j such that the φ_j(z_j) are jointly multinormal (Gaussian copulas are an example). And the same theory also applies, except for the polynomial part, when separate transformations of the variables exist that linearize all bivariate regressions (this generalizes a result of Pearson from 1905). Under all these scenarios MCA solutions are NLPCA solutions, and vice versa, with the provision that NLPCA solutions are always selected from the same R^[s], while MCA solutions come from all the R^[s]. Also, generally, the dominant eigenvalue is λ_1^[1], and the second largest one is either λ_1^[2] or λ_2^[1]. In the first case we have a horseshoe.

Gifi: Bilinearizability

The "joint bilinearizability" also occurs (trivially) if m = 2, i.e. in CA, and if k_j = 2 for all j, i.e. for binary variables. If there is joint bilinearizability, then the joint first-order asymptotic normal distribution of the induced correlation coefficients does not depend on the sampling variability of the computed optimal transformations (no matter whether they come from MCA, NLPCA, or any other OS method). There is additional horseshoe theory, due mostly to Schriever (1986), that uses the Krein-Gantmacher-Karlin theory of total positivity. It is not based on families of orthogonal polynomials, but on (higher-order) order relations. This was, once again, anticipated by Guttman (1950), who used finite difference equations to derive the horseshoe MCA/NLPCA for the binary items defining a perfect scale.
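The eigenvalue structure, including the horseshoe condition, is easy to check numerically. A minimal sketch (my own illustration, using an equicorrelated toy correlation matrix) compares the eigenvalues of the first few Hadamard powers.

```r
# Minimal sketch (illustration only): eigenvalues of Hadamard powers of R.
# Under multinormality these are the MCA eigenvalues; a horseshoe occurs when
# lambda_1 of R^[2] exceeds lambda_2 of R^[1].
R <- matrix(0.6, 4, 4); diag(R) <- 1     # equicorrelated toy correlation matrix
ev1 <- eigen(R)$values                   # linear eigenvalues     (s = 1)
ev2 <- eigen(R * R)$values               # quadratic eigenvalues  (s = 2)
ev3 <- eigen(R * R * R)$values           # cubic eigenvalues      (s = 3)
ev1[1] > max(ev2[1], ev1[2])             # dominant MCA eigenvalue is lambda_1^[1]
ev2[1] > ev1[2]                          # TRUE here, so the second MCA dimension
                                         # is quadratic in the first: a horseshoe
```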
More: Pavings

If we have a mapping of n objects into R^p, then a categorical variable can be used to label the objects. The subsets corresponding to the categories of the variable are supposed to be homogeneous. This can be formalized either as the subsets being small or as their being separated by lines or curves, and there are many ways to quantify this in loss functions. MCA (multiple variables, star plots) tends to think "small" (within versus between); NLPCA tends to think "separable". Guttman's MSA defines outer and inner points of a category: an outer point is a point of the category that is closest to some point not in the category, and the closest outer point to an inner point should belong to the same category as the inner point. This is a nice "topological" way to define separation, but it is hard to quantify.

More: Aspects

The aspect approach (De Leeuw and Mair, JSS, 2010, using theory from De Leeuw, 1990) goes back to Guttman's original 1941 motivation. An aspect is any function of all the correlations (and/or the correlation ratios) between the m transformed variables. We then choose the transformations/quantifications such that the aspect is maximized, using majorization to turn this into a sequence of least squares problems. For MCA the aspect is the largest eigenvalue; for NLPCA it is the sum of the largest p eigenvalues. Determinants, multiple correlations, and canonical correlations can also be used as aspects, or the sum of the differences between the correlation ratios and the squared correlation coefficients. The multinormal, strained multinormal, and bilinearizability theory applies to all aspects.

More: Logistic

Instead of using least squares throughout, we can build a similar system using logit or probit log-likelihoods. This is in the development stages. The basic loss function, corresponding to the Gifi loss function, is

∑_{i=1}^{n} ∑_{j=1}^{m} ∑_{ℓ=1}^{k_j} g_{ijℓ} log [ exp{-φ(x_i, y_{jℓ})} / ∑_{ν=1}^{k_j} exp{-φ(x_i, y_{jν})} ],

where the data are indicators, as before. The function φ can be distance, squared distance, or negative inner product. This emphasizes separation, because we want each x_i closest to the y_{jℓ} for which g_{ijℓ} = 1. We use majorization to turn this into a sequence of reweighted least squares MDS or PCA problems.
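A minimal sketch (my own illustration, with hypothetical argument names) of evaluating this loss with φ taken as squared Euclidean distance; the sign is flipped so that smaller values are better.

```r
# Minimal sketch (illustration only): the logistic homogeneity loss with
# phi = squared Euclidean distance.  Glist is a list of n x k_j indicator
# matrices, X an n x p matrix of object scores, Ylist a list of k_j x p
# matrices of category points.  Returns minus the log-likelihood above.
logisticLoss <- function(Glist, X, Ylist) {
  loss <- 0
  for (j in seq_along(Glist)) {
    Y  <- Ylist[[j]]
    D2 <- outer(rowSums(X^2), rowSums(Y^2), `+`) - 2 * X %*% t(Y)  # squared distances
    P  <- exp(-D2) / rowSums(exp(-D2))          # multinomial category probabilities
    loss <- loss - sum(Glist[[j]] * log(P))     # minus sum of g_ijl log p_ijl
  }
  loss
}
```

Majorizing this function yields the reweighted least squares MDS or PCA subproblems mentioned above.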