Multiple taxicab correspondence analysis

Choulakian, V.
Université de Moncton, Moncton, N.B., E1A 3E9, Canada. choulav@umoncton.ca

Summary. We compare the statistical analysis of indicator matrices and Burt tables by correspondence analysis (CA) and taxicab correspondence analysis (TCA). There are two new results in this paper. First, TCA of a Burt table corresponds to a particular kind of CA of the indicator matrix based on the centroid decomposition. Second, the response patterns in multiple TCA are represented as (number of variables + 1) equidistant cluster points on the first principal axis.

Key words: Indicator matrix; supplementary points; Burt table; response pattern; correspondence analysis; taxicab correspondence analysis; centroid method; matrix norms.

1 Introduction

The analysis of indicator matrices is usually done by multiple correspondence analysis (Benzecri, 1973; Greenacre, 1984), also named dual scaling (Nishisato, 1994) or homogeneity analysis (Gifi, 1990). The aim of this paper is to compare the statistical analysis of indicator matrices by correspondence analysis (CA) and taxicab correspondence analysis (TCA). TCA is an $L_1$ version of CA recently proposed by Choulakian (2006a).

In section 2 we briefly present a mathematical description of TCA. In section 3 we present the novel results of this paper: multiple taxicab correspondence analysis (MTCA) of indicator matrices and the analysis of Burt tables by TCA. Section 4 compares both methods on a well-known data set. In section 5 we conclude with some remarks.

There are two novel results in this paper. First, TCA of a Burt table corresponds to a particular kind of CA of the disjunctive table based on the centroid decomposition. Second, the response patterns in MTCA are represented as (number of variables + 1) equidistant cluster points on the first principal axis.
2 Taxicab correspondence analysis

Let $P = T/n$ be a correspondence matrix, where $T$, of dimension $r \times c$, is a contingency table and $n = \sum_{j=1}^{c} \sum_{i=1}^{r} T_{ij}$ is the grand total of $T$. We define $p_{i.} = \sum_{j=1}^{c} p_{ij}$, $p_{.j} = \sum_{i=1}^{r} p_{ij}$, $D_r = \mathrm{Diag}(p_{i.})$, a diagonal matrix having diagonal elements $p_{i.}$, and similarly $D_c = \mathrm{Diag}(p_{.j})$. The $q$-th vector norm of a vector $v = (v_1, \ldots, v_m)'$ is defined to be $\|v\|_q = (\sum_{i=1}^{m} |v_i|^q)^{1/q}$ for $q \geq 1$, and $\|v\|_\infty = \max_i |v_i|$.

Let $k = \mathrm{rank}(P) - 1$. In TCA the calculation of the dispersion measures $\lambda_\alpha$, principal axes $v_\alpha$ and $u_\alpha$, and principal factor scores $g_\alpha$ and $s_\alpha$, for $\alpha = 0, 1, \ldots, k$, is done in a stepwise manner. We put $P_0 = P$. Let $P_\alpha$ be the residual correspondence matrix at the $\alpha$-th iteration. The variational definitions of the TCA at the $\alpha$-th iteration are

$$
\begin{aligned}
\lambda_\alpha &= \max_{v} \frac{\|P_\alpha v\|_1}{\|v\|_\infty} = \max_{u} \frac{\|P_\alpha' u\|_1}{\|u\|_\infty} = \max_{u,v} \frac{u' P_\alpha v}{\|u\|_\infty \|v\|_\infty} \\
&= \max \|P_\alpha v\|_1 \ \text{subject to } v_j = 1 \text{ or } -1 \text{ for } j = 1, \ldots, c, \\
&= \max \|P_\alpha' u\|_1 \ \text{subject to } u_i = 1 \text{ or } -1 \text{ for } i = 1, \ldots, r.
\end{aligned} \tag{1}
$$

Let

$$ v_\alpha = \arg\max_{v_j = \pm 1} \|P_\alpha v\|_1, \tag{2} $$

$$ u_\alpha = \arg\max_{u_i = \pm 1} \|P_\alpha' u\|_1. \tag{3} $$

Then the transition formulas are

$$ g_\alpha = D_r^{-1} P_\alpha v_\alpha, \tag{4} $$

$$ s_\alpha = D_c^{-1} P_\alpha' u_\alpha, \tag{5} $$

$$ u_\alpha = \mathrm{sgn}(g_\alpha), \tag{6} $$

$$ v_\alpha = \mathrm{sgn}(s_\alpha), \tag{7} $$

where $\mathrm{sgn}(\cdot)$ is the coordinatewise sign function, $\mathrm{sgn}(x) = 1$ if $x > 0$, and $\mathrm{sgn}(x) = -1$ if $x \leq 0$. The $\alpha$-th taxicab dispersion measure can be represented in many different ways:

$$
\begin{aligned}
\lambda_\alpha &= \|P_\alpha v_\alpha\|_1 = \|D_r g_\alpha\|_1 = u_\alpha' D_r g_\alpha \\
&= \|P_\alpha' u_\alpha\|_1 = \|D_c s_\alpha\|_1 = v_\alpha' D_c s_\alpha.
\end{aligned} \tag{8}
$$

The $(\alpha + 1)$-th residual correspondence matrix is

$$ P_{\alpha+1} = P_\alpha - D_r g_\alpha s_\alpha' D_c / \lambda_\alpha. \tag{9} $$

Similar to ordinary CA, the total dispersion is defined to be $\sum_{\alpha=1}^{k} \lambda_\alpha^2$, the proportion of the variation explained by the $\alpha$-th principal axis is $\lambda_\alpha^2 / \sum_{\beta=1}^{k} \lambda_\beta^2$, and the cumulative explained variation is $CEV(\alpha) = \sum_{\gamma=1}^{\alpha} \lambda_\gamma^2 / \sum_{\beta=1}^{k} \lambda_\beta^2$, for $\alpha = 0, 1, \ldots, k$. We note that

$$ P_1 = P - p_r p_c', \tag{10} $$

where $p_r = (p_{i.})$ and $p_c = (p_{.j})$ are the vectors of row and column marginals; that is, the best rank one approximation of $P$ is given by $(p_{i.} p_{.j})$, which is the correspondence matrix obtained under the independence assumption between the row and column variables. This solution is considered trivial both here and in CA. The reconstitution formula in TCA and CA is

$$ p_{ij} = p_{i.} p_{.j} \left[ 1 + \sum_{\alpha=1}^{k} g_\alpha(i) s_\alpha(j) / \lambda_\alpha \right]. \tag{11} $$

The calculation of the principal scores and the principal component weights of TCA can be accomplished by two algorithms. The first one is based on complete enumeration using (2) or (3). The second one is based on iterating the transition formulae (4)-(7), which is similar to the reciprocal averaging algorithm used in CA. This is an ascent algorithm. The iterative algorithm can converge to a local maximum, so it should be restarted from multiple initial points. The rows or the columns of the data can be used as initial values. More technical details about TCA and a deeper comparison between TCA and CA are given in Choulakian (2006a).

3 Multiple taxicab correspondence analysis and TCA of Burt tables

Let $Z$ be a complete disjunctive table of $p$ categorical variables $Z_1, Z_2, \ldots, Z_p$ with respectively $m_1, m_2, \ldots, m_p$ modalities observed over a sample of $n$ individuals. CA of the super indicator matrix $Z$ of dimension $n \times \sum_{i=1}^{p} m_i$ is named multiple correspondence analysis (MCA), or homogeneity analysis, or dual scaling. Similarly, the application of TCA to the super indicator matrix $Z$ will be named multiple taxicab correspondence analysis (MTCA). A novel result of MTCA is the following.

Theorem 1. The response patterns in MTCA are represented as (number of variables + 1) equidistant cluster points on the first principal axis.

This theorem should be compared with the Weber correspondence analysis of De Leeuw and Michailidis (2004), who showed that one-dimensional Weber correspondence analysis is a combinatorial optimization problem, and that the scores $g_1(i)$ take exactly two values, one negative and one positive.
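Theorem 1 can be checked numerically on simulated data. The sketch below is only an illustration, not the author's implementation: the function name `first_tca_axis`, the random indicator matrix, the seed and the stopping tolerance are all assumptions made here. It iterates the transition formulas (4)-(7) on the residual matrix (10), restarting from the sign pattern of every row as suggested in Section 2, and then verifies that the first-axis row scores lie on an equally spaced grid. One way to see why this should hold: every row of an indicator matrix has the same mass and exactly $p$ ones, so $g_1(i)$ equals $(1/p)$ times a sum of $p$ values $\pm 1$, minus a constant, which leaves at most $p + 1$ values spaced $2/p$ apart.

```python
import numpy as np


def sgn(x):
    """Coordinatewise sign with sgn(0) = -1, as defined in Section 2."""
    return np.where(x > 0, 1.0, -1.0)


def first_tca_axis(P):
    """First nontrivial TCA axis of a correspondence matrix P (illustrative sketch).

    Assumes every row of P has positive mass. Returns (lambda_1, v_1, g_1).
    """
    r = P.sum(axis=1)                  # row masses p_i.
    c = P.sum(axis=0)                  # column masses p_.j
    P1 = P - np.outer(r, c)            # residual matrix, equation (10)

    best_lam, best_v = -np.inf, None
    for start in P1:                   # restart from the sign pattern of each row
        v = sgn(start)
        lam = np.abs(P1 @ v).sum()     # ||P1 v||_1
        while True:
            u = sgn(P1 @ v)            # (4) and (6); positive masses do not change signs
            v_new = sgn(P1.T @ u)      # (5) and (7)
            lam_new = np.abs(P1 @ v_new).sum()
            if lam_new <= lam + 1e-12: # no further ascent: stop at this local maximum
                break
            v, lam = v_new, lam_new
        if lam > best_lam:
            best_lam, best_v = lam, v

    g = (P1 @ best_v) / r              # row scores, equation (4)
    return best_lam, best_v, g


# Hypothetical data: n individuals, p = 3 variables with 3, 2 and 4 categories.
rng = np.random.default_rng(0)
n, levels = 200, [3, 2, 4]
p = len(levels)
Z = np.hstack([np.eye(m)[rng.integers(0, m, size=n)] for m in levels])
P = Z / Z.sum()                        # correspondence matrix of the indicator matrix

lam, v, g = first_tca_axis(P)

# Express the scores in units of the predicted spacing 2/p.
steps = (g - g.min()) / (2.0 / p)
n_clusters = len(np.unique(np.round(steps).astype(int)))
print(n_clusters, "distinct first-axis scores; Theorem 1 allows at most", p + 1)
```

Because the restarts only guarantee a local maximum of (1), the sketch may not return the global optimum; the equidistance check does not depend on that, since it holds for any sign vector $v$.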
Note that the above theorem concerns only the scores $g_1(i)$.

3.1 TCA of Burt table

MCA of $Z$ is equivalent to the CA of the Burt table $B = Z'Z$. The two analyses produce the same factor scores of the modalities, but the eigenvalues in MCA of $Z$ are equal to the square roots of the eigenvalues of the CA of the associated Burt table. MTCA of $Z$ is not equivalent to the TCA of the Burt table $B = Z'Z$. In fact, we have the following optimization equations based on matrix norms:

$$ \max_{v} \frac{\|Z_\alpha v\|_2^2}{\|v\|_\infty} = \max_{v} \frac{\|B_\alpha v\|_1}{\|v\|_\infty}. \tag{12} $$

This identity shows that TCA of the Burt table is equivalent to a particular kind of CA of the complete disjunctive table based on the centroid decomposition. For the centroid decomposition, see Choulakian (2003, 2005, 2006b).

Suppose that we have calculated the factor scores of the categories, $s_\alpha$, and the dispersion measures, $\lambda_\alpha$, for $\alpha = 0, \ldots, k$, by TCA of the Burt table. Then we can calculate the scores of the response patterns of $Z$ by considering the rows of $Z$ as supplementary points in the following way:

$$ g_\alpha = D_r^{-1} Z_\alpha \, \mathrm{sgn}(s_\alpha), \tag{13} $$

where $D_r$ is the diagonal matrix whose diagonal elements are the row sums of $Z$, and

$$ Z_{\alpha+1} = Z_\alpha - D_r g_\alpha s_\alpha' D_c / \lambda_\alpha \quad \text{for } \alpha = 0, \ldots, k. \tag{14} $$

4 Example: Survey Evaluation data

Our example, taken from McCutcheon (1987) and reconsidered by, among others, van der Ark and van der Heijden (1998), involves four categorical variables from the 1982 General Social Survey. Two items are evaluations of surveys by white respondents (Y1 = Purpose and Y2 = Accuracy) and the other two are evaluations of these respondents by the interviewer (Y3 = Understanding and Y4 = Cooperation). Y1 has three categories: good, depends and waste. Y2 has two categories: mostly true and not true. Y3 has two categories: good and fair-poor. Y4 has three categories: interested, cooperative and hostile-impatient. McCutcheon (1987) classified the respondents by latent class analysis into three groups: ideals, believers and skeptics.
In the four-way cross-tabulation there are three zeros. By representing this data as a weighted indicator matrix, $T$, we obtain a contingency table of 33 = 36 − 3 rows representing the response patterns of the respondents on the four items and 10 columns representing the 10 categories of the four items. Table 1 displays the dispersion measures and the associated cumulative explained variation in % of MTCA and MCA. The first two dimensions of MTCA explain 17.61 percentage points more than the first two dimensions of MCA (61.39% versus 43.78%).

Table 1. Dispersion measures and cumulative proportions of explained dispersion in TCA and CA of Survey Evaluation data.

         MTCA               MCA                CA-Burt             TCA-Burt
α    λ²_α    CEV(α)     λ²_α    CEV(α)     λ²_α     CEV(α)     λ²_α     CEV(α)
1    0.154   35.57      0.371   24.73      0.1375   34.18      0.0655   48.84
2    0.112   61.39      0.286   43.78      0.0817   54.48      0.0377   76.96
3    0.093   83.05      0.251   60.48      0.0622   69.95      0.0169   89.55
4    0.064   97.76      0.249   77.06      0.0616   85.26      0.0134   99.51
5    0.006   99.10      0.181   89.10      0.0326   93.36      0.0004   99.81
6    0.004   100        0.164   100        0.0267   100        0.0003   100

Figures 1 and 2 display the biplots of the first two dimensions obtained by MTCA and MCA, respectively. In both figures the positions of the 10 categories are almost the same. We see three groupings of the categories: U1, C1, P1 and A1; U2, C2 and C3; and A2, P2 and P3. However, the positions of the 33 response patterns differ. In Figure 1, we clearly see 10 clusters of the 33 response patterns, while no such clustering is found in Figure 2. Nine of the cluster points are found on the perimeter of a parallelogram. The response pattern ijkl shows the ith category of the respondents’ Purpose, the jth category of the respondents’ Accuracy, the kth category of the interviewers’ evaluation of the Understanding of the respondent and the lth category of the interviewers’ evaluation of the Cooperation of the respondent.
The two parallel sides of the parallelogram 1111-1122 and 2211-2222 represent the respondents’ items, Purpose and Accuracy; and the two parallel sides of the parallelogram 1111-2211 and 1122-3222 represent the interviewers’ items, Understanding and Cooperation. We note that on the first axis there are 5 equidistant cluster points, ranging from the “ideals” on the extreme left to the “skeptics” on the extreme right. To have three groups of respondents, similar to McCutcheon’s analysis, we define the response patterns having at least three 1s to represent the “ideals”, the response patterns having exactly two 1s as the “believers”, and the response patterns having at most one 1 as the “skeptics”. In this case, the weight of the “ideals” is 70.38% compared with 61.9% as given by McCutcheon; the weight of the “believers” is 20.3% compared with McCutcheon’s value of 22.3%; and the weight of the “skeptics” is 9.32% compared with McCutcheon’s value of 15.8%.

Table 1 also displays the dispersion measures and the associated cumulative explained variation in % of TCA and CA of the Burt table. The first two dimensions of TCA explain 22.48 percentage points more than the first two dimensions of CA (76.96% versus 54.48%). Figure 3 displays the biplot of the first two dimensions of the TCA of the Burt table with the 33 response patterns of the indicator matrix $Z$ as supplementary points. Figure 3 looks like Figure 1 and has almost the same interpretation.

Fig. 1. MTCA of Survey Evaluation data.

Fig. 2. MCA of Survey Evaluation data.

Fig. 3. TCA of Burt table Survey Evaluation data.

5 Conclusion

We conclude with the following remarks.

First, MTCA and MCA can produce different results, because the geometries of these two methods are different: MTCA is based on the $L_1$ norm, while MCA is based on the Euclidean norm.

Second, factor scores obtained by CA of a disjunctive table and its associated Burt table are the same.
This is not true in the case of TCA: TCA of a Burt table is equivalent to a particular CA of the disjunctive table based on the centroid method.

Third, the response patterns in MTCA are represented as (number of variables + 1) equidistant cluster points on the first principal axis. In the example discussed in this paper, there are four variables, and the number of equidistant cluster response points is 5, as shown in Figures 1 and 3.

References

Benzecri, J.P. (1973). L’Analyse des Données: Vol. 2: L’analyse des Correspondances. Paris: Dunod.

Choulakian, V. (2003). The optimality of the centroid method. Psychometrika, 68, 473-475.

Choulakian, V. (2005). Transposition invariant principal component analysis in L1 for long tailed data. Statistics and Probability Letters, 71, 23-31.

Choulakian, V. (2006a). Taxicab correspondence analysis. Psychometrika, 71, 1-13.

Choulakian, V. (2006b). L1-norm projection pursuit principal component analysis. Computational Statistics and Data Analysis, 50, 1441-1451.

De Leeuw, J. and Michailidis, G. (2004). Weber correspondence analysis: The one-dimensional case. Journal of Computational and Graphical Statistics, 13, 946-953.

Gifi, A. (1990). Nonlinear Multivariate Analysis. New York: Wiley.

Greenacre, M.J. (1984). Theory and Applications of Correspondence Analysis. Academic Press.

McCutcheon, A.L. (1987). Latent Class Analysis. Beverly Hills, CA: Sage.

Nishisato, S. (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis. Hillsdale, NJ: Lawrence Erlbaum.

van der Ark, L.A. and van der Heijden, P.G.M. (1998). Graphical display of latent budget analysis and latent class analysis, with special reference to correspondence analysis. In Visualization of Categorical Data, ed. Blasius, J. and Greenacre, M. Academic Press, 489-508.