Multiple taxicab correspondence analysis
Choulakian, V.1
Université de Moncton, Moncton, N.B., E1A 3E9, Canada. choulav@umoncton.ca
Summary. We compare the statistical analysis of indicator matrices and Burt tables by correspondence analysis (CA) and taxicab correspondence analysis (TCA).
There are two new results in this paper. First, TCA of a Burt table corresponds to a
particular kind of CA of the indicator matrix based on the centroid decomposition.
Second, the response patterns in multiple TCA will be represented as (number of variables + 1) equidistant cluster points on the first principal axis.
Key words: Indicator matrix; supplementary points; Burt table; response pattern;
correspondence analysis; taxicab correspondence analysis; centroid method; matrix
norms.
1 Introduction
Usually the analysis of indicator matrices is done by multiple correspondence
analysis (Benzecri, 1973; Greenacre, 1984), also named dual scaling (Nishisato,
1994) or homogeneity analysis (Gifi, 1990). The aim of this paper is to compare the statistical analysis of indicator matrices by correspondence analysis
(CA) and taxicab correspondence analysis (TCA). TCA is an L1 version of CA
recently proposed by Choulakian (2006a). In section 2 we briefly present a
mathematical description of TCA. In section 3 we present the novel results
of this paper: multiple taxicab correspondence analysis (MTCA) of indicator matrices and the analysis of Burt tables by TCA. Section 4 compares
both methods on a well known data set. In section 5 we conclude with some
remarks.
There are two novel results in this paper. First, TCA of a Burt table corresponds to a particular kind of CA of the disjunctive table based on the centroid decomposition. Second, the response patterns in MTCA will be represented as (number of variables + 1) equidistant cluster points on the first principal axis.
2 Taxicab correspondence analysis
Let P = T/n be a correspondence matrix, where T of dimension r × c is a contingency table and n = Σ_{i=1}^{r} Σ_{j=1}^{c} T_{ij} is the grand total of T. We define p_{i.} = Σ_{j=1}^{c} p_{ij}, p_{.j} = Σ_{i=1}^{r} p_{ij}, D_r = Diag(p_{i.}) a diagonal matrix having diagonal elements p_{i.}, and similarly D_c = Diag(p_{.j}). The q-th vector norm of a vector v = (v_1, ..., v_m)′ is defined to be ||v||_q = (Σ_{i=1}^{m} |v_i|^q)^{1/q} for q ≥ 1, and ||v||_∞ = max_i |v_i|. Let k = rank(P) − 1.
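To fix the notation in code, the following is a minimal sketch (my own, with a made-up table T) of the quantities just defined, using numpy.

import numpy as np

# A small illustrative contingency table T (hypothetical counts).
T = np.array([[10., 5., 3.],
              [ 2., 8., 6.],
              [ 4., 1., 9.]])

n = T.sum()                      # grand total of T
P = T / n                        # correspondence matrix P = T/n
p_r = P.sum(axis=1)              # row masses p_i.
p_c = P.sum(axis=0)              # column masses p_.j
D_r = np.diag(p_r)               # D_r = Diag(p_i.)
D_c = np.diag(p_c)               # D_c = Diag(p_.j)

# The two vector norms used in TCA: the L1 norm and the sup norm.
v = np.array([1., -1., 1.])
l1_norm  = np.abs(v).sum()       # ||v||_1
sup_norm = np.abs(v).max()       # ||v||_inf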
In TCA the calculation of the dispersion measures λ_α, principal axes v_α and u_α, and principal factor scores g_α and s_α, for α = 0, 1, ..., k, is done in a stepwise manner. We put P_0 = P. Let P_α be the residual correspondence matrix at the α-th iteration.
The variational definitions of the TCA at the α-th iteration are

λ_α = max_v ||P_α v||_1 / ||v||_∞ = max_u ||P_α′ u||_1 / ||u||_∞ = max_{u,v} u′ P_α v / (||u||_∞ ||v||_∞)
    = max ||P_α v||_1 subject to v_j = 1 or −1 for j = 1, ..., c
    = max ||P_α′ u||_1 subject to u_i = 1 or −1 for i = 1, ..., r.     (1)
Let

v_α = arg max_{v_j = ±1} ||P_α v||_1,     (2)

u_α = arg max_{u_i = ±1} ||P_α′ u||_1.     (3)
Then the transition formulas are

g_α = D_r^{-1} P_α v_α,     (4)

s_α = D_c^{-1} P_α′ u_α,     (5)

u_α = sgn(g_α),     (6)

v_α = sgn(s_α),     (7)
where sgn(.) is the coordinatewise sign function, sgn(x) = 1 if x > 0, and
sgn(x) = −1 if x ≤ 0.
The α-th taxicab dispersion measure can be represented in many different ways:

λ_α = ||P_α v_α||_1 = ||D_r g_α||_1 = u_α′ D_r g_α
    = ||P_α′ u_α||_1 = ||D_c s_α||_1 = v_α′ D_c s_α.     (8)
The (α + 1)-th residual correspondence matrix is

P_{α+1} = P_α − D_r g_α s_α′ D_c / λ_α.     (9)
Similar to the ordinary CA, the total dispersion is defined to be Σ_{α=1}^{k} λ_α², and the proportion of the explained variation by the α-th principal axis is λ_α² / Σ_{β=1}^{k} λ_β², and the cumulative explained variation is CEV(α) = Σ_{γ=1}^{α} λ_γ² / Σ_{β=1}^{k} λ_β², for α = 0, 1, ..., k.
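As a small illustration of the last formula, the sketch below computes CEV(α) from squared dispersions; the values are the MTCA column reported later in Table 1, so the output reproduces its CEV column up to the rounding of the λ_α².

import numpy as np

lam2 = np.array([0.154, 0.112, 0.093, 0.064, 0.006, 0.004])  # lambda_alpha^2 (MTCA, Table 1)
cev = 100 * np.cumsum(lam2) / lam2.sum()                     # CEV(alpha) in %, alpha = 1, ..., k
print(np.round(cev, 2))   # approximately the CEV column of Table 1 (small gaps come from rounding)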
We note that
P_1 = P − p_r p_c′;     (10)

that is, the best rank one approximation of P is given by (p_{i.} p_{.j}), which is the correspondence matrix obtained under the independence assumption between the row and column variables. This solution is considered trivial both here and in CA. The reconstitution formula in TCA and CA is

p_{ij} = p_{i.} p_{.j} [1 + Σ_{α=1}^{k} g_α(i) s_α(j) / λ_α].     (11)
The calculation of the principal scores and the principal component weights of TCA can be accomplished by two algorithms. The first one is based on complete enumeration using (2) or (3). The second one is based on iterating the transition formulae (6,7,8,9), which is similar to the reciprocal averaging algorithm used in CA. This is an ascent algorithm. The iterative algorithm could converge to a local maximum, so it should be restarted from multiple initial points; the rows or the columns of the data can be used as initial values. More technical details about TCA and a deeper comparison between TCA and CA are given in Choulakian (2006a).
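The following is a rough, self-contained sketch (my own numpy code and variable names, on a made-up table, not the author's implementation) of the iterative algorithm just described: it alternates the transition formulas until the sign vector stabilizes, then deflates with the residual formula (9).

import numpy as np

def tca_step(P_alpha, p_r, p_c, v0, max_iter=100):
    """One TCA axis from the residual matrix P_alpha (an ascent on ||P_alpha v||_1)."""
    v = np.where(np.asarray(v0, dtype=float) >= 0, 1.0, -1.0)
    for _ in range(max_iter):
        g = (P_alpha @ v) / p_r                 # (4): g_alpha = D_r^{-1} P_alpha v
        u = np.where(g > 0, 1.0, -1.0)          # (6): u_alpha = sgn(g_alpha)
        s = (P_alpha.T @ u) / p_c               # (5): s_alpha = D_c^{-1} P_alpha' u
        v_new = np.where(s > 0, 1.0, -1.0)      # (7): v_alpha = sgn(s_alpha)
        if np.array_equal(v_new, v):
            break                               # the sign pattern has stabilized
        v = v_new
    lam = np.abs(P_alpha @ v).sum()             # (8): lambda_alpha = ||P_alpha v_alpha||_1
    # (9): residual matrix for the next axis, P_{alpha+1} = P_alpha - D_r g s' D_c / lambda
    P_next = P_alpha - np.outer(p_r * g, p_c * s) / lam
    return lam, g, s, P_next

# Illustrative use on a made-up 3x3 table; in practice one restarts from the sign
# patterns of several rows or columns and keeps the solution with the largest lambda.
T = np.array([[10., 5., 3.], [2., 8., 6.], [4., 1., 9.]])
P = T / T.sum(); p_r = P.sum(axis=1); p_c = P.sum(axis=0)
P1 = P - np.outer(p_r, p_c)                     # trivial solution removed, eq. (10)
lam1, g1, s1, P2 = tca_step(P1, p_r, p_c, v0=np.sign(P1[0]))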
3 Multiple taxicab correspondence analysis and TCA of
Burt tables
Let Z be a complete disjunctive table of p categorical variables Z_1, Z_2, ..., Z_p with respectively m_1, m_2, ..., m_p modalities observed over a sample of n individuals. CA of the super indicator matrix Z of dimension n × Σ_{i=1}^{p} m_i is named multiple correspondence analysis (MCA), or homogeneity analysis, or dual scaling. Similarly, the application of TCA to the super indicator matrix Z will be named multiple taxicab correspondence analysis (MTCA). A novel result of MTCA is the following.
Theorem 1. The response patterns in MTCA will be represented as (number of variables + 1) equidistant cluster points on the first principal axis.
This theorem should be compared with the Weber correspondence analysis of De Leeuw and Michailidis (2004), who showed that one-dimensional Weber correspondence analysis is a combinatorial optimization problem and that the scores g_1(i) take exactly two values, one negative and one positive. Note that the above theorem concerns only the scores g_1(i).
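To make the data structure concrete, here is a small helper (my own sketch, not from the paper) that builds a complete disjunctive table Z from integer-coded categorical variables; the Burt table B = Z′Z used in Section 3.1 below follows directly from it.

import numpy as np

def disjunctive_table(X, n_categories):
    """Complete disjunctive (indicator) table Z from integer-coded data.
    X is an (n, p) array with X[i, j] in {0, ..., m_j - 1};
    n_categories is the list (m_1, ..., m_p)."""
    n, p = X.shape
    blocks = []
    for j, m_j in enumerate(n_categories):
        Zj = np.zeros((n, m_j))
        Zj[np.arange(n), X[:, j]] = 1.0      # one indicator column per modality
        blocks.append(Zj)
    return np.hstack(blocks)                 # n x sum(m_j); every row sums to p

# Tiny example: 4 individuals, two variables with 3 and 2 modalities.
X = np.array([[0, 1], [2, 0], [1, 1], [0, 0]])
Z = disjunctive_table(X, [3, 2])
B = Z.T @ Z                                  # Burt table (Section 3.1)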
3.1 TCA of Burt table
MCA of Z is equivalent to CA of the Burt table B = Z′Z. The two analyses produce the same factor scores of the modalities, but the eigenvalues in MCA of Z equal the square roots of the eigenvalues of the CA of the associated Burt table.
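A quick numerical check of this classical fact (my own sketch, using a plain SVD-based CA): the CA singular values of B = Z′Z are the squares of those of Z, which is equivalent to the eigenvalue relation just stated.

import numpy as np

def ca_singular_values(N):
    """Singular values of the standardized residual matrix used in CA."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False)

Z = np.array([[1, 0, 0, 0, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)   # a small disjunctive table (2 variables)
B = Z.T @ Z
sv_Z = ca_singular_values(Z)
sv_B = ca_singular_values(B)
assert np.allclose(sv_B[:len(sv_Z)], sv_Z ** 2)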
MTCA of Z is not equivalent to TCA of the Burt table B = Z′Z. In fact, we have the following optimization identity based on matrix norms:

max_v ||B_α v||_1 / ||v||_∞ = max_v ||Z_α v||_2² / ||v||_∞.     (12)
This identity shows that TCA of the Burt table is equivalent to a particular
kind of CA of the complete disjunctive table based on the centroid decomposition. For the centroid decomposition, see Choulakian (2003, 2005, 2006b).
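As a small numerical illustration in the spirit of identity (12) (my own check, which omits the CA weighting and so is only an analogy): for a positive semi-definite B = Z′Z, complete enumeration over sign vectors gives the same maximum for ||Bv||_1 as for ||Zv||_2².

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(8, 4)).astype(float)   # toy 0/1 matrix
B = Z.T @ Z                                          # positive semi-definite

best_l1 = max(np.abs(B @ np.array(v)).sum() for v in product([-1.0, 1.0], repeat=4))
best_sq = max(((Z @ np.array(v)) ** 2).sum() for v in product([-1.0, 1.0], repeat=4))
assert np.isclose(best_l1, best_sq)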
Suppose that we have calculated the factor scores of the categories, s_α, and the dispersion measures, λ_α, for α = 0, ..., k, by TCA of the Burt table. Then we can calculate the scores of the response patterns of Z by considering the rows of Z as supplementary points in the following way:

g_α = D_r^{-1} Z_α sgn(s_α),     (13)

where D_r is the diagonal matrix having as elements the sums of the rows of Z, and

Z_{α+1} = Z_α − D_r g_α s_α′ D_c / λ_α   for α = 0, ..., k.     (14)
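The sketch below (my own code; in particular, taking D_c as the column masses of Z is an assumption of this sketch, mirroring the deflation step (9), since the paper does not respecify D_c here) implements the supplementary-point formulas (13)-(14).

import numpy as np

def supplementary_row_scores(Z, s_list, lam_list):
    """Scores of the response patterns (rows of Z) as supplementary points,
    given category scores s_alpha and dispersions lambda_alpha from TCA of
    the Burt table; a rough sketch of equations (13)-(14)."""
    Z_alpha = np.asarray(Z, dtype=float).copy()
    d_r = Z_alpha.sum(axis=1)                   # row sums of Z (diagonal of D_r)
    d_c = Z_alpha.sum(axis=0) / Z_alpha.sum()   # assumed D_c: column masses of Z
    g_list = []
    for s, lam in zip(s_list, lam_list):
        sgn_s = np.where(np.asarray(s) > 0, 1.0, -1.0)
        g = (Z_alpha @ sgn_s) / d_r             # (13): g_alpha = D_r^{-1} Z_alpha sgn(s_alpha)
        g_list.append(g)
        # (14): deflate before computing the next axis
        Z_alpha = Z_alpha - np.outer(d_r * g, d_c * np.asarray(s)) / lam
    return g_list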
4 Example: Survey Evaluation data
Our example, taken from McCutcheon (1987) and reconsidered by, among others, van der Ark and van der Heijden (1998), involves four categorical variables from the 1982 General Social Survey. Two items are evaluations
of surveys by white respondents (Y1 = Purpose and Y2 = Accuracy) and
the other two are evaluations of these respondents by the interviewer (Y3 =
Understanding and Y4 = Cooperation). Y1 has three categories: good, depends and waste. Y2 has two categories: mostly true and not true. Y3 has two
categories: good and fair-poor. And Y4 has three categories: interested, cooperative and hostile-impatient. McCutcheon (1987) classified the respondents
by latent class analysis into three groups: ideals, believers and skeptics. In the
four-way cross-tabulation there are three zeros. By representing this data as
a weighted indicator matrix, T, we obtain a contingency table of 33 = 36 − 3
rows representing the response patterns of the respondents on the four items
and 10 columns representing the 10 categories of the four items.
Table 1 displays the dispersion measures and the associated cumulative
explained variation in % of MTCA and MCA. We clearly see that the first
two dimensions of MTCA explain 17.61% more than the first two dimensions of MCA. Figures 1 and 2 display the biplots of the first two dimensions
obtained by MTCA and MCA, respectively. In both figures the positions of the 10 categories are almost the same; we see three groupings of the categories: U1, C1, P1 and A1; U2, C2 and C3; and A2, P2 and P3. However,
the positions of the 33 response patterns differ. In Figure 1, we clearly see 10 clusters of the 33 response patterns, while no such clustering is found in Figure 2.
Table 1. Dispersion measures and cumulative proportions of explained dispersion in TCA and CA of Survey Evaluation data.

         MTCA              MCA              CA-Burt            TCA-Burt
α    λ_α²   CEV(α)     λ_α²   CEV(α)     λ_α²    CEV(α)     λ_α²    CEV(α)
1    0.154   35.57     0.371   24.73     0.1375   34.18     0.0655   48.84
2    0.112   61.39     0.286   43.78     0.0817   54.48     0.0377   76.96
3    0.093   83.05     0.251   60.48     0.0622   69.95     0.0169   89.55
4    0.064   97.76     0.249   77.06     0.0616   85.26     0.0134   99.51
5    0.006   99.10     0.181   89.10     0.0326   93.36     0.0004   99.81
6    0.004  100        0.164  100        0.0267  100        0.0003  100
Nine of the cluster points are found on the perimeter of a parallelogram. The response pattern ijkl shows the ith category of the respondents’
Purpose, the jth category of the respondent’s Accuracy, the kth category of
the interviewers’ evaluation of the Understanding of the respondent and the
lth category of the interviewers’ evaluation of the Cooperation of the respondent. The two parallel sides of the parallelogram 1111-1122 and 2211-2222
represent the respondents’ items, Purpose and Accuracy; and the two parallel
sides of the parallelogram 1111-2211 and 1122-3222 represent the interviewers’
items, Understanding and Cooperation. We note that on the first axis there
are 5 equidistant cluster points delineating the “ideals” on the extreme left
to the “skeptics” on the extreme right. To have three groups of respondents,
similar to McCutcheon’s analysis, we define the response patterns having at
least three 1s to represent the “ideals”, the response patterns having exactly
two ones as the “believers”, and the response patterns having at most one 1
as the “skeptics”. In this case, the weight of the “ideals” is 70.38% compared
with 61.9% as given by McCutcheon; the weight of the “believers” is 20.3%
compared with McCutcheon’s value of 22.3%; and the weight of the “skeptics”
is 9.32% compared with McCutcheon’s value of 15.8%.
Table 1 also displays the dispersion measures and the associated cumulative explained variation in % of TCA and CA of the Burt table. We clearly
see that the first two dimensions of TCA explain 22.48% more than the first
two dimensions of CA. Figure 3 displays the biplot of the first two dimensions
of the TCA of the Burt table with the 33 response patterns of the indicator
matrix Z as supplementary points. Figure 3 looks like Figure 1 and has almost
the same interpretation.
Fig. 1. MTCA of Survey Evaluation data.
Fig. 2. MCA of Survey Evaluation data.
Fig. 3. TCA of the Burt table of Survey Evaluation data.

5 Conclusion
We conclude with the following remarks.
First, MTCA and MCA can produce different results, because the geometries of these two methods are different: MTCA is based on the L1 norm, while MCA is based on the Euclidean norm.
Second, the factor scores obtained by CA of a disjunctive table and of its associated Burt table are the same. This is not true in the case of TCA: TCA of a Burt table is equivalent to a particular CA of the disjunctive table based on the centroid method.
Third, the response patterns in MTCA will be represented as (number of variables + 1) equidistant cluster points on the first principal axis. In the example discussed in this paper, there are four variables, and the number of equidistant cluster response points is 5, as shown in Figures 1 and 3.
References
[Ben73]  Benzécri, J.-P. (1973). L'Analyse des Données: Vol. 2: L'Analyse des Correspondances. Paris: Dunod.
[Cho03]  Choulakian, V. (2003). The optimality of the centroid method. Psychometrika, 68, 473-475.
[Cho05]  Choulakian, V. (2005). Transposition invariant principal component analysis in L1 for long tailed data. Statistics and Probability Letters, 71, 23-31.
[Cho06a] Choulakian, V. (2006a). Taxicab correspondence analysis. Psychometrika, 71, 1-13.
[Cho06b] Choulakian, V. (2006b). L1-norm projection pursuit principal component analysis. Computational Statistics and Data Analysis, 50, 1441-1451.
[DeL04]  De Leeuw, J. and Michailidis, G. (2004). Weber correspondence analysis: the one-dimensional case. Journal of Computational and Graphical Statistics, 13, 946-953.
[Gif90]  Gifi, A. (1990). Nonlinear Multivariate Analysis. New York: Wiley.
[Gre84]  Greenacre, M.J. (1984). Theory and Applications of Correspondence Analysis. Academic Press.
[McC87]  McCutcheon, A.L. (1987). Latent Class Analysis. Beverly Hills, CA: Sage.
[Nis94]  Nishisato, S. (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis. Hillsdale, NJ: Lawrence Erlbaum.
[VV98]   van der Ark, L.A. and van der Heijden, P.G.M. (1998). Graphical display of latent budget analysis and latent class analysis, with special reference to correspondence analysis. In: Blasius, J. and Greenacre, M. (eds.), Visualization of Categorical Data. Academic Press, 489-508.