Generalized ANOVA for SDA

Vladimir Batagelj¹, Simona Korenjak-Černe²,*, Nataša Kejžar³
1. IMFM - Institute of Mathematics, Physics and Mechanics, Ljubljana, Slovenia
2. University of Ljubljana, Faculty of Economics, Ljubljana, Slovenia
3. University of Ljubljana, Faculty of Medicine, Ljubljana, Slovenia
* Contact author: simona.cerne@ef.uni-lj.si
Keywords: Generalized ANOVA, Generalized Ward’s method, Generalized Huygens theorem
Batagelj (1988) proved the generalized Huygens theorem, a basis for generalized ANOVA, for any
dissimilarity $d$. It is based on the generalized definition of the cluster error $p(C)$
\[ p(C) = \frac{1}{2\, w(C)} \sum_{X, Y \in C} w(X) \cdot w(Y) \cdot d(X, Y) \]
and on the following extension of the dissimilarity to the generalized center $\tilde{C}$ of a cluster $C$
\[ d(U, \tilde{C}) = d(\tilde{C}, U) = \frac{1}{w(C)} \Bigl( \sum_{X \in C} w(X) \cdot d(X, U) - p(C) \Bigr). \]
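As a familiar special case (a sketch, assuming that the units are points in a Euclidean space, that $d$ is the squared Euclidean distance, and that $w(C) = \sum_{X \in C} w(X)$; none of this is required by the general definitions), both formulas reduce to their classical counterparts, with $\bar{c}$ denoting the weighted mean of $C$:

```latex
\begin{align*}
p(C) &= \frac{1}{2\, w(C)} \sum_{X, Y \in C} w(X)\, w(Y)\, \|X - Y\|^2
      = \sum_{X \in C} w(X)\, \|X - \bar{c}\|^2 , \\
d(U, \tilde{C}) &= \frac{1}{w(C)} \Bigl( \sum_{X \in C} w(X)\, \|X - U\|^2 - p(C) \Bigr)
      = \|U - \bar{c}\|^2 .
\end{align*}
```

In this setting the generalized center plays the role of the ordinary centroid, and $p(C)$ is the usual weighted within-cluster sum of squares.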
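The generalized Huygens theorem takes the form
\[ I = p(E) = \sum_{C \in \mathbf{C}} p(C) + \sum_{C \in \mathbf{C}} w(C) \cdot d(\tilde{C}, \tilde{E}) = I_W + I_B , \]
where $E$ is the set of all units and $\mathbf{C}$ is a clustering of $E$.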
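The decomposition can be checked numerically on a small example. The following is a minimal sketch, not the authors' implementation: the function names, the toy dissimilarity matrix, and the weights are hypothetical, the weight of a cluster is assumed to be the sum of its unit weights, and the dissimilarity between two generalized centers is assumed to be obtained by applying the same extension formula with U replaced by a center.

```python
def total_weight(w, C):
    """w(C), assumed here to be the sum of the unit weights in C."""
    return sum(w[x] for x in C)


def cluster_error(d, w, C):
    """p(C) = 1 / (2 w(C)) * sum over X, Y in C of w(X) w(Y) d(X, Y)."""
    s = sum(w[x] * w[y] * d[x][y] for x in C for y in C)
    return s / (2.0 * total_weight(w, C))


def unit_to_center(d, w, u, C):
    """d(U, C~) = (sum over X in C of w(X) d(X, U) - p(C)) / w(C)."""
    s = sum(w[x] * d[x][u] for x in C)
    return (s - cluster_error(d, w, C)) / total_weight(w, C)


def center_to_center(d, w, C, E):
    """d(C~, E~): the same extension with U = C~ (an assumption of this sketch)."""
    s = sum(w[x] * unit_to_center(d, w, x, C) for x in E)
    return (s - cluster_error(d, w, E)) / total_weight(w, E)


def inertia_decomposition(d, w, clusters):
    """Return (I, I_W, I_B) for a clustering given as lists of unit indices."""
    E = [x for C in clusters for x in C]
    total = cluster_error(d, w, E)
    within = sum(cluster_error(d, w, C) for C in clusters)
    between = sum(total_weight(w, C) * center_to_center(d, w, C, E)
                  for C in clusters)
    return total, within, between


# Hypothetical toy data: a symmetric dissimilarity matrix that is not a metric.
D = [
    [0, 1, 4, 9, 7],
    [1, 0, 2, 6, 5],
    [4, 2, 0, 3, 8],
    [9, 6, 3, 0, 2],
    [7, 5, 8, 2, 0],
]
W = [1, 2, 1, 1, 3]
CLUSTERS = [[0, 1], [2, 3, 4]]

I_TOTAL, I_W, I_B = inertia_decomposition(D, W, CLUSTERS)
print(I_TOTAL, I_W + I_B)  # the two values should agree up to rounding
```

Under these assumptions the identity follows from the definitions alone, so the check does not require the triangle inequality; only the sign of the individual terms depends on it.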
Studer et al. (2011) used this result to generalize ANOVA to sets of sequences; their method is
implemented in the R package TraMineR. As the dissimilarity they use optimal matching and therefore
call the generalized variance a discrepancy. They also raised the problem of the nonnegativity of the
dissimilarity $d(U, \tilde{C})$. Batagelj (1988) showed that $d(U, \tilde{C})$ is nonnegative if the
dissimilarity $d$ between units satisfies the triangle inequality. A dissimilarity $d$ that is not a
metric can be transformed into one using the power transformation (Joly and Le Calvé, 1986).
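A minimal sketch of the power transformation on a finite dissimilarity matrix follows. The function names and the toy matrix are hypothetical, and the sketch assumes a symmetric matrix with zero diagonal and strictly positive off-diagonal entries. It raises every entry to a power k and uses bisection on the interval (0, 1] to estimate the largest exponent for which the triangle inequality still holds.

```python
def satisfies_triangle(d, tol=1e-12):
    """True if d(x, z) <= d(x, y) + d(y, z) for all triples (up to tol)."""
    n = len(d)
    return all(d[x][z] <= d[x][y] + d[y][z] + tol
               for x in range(n) for y in range(n) for z in range(n))


def power_transform(d, k):
    """Entrywise power transformation d -> d^k, for k > 0."""
    return [[v ** k for v in row] for row in d]


def largest_metric_exponent(d, iterations=60):
    """Estimate by bisection the largest k in (0, 1] with d^k a metric.

    Returns 1.0 if d already satisfies the triangle inequality; larger
    exponents are not searched in this sketch.
    """
    if satisfies_triangle(d):
        return 1.0
    lo, hi = 0.0, 1.0  # d^k tends to the discrete metric as k -> 0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if satisfies_triangle(power_transform(d, mid)):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical example: squared distances between the points 0, 1, 2 on a line.
D = [
    [0, 1, 4],
    [1, 0, 1],
    [4, 1, 0],
]
K = largest_metric_exponent(D)
D_METRIC = power_transform(D, K)
print(K)  # close to 0.5, since 4**k <= 1 + 1 exactly when k <= 0.5
```

The bisection relies on the exponents that yield a metric forming an interval starting at 0, so that a single threshold separates valid from invalid exponents.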
In this paper we study a possible adaptation of the generalized ANOVA to symbolic data analysis.
Batagelj et al. (2015) showed that the generalized Huygens theorem holds for the first of
six proposed dissimilarities. Here, we propose a more general approach that can be used for any
dissimilarity. According to Joly and Le Calvé (1986), for each dissimilarity $d$ there exists a positive
real number $p$, called the metric index, such that $d^k$ is a metric for $k \le p$ and is not a metric
for $k > p$. Therefore, if the triangle inequality does not hold for the selected dissimilarity $d$, its
metric index is less than 1. For such dissimilarities, we can find their metric index and use the
transformed dissimilarity in the generalized formulas. In these cases the generalized Huygens theorem
can be used. We further follow the procedure proposed by Studer et al. (2011): the part of the
discrepancy that is explained by differences between clusters is measured with
\[ R^2 = \frac{I_B}{I} . \]
Alternatively, the inertia between clusters is compared with the inertia within them using
\[ F = \frac{I_B / (m - 1)}{I_W / (n - m)} , \]
where $m$ is the number of clusters and $n$ is the number of units. Since in the general case the
distribution of $F$ is not the F-distribution, we use a Monte Carlo approach to test statistical
significance. The proposed approach will be demonstrated on real-life data.
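The two statistics and a Monte Carlo test can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the abstract does not specify the Monte Carlo scheme, so a random permutation of the cluster labels is assumed here, the between-cluster inertia is obtained as I minus I_W via the generalized Huygens theorem, and all names and data are hypothetical.

```python
import random


def cluster_error(d, w, C):
    """p(C) = 1 / (2 w(C)) * sum over X, Y in C of w(X) w(Y) d(X, Y)."""
    total = sum(w[x] for x in C)
    return sum(w[x] * w[y] * d[x][y] for x in C for y in C) / (2.0 * total)


def discrepancy_stats(d, w, labels):
    """Return (R2, F) for the clustering encoded by one label per unit.

    Assumes at least two clusters, n > m, and a nonzero within inertia.
    """
    n = len(labels)
    groups = sorted(set(labels))
    m = len(groups)
    inertia = cluster_error(d, w, list(range(n)))
    within = sum(
        cluster_error(d, w, [i for i in range(n) if labels[i] == g])
        for g in groups
    )
    between = inertia - within  # I_B = I - I_W by the Huygens theorem
    r2 = between / inertia
    f = (between / (m - 1)) / (within / (n - m))
    return r2, f


def permutation_p_value(d, w, labels, n_perm=999, seed=0):
    """Monte Carlo p-value: share of label permutations with F >= observed F."""
    rng = random.Random(seed)
    _, f_obs = discrepancy_stats(d, w, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, f_perm = discrepancy_stats(d, w, shuffled)
        if f_perm >= f_obs - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # the observed labelling counts once


# Hypothetical toy data: squared distances between six points on a line.
POINTS = [0, 1, 2, 10, 11, 12]
D = [[(a - b) ** 2 for b in POINTS] for a in POINTS]
W = [1] * len(POINTS)
LABELS = [0, 0, 0, 1, 1, 1]

R2, F = discrepancy_stats(D, W, LABELS)
P = permutation_p_value(D, W, LABELS)
print(R2, F, P)
```

With only six units there are few distinct labellings, so the p-value cannot become very small here; on real data the number of permutations and the random seed would be chosen by the analyst.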
References
Batagelj, V. (1988). Generalized Ward and Related Clustering Problems. Classification and Related Methods of Data Analysis, 67–74.
Batagelj, V., Korenjak-Černe, S., and Kejžar, N. (2015). Clustering of Modal Valued Data. Draft
of a chapter in Brito, P. (ed.) Analysis of Distributional Data.
Joly, S., and Le Calvé, G. (1986). Etude des puissances d’une distance. Statistique et Analyse de
Données, 11, 30–50.
Studer, M., Ritschard, G., Gabadinho, A., and Müller, N. S. (2011). Discrepancy Analysis of State
Sequences. Sociological Methods and Research, 40(3), 471–510.