Generalized ANOVA for SDA

Vladimir Batagelj¹, Simona Korenjak-Černe², Nataša Kejžar³

1. IMFM - Institute of Mathematics, Physics and Mechanics, Ljubljana, Slovenia
2. University of Ljubljana, Faculty of Economics, Ljubljana, Slovenia
3. University of Ljubljana, Faculty of Medicine, Ljubljana, Slovenia

Contact author: simona.cerne@ef.uni-lj.si

Keywords: Generalized ANOVA, Generalized Ward's method, Generalized Huygens theorem

Batagelj (1988) proved the generalized Huygens theorem for any dissimilarity $d$, which provides a basis for generalized ANOVA. It is based on the generalized definition of the cluster error $p(C)$,

\[ p(C) = \frac{1}{2\,w(C)} \sum_{X,Y \in C} w(X)\, w(Y)\, d(X,Y), \]

and on the following extension of the dissimilarity to the generalized center $\tilde{C}$ of a cluster $C$:

\[ d(U,\tilde{C}) = d(\tilde{C},U) = \frac{1}{w(C)} \Bigl( \sum_{X \in C} w(X)\, d(X,U) - p(C) \Bigr). \]

For a clustering $\mathbf{C}$ of the set of units $E$, the generalized Huygens theorem takes the form

\[ I = p(E) = \sum_{C \in \mathbf{C}} p(C) + \sum_{C \in \mathbf{C}} w(C)\, d(\tilde{C},\tilde{E}) = I_W + I_B . \]

Studer et al. (2011) used this result to generalize ANOVA to sets of sequences; their approach is implemented in the R package TraMineR. As the dissimilarity they use optimal matching, and they therefore call the generalized variance a discrepancy. They also pointed out the problem of the nonnegativity of the dissimilarity $d(U,\tilde{C})$. Batagelj (1988) showed that $d(U,\tilde{C})$ is nonnegative if the dissimilarity $d$ between units satisfies the triangle inequality. If a dissimilarity $d$ is not a metric, it can be transformed into one using the power transformation (Joly and Le Calvé, 1986).

In this paper we study a possible adaptation of generalized ANOVA to symbolic data analysis. Batagelj et al. (2015) showed that the generalized Huygens theorem holds for the first of six proposed dissimilarities. Here, we propose a more general approach that can be used for any dissimilarity. According to Joly and Le Calvé (1986), for each dissimilarity $d$ there exists a positive real number $p$, called the metric index, such that $d^k$ is a metric for $k \le p$ and is not a metric for $k > p$.
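The cluster error, the extended dissimilarity to a generalized center, and the effect of the power transformation can be checked numerically on a small dissimilarity matrix. The following Python sketch is only an illustration under stated assumptions: the function names (`cluster_error`, `huygens_terms`, `satisfies_triangle_inequality`) and the toy data are hypothetical, and the dissimilarity between two generalized centers is assumed to be obtained by applying the extension formula a second time with U set to the center of E.

```python
import numpy as np


def cluster_error(D, w, idx):
    """Generalized cluster error p(C) for the units indexed by idx."""
    wc = w[idx]
    return float(wc @ D[np.ix_(idx, idx)] @ wc) / (2.0 * wc.sum())


def huygens_terms(D, w, labels):
    """Return (I, I_W, I_B); I_B is computed from d(C~, E~), not as I - I_W."""
    total = cluster_error(D, w, np.arange(len(w)))   # I = p(E)
    # d(X, E~) for every unit X: the extension formula with C = E, U = X.
    d_unit_to_e = (D @ w - total) / w.sum()
    within = between = 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        p_c = cluster_error(D, w, idx)
        # Assumption: d(C~, E~) is the same extension applied with U = E~.
        d_c_to_e = (w[idx] @ d_unit_to_e[idx] - p_c) / w[idx].sum()
        within += p_c
        between += w[idx].sum() * d_c_to_e
    return total, within, between


def satisfies_triangle_inequality(D, tol=1e-12):
    """Check D[i, k] <= D[i, j] + D[j, k] for all triples (i, j, k)."""
    detour = D[:, :, None] + D[None, :, :]   # detour[i, j, k] = D[i, j] + D[j, k]
    return bool(np.all(D[:, None, :] <= detour + tol))


# Toy data (hypothetical): five points on a line, unit weights, two clusters.
x = np.array([0.0, 1.0, 2.0, 5.0, 6.0])
w = np.ones_like(x)
labels = np.array([0, 0, 0, 1, 1])

D_sq = (x[:, None] - x[None, :]) ** 2   # squared Euclidean: not a metric
D_pow = D_sq ** 0.5                     # power transformation with k = 1/2

total, within, between = huygens_terms(D_sq, w, labels)
print(total, within, between)
print(satisfies_triangle_inequality(D_sq), satisfies_triangle_inequality(D_pow))
```

On this toy data the decomposition of the total inertia into within and between parts holds for both matrices, while only the power-transformed matrix passes the triangle-inequality check. The check is empirical: it says whether a given power k works on the sample at hand and does not by itself determine the metric index of the dissimilarity.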
Therefore, if the triangle inequality does not hold for the selected dissimilarity $d$, its metric index is less than 1. For such dissimilarities, we can determine the metric index and use the transformed dissimilarity in the generalized formulas. In these cases the generalized Huygens theorem can be used.

We then follow the procedure proposed in Studer et al. (2011): the part of the discrepancy that is explained by differences between clusters is measured with

\[ R^2 = \frac{I_B}{I}. \]

Alternatively, the inertia between clusters is compared with the inertia within them using

\[ F = \frac{I_B/(m-1)}{I_W/(n-m)}, \]

where $m$ is the number of clusters and $n$ is the number of units. Since in the general case the distribution of $F$ is not the F-distribution, we use a Monte Carlo approach to test statistical significance. The proposed approach will be demonstrated on real-life data.

References

Batagelj, V. (1988). Generalized Ward and Related Clustering Problems. Classification and Related Methods of Data Analysis, 67–74.

Batagelj, V., Korenjak-Černe, S., and Kejžar, N. (2015). Clustering of Modal Valued Data. Draft of a chapter in Brito, P. (ed.), Analysis of Distributional Data.

Joly, S., and Le Calvé, G. (1986). Etude des puissances d'une distance. Statistique et Analyse de Données, 11, 30–50.

Studer, M., Ritschard, G., Gabadinho, A., and Müller, N. S. (2011). Discrepancy Analysis of State Sequences. Sociological Methods and Research, 40(3), 471–510.
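As an illustration of the R2 and F statistics and of the Monte Carlo test described in the text, the following Python sketch computes both statistics from a dissimilarity matrix and estimates a p-value. It is a minimal sketch under assumptions that the abstract does not state: unit weights, and a Monte Carlo scheme that randomly permutes the cluster labels. The function names and the toy data are hypothetical, and the code is not the TraMineR implementation.

```python
import numpy as np


def cluster_error(D, idx):
    """p(C) with unit weights: sum of pairwise d within C, divided by 2|C|."""
    return D[np.ix_(idx, idx)].sum() / (2.0 * len(idx))


def pseudo_r2_f(D, labels):
    """Return (R2, F) for the clustering given by labels."""
    n = len(labels)
    groups = np.unique(labels)
    m = len(groups)
    total = cluster_error(D, np.arange(n))                       # I
    within = sum(cluster_error(D, np.flatnonzero(labels == g))   # I_W
                 for g in groups)
    between = total - within             # I_B, by the generalized Huygens theorem
    r2 = between / total
    f = (between / (m - 1)) / (within / (n - m))
    return r2, f


def permutation_p_value(D, labels, n_perm=999, seed=0):
    """Monte Carlo p-value for F; label permutation is an assumed scheme."""
    rng = np.random.default_rng(seed)
    _, f_obs = pseudo_r2_f(D, labels)
    hits = sum(pseudo_r2_f(D, rng.permutation(labels))[1] >= f_obs
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): two well-separated groups on a line.
x = np.array([0.0, 0.3, 0.5, 0.9, 1.1, 5.0, 5.2, 5.6, 5.9, 6.3])
labels = np.array([0] * 5 + [1] * 5)
D = np.abs(x[:, None] - x[None, :])      # absolute difference, a metric

r2, f = pseudo_r2_f(D, labels)
p_value = permutation_p_value(D, labels)
print(r2, f, p_value)
```

Permuting the labels keeps the cluster sizes fixed and breaks any association between units and clusters, which is one common way to build a reference distribution for F when the F-distribution does not apply.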