Tutorial 1: General Introduction to SDA

Yin-Jing Tien (田銀錦)
Institute of Statistical Science, Academia Sinica
gary@stat.sinica.edu.tw
June 13, 2014

Symbolic Data Analysis (SDA) (Diday 1987)

Texts:
• Billard and Diday (2006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.
• Diday, E., Noirhomme-Fraiture, M. (2008): Symbolic Data Analysis and the SODAS Software. John Wiley & Sons Ltd., Chichester, England.

Symbolic data (Diday 1987)

• Classical data: the units are individuals, and each variable takes a single value.
  Example (a single player): age = 25, eye color = blue
• Symbolic data: the units are symbolic units (concepts: groups).
  Example (a team): interval: age range = [20, 36]; multiple values: eye color = {blue, brown, black}

Symbolic data analysis: when?

• When we are interested in higher-level units (concepts: groups/classes).
• When the initial data are composed of symbolic data tables.
• When the data are big.

Symbolic data types

Symbolic data types (quantitative)

• Multi-valued symbolic random variable: Y takes one or more values, e.g. {12, 23, 20}
• Interval-valued symbolic random variable: Y takes values in an interval, e.g. [17, 25]
• Modal multi-valued: e.g. {0.5, 3/8; 1.5, 4/8; 2, 1/8}
  $Y(u) = \{\eta_k, \pi_k;\ k = 1, 2, \ldots, s_u\}$, where the $\eta_k$ are the values and the $\pi_k$ their weights
• Modal interval-valued (histogram): e.g. {[12, 40), 1/7; [40, 60), 2/7; [60, 80], 4/7}
  $Y(u) = \{[a_{uk}, b_{uk}), p_{uk};\ k = 1, 2, \ldots, s_u\}$

Symbolic data types (qualitative)

• Multi-valued symbolic random variable: Y takes one or more values, e.g. bird colors, Y = color
• Modal multi-valued: $Y(u) = \{\eta_k, \pi_k;\ k = 1, 2, \ldots, s_u\}$, e.g. {single, 3/8; married, 5/8}

Basic Descriptive Statistics: Interval Value

Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the $i$th variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \ldots, k$.
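To make the notation concrete, here is a minimal Python sketch of how interval data $[a_{ci}, b_{ci}]$, together with the multi-valued and modal types listed above, might be built by aggregating classical records into concepts. The `players` records, the `aggregate` helper, and the dictionary keys are all hypothetical and chosen only for illustration; they are not data or code from the tutorial.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Hypothetical classical records (team, age, eye color); not data from the tutorial.
players = [
    ("A", 20, "blue"),
    ("A", 25, "blue"),
    ("A", 36, "brown"),
    ("B", 22, "black"),
    ("B", 31, "brown"),
]


def aggregate(records):
    """Group classical records by team and build one symbolic description per team."""
    groups = defaultdict(list)
    for team, age, eye in records:
        groups[team].append((age, eye))

    symbolic = {}
    for team, rows in groups.items():
        ages = [age for age, _ in rows]
        eyes = [eye for _, eye in rows]
        symbolic[team] = {
            # Interval-valued: [a, b] = [min, max] of the observed ages.
            "age": (min(ages), max(ages)),
            # Multi-valued: the set of observed categories.
            "eye_colors": set(eyes),
            # Modal multi-valued: each category with its relative frequency.
            "eye_modal": {c: Fraction(n, len(eyes)) for c, n in Counter(eyes).items()},
        }
    return symbolic


teams = aggregate(players)
print(teams["A"]["age"])  # (20, 36)
```

Taking the minimum and maximum is only one possible way to form an interval; other summaries, such as quantiles, would give narrower intervals that are less sensitive to outliers.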
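Sample mean of $Z_i$:

$$\bar{Z}_i = \frac{1}{2k} \sum_{c=1}^{k} (a_{ci} + b_{ci})$$

Sample variance of $Z_i$:

$$S_{Z_i}^2 = \frac{1}{3k} \sum_{c=1}^{k} \left( a_{ci}^2 + a_{ci} b_{ci} + b_{ci}^2 \right) - \frac{1}{4k^2} \left[ \sum_{c=1}^{k} (a_{ci} + b_{ci}) \right]^2$$

Rewrite $S_{Z_i}^2$ as

$$S_{Z_i}^2 = \frac{1}{3k} \sum_{c=1}^{k} \left[ (a_{ci} - \bar{Z}_i)^2 + (a_{ci} - \bar{Z}_i)(b_{ci} - \bar{Z}_i) + (b_{ci} - \bar{Z}_i)^2 \right]$$

Total Variation = Within Variation + Between Variation

$$\text{Within Variation} = \frac{1}{k} \sum_{c=1}^{k} \frac{1}{3} \left[ (a_{ci} - \bar{I}_{ci})^2 + (a_{ci} - \bar{I}_{ci})(b_{ci} - \bar{I}_{ci}) + (b_{ci} - \bar{I}_{ci})^2 \right] = \frac{1}{k} \sum_{c=1}^{k} \frac{1}{12} (b_{ci} - a_{ci})^2$$

The second equality holds because $a_{ci} - \bar{I}_{ci} = -\frac{1}{2}(b_{ci} - a_{ci})$ and $b_{ci} - \bar{I}_{ci} = \frac{1}{2}(b_{ci} - a_{ci})$.

$$\text{Between Variation} = \frac{1}{k} \sum_{c=1}^{k} \frac{1}{3} \left[ (\bar{I}_{ci} - \bar{Z}_i)^2 + (\bar{I}_{ci} - \bar{Z}_i)(\bar{I}_{ci} - \bar{Z}_i) + (\bar{I}_{ci} - \bar{Z}_i)^2 \right] = \frac{1}{k} \sum_{c=1}^{k} (\bar{I}_{ci} - \bar{Z}_i)^2$$

where $\bar{I}_{ci} = \frac{1}{2}(a_{ci} + b_{ci})$ is the midpoint of the $c$th interval and $\bar{Z}_i = \frac{1}{2k} \sum_{c=1}^{k} (a_{ci} + b_{ci})$.

Similarity between Variables (interval-valued data) (Billard and Diday (2006))

The empirical covariance function between $Z_i$ and $Z_j$ is

$$Cov(Z_i, Z_j) = \frac{1}{4k} \sum_{c=1}^{k} (a_{ci} + b_{ci})(a_{cj} + b_{cj}) - \frac{1}{4k^2} \left[ \sum_{c=1}^{k} (a_{ci} + b_{ci}) \right] \left[ \sum_{c=1}^{k} (a_{cj} + b_{cj}) \right]$$

The empirical correlation coefficient between $Z_i$ and $Z_j$ is

$$r(Z_i, Z_j) = \frac{Cov(Z_i, Z_j)}{S_{Z_i} S_{Z_j}}$$

where $S_{Z_i}$ and $S_{Z_j}$ are the square roots of the sample variances $S_{Z_i}^2$ and $S_{Z_j}^2$.

Distance between concepts

Definition 7.6: The Cartesian join $A \oplus B$ between two sets $A$ and $B$ is their componentwise union,
$$A \oplus B = (A_1 \cup B_1, \ldots, A_p \cup B_p)$$

Definition 7.7: The Cartesian meet $A \otimes B$ between two sets $A$ and $B$ is their componentwise intersection,
$$A \otimes B = (A_1 \cap B_1, \ldots, A_p \cap B_p)$$

Distance between concepts (multi-valued)

The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991)

Write $\omega_{1j}$ and $\omega_{2j}$ for the sets of values taken by concepts $\omega_1$ and $\omega_2$ on variable $Y_j$, and $|\cdot|$ for the number of values in a set. Then

$$D(\omega_1, \omega_2) = \sum_{j=1}^{p} \left[ D_{j1}(\omega_1, \omega_2) + D_{j2}(\omega_1, \omega_2) \right]$$
$$D_{j1}(\omega_1, \omega_2) = \frac{\big| |\omega_{1j}| - |\omega_{2j}| \big|}{|\omega_{1j} \cup \omega_{2j}|} \quad \text{(relative sizes)}$$
$$D_{j2}(\omega_1, \omega_2) = \frac{|\omega_{1j}| + |\omega_{2j}| - 2\,|\omega_{1j} \cap \omega_{2j}|}{|\omega_{1j} \cup \omega_{2j}|} \quad \text{(relative content)}$$

Example: Color and Habitat of Birds (Table 7.2), with $Y_1$ = Color, $Y_2$ = Habitat and $p = 2$.

For $Y_1$:
  $D_{11}(\omega_1, \omega_2) = |2 - 1| / 2 = 1/2$
  $D_{12}(\omega_1, \omega_2) = (2 + 1 - 2 \times 1) / 2 = 1/2$
For $Y_2$:
  $D_{21}(\omega_1, \omega_2) = |2 - 1| / 2 = 1/2$
  $D_{22}(\omega_1, \omega_2) = (2 + 1 - 2 \times 1) / 2 = 1/2$

The Gowda-Diday dissimilarity is
$$D(\omega_1, \omega_2) = (1/2 + 1/2) + (1/2 + 1/2) = 2$$

Normalized (adjusted for scale), with weights $1/|\mathcal{Y}_1| = 1/3$ and $1/|\mathcal{Y}_2| = 1/2$, where $|\mathcal{Y}_j|$ is the number of possible values of $Y_j$:
$$D(\omega_1, \omega_2) = (1/2 + 1/2)/3 + (1/2 + 1/2)/2 = 5/6$$

The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994)

$$\phi_j(\omega_1, \omega_2) = |\omega_1 \oplus \omega_2| - |\omega_1 \otimes \omega_2| + \gamma \left( 2\,|\omega_1 \otimes \omega_2| - |\omega_1| - |\omega_2| \right)$$

with the join, the meet, and the sizes all taken on variable $Y_j$.

For $Y_1$: $\phi_1(\omega_1, \omega_2) = 2 - 1 + \gamma (2 \times 1 - 2 - 1) = 1 - \gamma$
For $Y_2$: $\phi_2(\omega_1, \omega_2) = 2 - 1 + \gamma (2 \times 1 - 2 - 1) = 1 - \gamma$

Taking $\gamma = 0.5$ gives $\phi_1 = \phi_2 = 0.5$.

Unweighted Minkowski distance:
$$D_q(\omega_1, \omega_2) = (0.5^q + 0.5^q)^{1/q}$$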
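The multi-valued calculations above can be checked with a short Python sketch. Table 7.2 is not reproduced in these notes, so the value sets below are hypothetical: they were chosen only so that the set sizes (2 and 1), the meet (1), the join (2), and the domain sizes (3 and 2) match the worked example. The function names are likewise illustrative.

```python
def gowda_diday_multi(x, y):
    """Gowda-Diday dissimilarity for one multi-valued variable (x, y are sets)."""
    join = x | y  # Cartesian join: union
    meet = x & y  # Cartesian meet: intersection
    relative_size = abs(len(x) - len(y)) / len(join)
    relative_content = (len(x) + len(y) - 2 * len(meet)) / len(join)
    return relative_size + relative_content


def ichino_yaguchi_multi(x, y, gamma=0.5):
    """Ichino-Yaguchi dissimilarity for one multi-valued variable."""
    join, meet = x | y, x & y
    return len(join) - len(meet) + gamma * (2 * len(meet) - len(x) - len(y))


def minkowski(phis, q):
    """Unweighted Minkowski distance of order q over per-variable values."""
    return sum(p ** q for p in phis) ** (1.0 / q)


# Hypothetical value sets that only reproduce the cardinalities of the example.
omega1 = {"color": {"red", "black"}, "habitat": {"urban", "rural"}}
omega2 = {"color": {"red"}, "habitat": {"rural"}}
domain_size = {"color": 3, "habitat": 2}  # number of possible values per variable

gd = {v: gowda_diday_multi(omega1[v], omega2[v]) for v in omega1}
gd_total = sum(gd.values())
gd_normalized = sum(gd[v] / domain_size[v] for v in gd)
iy = [ichino_yaguchi_multi(omega1[v], omega2[v]) for v in omega1]

print(gd_total)                 # 2.0
print(round(gd_normalized, 4))  # 0.8333
print(iy)                       # [0.5, 0.5]
print(minkowski(iy, q=1))       # 1.0
```

The sketch assumes both value sets are non-empty; two empty sets would make the join empty and cause a division by zero.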
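Weighted Minkowski distance:
$$D_q(\omega_1, \omega_2) = \left( (0.5/3)^q + (0.5/2)^q \right)^{1/q}$$

Distance between concepts (interval-valued)

Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the $i$th variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \ldots, k$. For two concepts $\omega_1$ and $\omega_2$, their intervals on variable $Y_j$ are $[a_{1j}, b_{1j}]$ and $[a_{2j}, b_{2j}]$.

The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991)

$$D(\omega_1, \omega_2) = \sum_{j=1}^{p} \left[ D_{j1}(\omega_1, \omega_2) + D_{j2}(\omega_1, \omega_2) + D_{j3}(\omega_1, \omega_2) \right]$$

where, for the variable $Y_j$,

$$D_{j1}(\omega_1, \omega_2) = \frac{\left| (b_{1j} - a_{1j}) - (b_{2j} - a_{2j}) \right|}{k_j} \quad \text{(relative length)}$$
$$D_{j2}(\omega_1, \omega_2) = \frac{(b_{1j} - a_{1j}) + (b_{2j} - a_{2j}) - 2 I_j}{k_j} \quad \text{(relative content)}$$
$$D_{j3}(\omega_1, \omega_2) = \frac{|a_{1j} - a_{2j}|}{|\mathcal{Y}_j|} \quad \text{(relative position)}$$

with

• $k_j = |Max(b_{1j}, b_{2j}) - Min(a_{1j}, a_{2j})|$: the length of the entire distance spanned by $\omega_1$ and $\omega_2$
• $I_j = |Max(a_{1j}, a_{2j}) - Min(b_{1j}, b_{2j})|$ if the intervals overlap, otherwise 0: the length of the intersection
• $|\mathcal{Y}_j| = Max_c(b_{cj}) - Min_c(a_{cj})$: the total length in $\mathcal{Y}_j$ covered by the observed values of $Y_j$

The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994)

$$\phi_j(\omega_1, \omega_2) = |\omega_1 \oplus \omega_2| - |\omega_1 \otimes \omega_2| + \gamma \left( 2\,|\omega_1 \otimes \omega_2| - |\omega_1| - |\omega_2| \right)$$
$$\omega_1 \oplus \omega_2 = [Min(a_1, a_2), Max(b_1, b_2)]$$
$$\omega_1 \otimes \omega_2 = [Max(a_1, a_2), Min(b_1, b_2)] \quad \text{(empty if there is no intersection)}$$

Here $|\cdot|$ denotes the length of an interval.

The generalized Minkowski distance of order $q \geq 1$ between two interval-valued observations $\xi(\omega_1)$ and $\xi(\omega_2)$ is

$$d_q(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ w_j^* \, \phi_j(\omega_1, \omega_2) \right]^q \right)^{1/q}$$

where $\phi_j(\omega_1, \omega_2)$ is the Ichino-Yaguchi distance and $w_j^*$ is a weight function associated with variable $Y_j$.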
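The interval-valued definitions above translate fairly directly into code. The following Python sketch is one possible reading, not a reference implementation: the two concepts and their intervals are hypothetical, the function names are illustrative, and the weight is applied inside the power, $[w_j^* \phi_j]^q$, which is an assumption made to match the weighted bird example.

```python
def length(iv):
    """Length b - a of an interval (a, b)."""
    return iv[1] - iv[0]


def join_length(x, y):
    """Length of the join [min(a1, a2), max(b1, b2)]."""
    return max(x[1], y[1]) - min(x[0], y[0])


def meet_length(x, y):
    """Length of the meet [max(a1, a2), min(b1, b2)]; 0 if the intervals do not overlap."""
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return hi - lo if hi >= lo else 0.0


def gowda_diday(x, y, span):
    """Gowda-Diday dissimilarity for one interval-valued variable."""
    k = join_length(x, y)  # assumed > 0, i.e. not two identical single points
    i = meet_length(x, y)
    relative_length = abs(length(x) - length(y)) / k
    relative_content = (length(x) + length(y) - 2 * i) / k
    relative_position = abs(x[0] - y[0]) / span
    return relative_length + relative_content + relative_position


def ichino_yaguchi(x, y, gamma=0.5):
    """Ichino-Yaguchi dissimilarity for one interval-valued variable."""
    m = meet_length(x, y)
    return join_length(x, y) - m + gamma * (2 * m - length(x) - length(y))


def minkowski(phis, q=2, weights=None):
    """Generalized Minkowski distance; the weight sits inside the power (assumption)."""
    if weights is None:
        weights = [1.0] * len(phis)
    return sum((w * p) ** q for w, p in zip(weights, phis)) ** (1.0 / q)


# Hypothetical concepts with p = 2 interval-valued variables each.
omega1 = [(0.0, 10.0), (5.0, 9.0)]
omega2 = [(4.0, 6.0), (8.0, 12.0)]
observations = [omega1, omega2]

# Span of each variable: max upper bound minus min lower bound over all observations.
spans = [
    max(obs[j][1] for obs in observations) - min(obs[j][0] for obs in observations)
    for j in range(2)
]

gd = sum(gowda_diday(x, y, s) for x, y, s in zip(omega1, omega2, spans))
phis = [ichino_yaguchi(x, y) for x, y in zip(omega1, omega2)]

print(phis)                  # [4.0, 3.0]
print(minkowski(phis, q=1))  # 7.0
print(minkowski(phis, q=2))  # 5.0
```

With $\gamma = 0.5$ and overlapping intervals, each $\phi_j$ in this sketch reduces to $(|a_{1j} - a_{2j}| + |b_{1j} - b_{2j}|)/2$, which is a quick sanity check on the output.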
For the generalized Minkowski distance $d_q$: when $q = 1$ it is the City Block distance; when $q = 2$ it is the Euclidean distance.

$$\phi_j(\omega_1, \omega_2) = |a_{1j} - a_{2j}| + |b_{1j} - b_{2j}|$$

The Hausdorff Distance (Chavent and Lechevallier, 2002)

$$\phi_j(\omega_1, \omega_2) = Max\left( |a_{1j} - a_{2j}|,\ |b_{1j} - b_{2j}| \right), \qquad d(\omega_1, \omega_2) = \sum_{j=1}^{p} \phi_j(\omega_1, \omega_2)$$

The Euclidean Hausdorff Distance

$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) \right]^2 \right)^{1/2}$$

where $\phi_j(\omega_1, \omega_2)$ is the Hausdorff distance.

The Normalized Euclidean Hausdorff Distance

$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \frac{\phi_j(\omega_1, \omega_2)}{H_j} \right]^2 \right)^{1/2}, \qquad H_j^2 = \frac{1}{2k^2} \sum_{s=1}^{k} \sum_{t=1}^{k} \left[ \phi_j(\omega_s, \omega_t) \right]^2$$

The Span Normalized Euclidean Hausdorff Distance

$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \frac{\phi_j(\omega_1, \omega_2)}{|\mathcal{Y}_j|} \right]^2 \right)^{1/2}$$

where the span is $|\mathcal{Y}_j| = Max_c(b_{cj}) - Min_c(a_{cj})$.

Example: take only the first 3 observations of the veterinary data

Gowda-Diday dissimilarity

$k_1 = |Max(b_{11}, b_{21}) - Min(a_{11}, a_{21})| = |180 - 120| = 60$
$k_2 = |355 - 222.2| = 132.8$
$I_1 = |Max(a_{11}, a_{21}) - Min(b_{11}, b_{21})| = |158 - 160| = 2$
$I_2 = |322 - 354| = 32$
$|\mathcal{Y}_1| = Max_c(b_{c1}) - Min_c(a_{c1}) = 185 - 120 = 65$
$|\mathcal{Y}_2| = 355 - 117.2 = 237.8$

$$D(\omega_1, \omega_2) = \sum_{j=1}^{2} \left[ D_{j1}(\omega_1, \omega_2) + D_{j2}(\omega_1, \omega_2) + D_{j3}(\omega_1, \omega_2) \right]$$
$$= \underbrace{\left[ \frac{|60 - 2|}{60} + \frac{60 + 2 - 2 \times 2}{60} + \frac{|120 - 158|}{65} \right]}_{Y_1} + \underbrace{\left[ \frac{|131.8 - 33|}{132.8} + \frac{131.8 + 33 - 2 \times 32}{132.8} + \frac{|222.2 - 322|}{237.8} \right]}_{Y_2} = 4.44$$

The Ichino-Yaguchi dissimilarity

$$\phi_1(\omega_1, \omega_2) = |180 - 120| - |160 - 158| + \gamma \left( 2\,|160 - 158| - |180 - 120| - |160 - 158| \right) = 58 + \gamma(-58)$$
$$\phi_2(\omega_1, \omega_2) = |355 - 222.2| - |354 - 322| + \gamma \left( 2\,|354 - 322| - |354 - 222.2| - |355 - 322| \right) = 100.8 + \gamma(-100.8)$$

These values enter the generalized Minkowski distance

$$d_q(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ w_j^* \, \phi_j(\omega_1, \omega_2) \right]^q \right)^{1/q}$$

which is the City Block distance when $q = 1$ and the Euclidean distance when $q = 2$.

The Hausdorff Distance

$\phi_1(\omega_1, \omega_2) = Max(|120 - 158|, |180 - 160|) = 38$
$\phi_2(\omega_1, \omega_2) = Max(|222.2 - 322|, |354 - 355|) = 99.8$
$d(\omega_1, \omega_2) = 38 + 99.8 = 137.8$

The Euclidean Hausdorff Distance

$d(\omega_1, \omega_2) = (38^2 + 99.8^2)^{1/2} = 106.79$

The Normalized Euclidean Hausdorff Distance

$H_1^2 = \frac{1}{2 \times 3^2} \left[ 38^2 + 55^2 + 27^2 \right] = 288.78$
$H_2^2 = 5150.39$
$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{2} \left[ \phi_j(\omega_1, \omega_2) / H_j \right]^2 \right)^{1/2} = 2.633$

The Span Normalized Euclidean Hausdorff Distance

$|\mathcal{Y}_1| = 185 - 120 = 65$
$|\mathcal{Y}_2| = 355 - 117.2 = 237.8$
$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{2} \left[ \phi_j(\omega_1, \omega_2) / |\mathcal{Y}_j| \right]^2 \right)^{1/2} = 0.720$

Distance between concepts (groups) of interval-valued data

Comparison of between-concept distance measures

Interval-valued symbolic data analysis

• Books (Bock and Diday (2000); Billard and Diday (2003, 2006); Diday and Noirhomme-Fraiture (2008))
• PCA (Chouakria, Cazes, and Diday (2000); Palumbo and Lauro (2003); Gioia and Lauro (2006); Hamada, Minami, and Mizuta (2008))
• Clustering analysis (Brito (2002); Souza and de Carvalho (2004); Chavent et al. (2006); Bock (2008))
• Discriminant analysis (Lauro, Verde, and Palumbo (2000); Duarte Silva and Brito (2006))
• MDS (Groenen et al. (2006); Minami and Mizuta (2008))
• Regression (Billard and Diday (2000); de Carvalho et al. (2004))

Visualization Tools for Symbolic Data (Analysis)

Symbolic Data Analysis Software

• SODAS (2003): free, from 2 European consortia
• SYR (2008): more professional, from the SYROKKO company, www.syrokko.com
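As a numerical check on the Hausdorff worked example earlier in this tutorial, the following Python sketch recomputes three of the distances. The interval endpoints for $\omega_1$ and $\omega_2$ and the two spans are the numbers quoted in that example; the variable and function names are illustrative. The normalized variant using $H_j$ is left out because it needs the third observation, which is not listed in these notes.

```python
import math

# Endpoints quoted in the worked example: (a, b) for Y1 and Y2.
omega1 = [(120.0, 180.0), (222.2, 354.0)]
omega2 = [(158.0, 160.0), (322.0, 355.0)]
spans = [65.0, 237.8]  # |Y_1| and |Y_2| as given in the example


def hausdorff_component(x, y):
    """Per-variable Hausdorff distance: max(|a1 - a2|, |b1 - b2|)."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))


phis = [hausdorff_component(x, y) for x, y in zip(omega1, omega2)]

hausdorff = sum(phis)
euclidean = math.sqrt(sum(p ** 2 for p in phis))
span_normalized = math.sqrt(sum((p / s) ** 2 for p, s in zip(phis, spans)))

print(f"{hausdorff:.1f}")        # 137.8
print(f"{euclidean:.2f}")        # 106.79
print(f"{span_normalized:.3f}")  # 0.720
```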