ω - Institute of Statistical Science, Academia Sinica

Tutorial 1
General Introduction to SDA
Yin-Jing Tien (田銀錦)
Institute of Statistical Science
Academia Sinica
gary@stat.sinica.edu.tw
June 13, 2014
Symbolic Data Analysis (SDA)
(Diday 1987)
Text:
Billard and Diday (2006):
Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.
Diday, E., Noirhomme-Fraiture, M. (2008):
Symbolic Data Analysis and the SODAS Software. John Wiley & Sons Ltd., Chichester, England.
Symbolic data
(Diday 1987)
• Classical Data : Individuals: single value
Single player
age = 25, eye color = blue
• Symbolic Data : Symbolic units (Concept: groups)
Team
interval : age range = [20, 36]
multiple values: eye color = {blue,brown,black}
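A minimal sketch of how classical rows might be aggregated into one symbolic unit, assuming hypothetical player records (the individual rows and the helper name `to_symbolic` are illustrative, not from the slides); only the resulting team description mirrors the example above.

```python
# Hypothetical classical data: one row per player (single values only).
players = [
    {"team": "A", "age": 20, "eye_color": "blue"},
    {"team": "A", "age": 25, "eye_color": "brown"},
    {"team": "A", "age": 36, "eye_color": "black"},
]

def to_symbolic(records, group_key):
    """Aggregate classical rows into one symbolic description per group."""
    groups = {}
    for row in records:
        groups.setdefault(row[group_key], []).append(row)
    symbolic = {}
    for name, rows in groups.items():
        ages = [r["age"] for r in rows]
        symbolic[name] = {
            "age": (min(ages), max(ages)),                # interval-valued
            "eye_color": {r["eye_color"] for r in rows},  # multi-valued
        }
    return symbolic

teams = to_symbolic(players, "team")
print(teams["A"]["age"])        # (20, 36)
print(teams["A"]["eye_color"])  # set of the three colors
```

Taking the minimum and maximum is only one way to form an interval; other summaries (quantiles, for instance) are possible and would change the symbolic description.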
Symbolic data analysis
When?
• When we are interested in the higher-level units
(Concept: groups/classes).
• When the initial data are composed of
symbolic data tables
• When the data is BIG
Symbolic data types
Symbolic data types (quantitative)
Multi-valued symbolic random variable Y is
one that takes one or more values
{12,23,20}
Interval-valued symbolic random variable Y is
one that takes values in an interval
[17, 25]
Modal multi-valued
{0.5, 3/8, 1.5, 4/8, 2, 1/8}
$Y(u) = \{\eta_k, \pi_k;\ k = 1, 2, \ldots, s_u\}$
Modal interval-valued (Histogram)
{[12,40), 1/7, [40, 60), 2/7, [60, 80], 4/7}
$Y(u) = \{[a_{uk}, b_{uk}), p_{uk};\ k = 1, 2, \ldots, s_u\}$
Symbolic data types (qualitative)
Multi-valued symbolic random variable Y is
one that takes one or more values
E.g., Bird Colors, Y=color
Modal multi-valued
$Y(u) = \{\eta_k, \pi_k;\ k = 1, 2, \ldots, s_u\}$
{single, 3/8, married, 5/8}
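The two modal types can be sketched as relative-frequency summaries of raw observations. The raw lists below are hypothetical, chosen only so that the results reproduce the slide examples {single, 3/8, married, 5/8} and {[12,40), 1/7, [40,60), 2/7, [60,80], 4/7}; the function names are illustrative as well.

```python
from collections import Counter
from fractions import Fraction

def modal_multi_valued(values):
    """Categories with their relative frequencies."""
    n = len(values)
    return {cat: Fraction(c, n) for cat, c in Counter(values).items()}

def histogram_valued(values, edges):
    """Relative frequencies over bins [e0, e1), [e1, e2), ...; the last bin is closed."""
    n = len(values)
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            lo, hi = edges[k], edges[k + 1]
            is_last = k == len(counts) - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[k] += 1
                break  # values outside all bins are silently ignored
    return [((edges[k], edges[k + 1]), Fraction(counts[k], n))
            for k in range(len(counts))]

status = ["single"] * 3 + ["married"] * 5   # hypothetical raw data
ages = [15, 45, 55, 62, 70, 75, 80]         # hypothetical raw data

modal = modal_multi_valued(status)
hist = histogram_valued(ages, [12, 40, 60, 80])
print(modal)
print(hist)
```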
Basic Descriptive Statistics: Interval Value
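Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the ith
variable with k concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, c = 1, 2, . . . , k.
Sample Mean of $Z_i$ is
\[ \bar{Z}_i = \frac{1}{2k} \sum_{c=1}^{k} (a_{ci} + b_{ci}) \]
Sample Variance of $Z_i$ is
\[ S_{Z_i}^2 = \frac{1}{3k} \sum_{c=1}^{k} \left( a_{ci}^2 + a_{ci} b_{ci} + b_{ci}^2 \right) - \frac{1}{4k^2} \left[ \sum_{c=1}^{k} (a_{ci} + b_{ci}) \right]^2 \]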
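As a rough illustration, the sample mean and variance of an interval-valued variable can be computed as below. The variance uses the centered form that the slides give when they rewrite the variance. The two intervals are the height intervals of ω1 and ω2 as read off the arithmetic of the worked veterinary example later in the slides; two concepts is far too few for real analysis and serves only to show the mechanics.

```python
def interval_mean(intervals):
    """Sample mean: (1 / 2k) * sum of (a_c + b_c)."""
    k = len(intervals)
    return sum(a + b for a, b in intervals) / (2 * k)

def interval_variance(intervals):
    """Sample variance in centered form:
    (1 / 3k) * sum of [(a - m)^2 + (a - m)(b - m) + (b - m)^2]."""
    k = len(intervals)
    m = interval_mean(intervals)
    return sum(
        (a - m) ** 2 + (a - m) * (b - m) + (b - m) ** 2
        for a, b in intervals
    ) / (3 * k)

heights = [(120, 180), (158, 160)]  # (a_c, b_c) for two concepts
mean_h = interval_mean(heights)
var_h = interval_variance(heights)
print(mean_h)  # 154.5
print(var_h)
```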
Basic Descriptive Statistics: Interval Value
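Rewrite $S_{Z_i}^2$ as
\[ S_{Z_i}^2 = \frac{1}{3k} \sum_{c=1}^{k} \left[ (a_{ci} - \bar{Z}_i)^2 + (a_{ci} - \bar{Z}_i)(b_{ci} - \bar{Z}_i) + (b_{ci} - \bar{Z}_i)^2 \right] \]
Total Variation = Within Variation + Between Variation
\[ \text{Within Variation} = \frac{1}{k} \sum_{c=1}^{k} \frac{1}{3} \left[ (a_{ci} - \bar{I}_{ci})^2 + (a_{ci} - \bar{I}_{ci})(b_{ci} - \bar{I}_{ci}) + (b_{ci} - \bar{I}_{ci})^2 \right] = \frac{1}{k} \sum_{c=1}^{k} \frac{1}{12} (b_{ci} - a_{ci})^2 \]
\[ \text{Between Variation} = \frac{1}{k} \sum_{c=1}^{k} \frac{1}{3} \left[ (\bar{I}_{ci} - \bar{Z}_i)^2 + (\bar{I}_{ci} - \bar{Z}_i)(\bar{I}_{ci} - \bar{Z}_i) + (\bar{I}_{ci} - \bar{Z}_i)^2 \right] \]
where
\[ \bar{I}_{ci} = \frac{1}{2} (a_{ci} + b_{ci}), \qquad \bar{Z}_i = \frac{1}{2k} \sum_{c=1}^{k} (a_{ci} + b_{ci}) \]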
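The decomposition can be checked numerically. The sketch below is an illustration under the formulas above, not a reference implementation; the function name is hypothetical and the two intervals are the height intervals of ω1 and ω2 taken from the arithmetic of the veterinary example later in the slides.

```python
def interval_variation(intervals):
    """Return (total, within, between) variation for a list of (a, b) intervals."""
    k = len(intervals)
    zbar = sum(a + b for a, b in intervals) / (2 * k)
    total = sum(
        (a - zbar) ** 2 + (a - zbar) * (b - zbar) + (b - zbar) ** 2
        for a, b in intervals
    ) / (3 * k)
    # Within: each interval contributes (b - a)^2 / 12.
    within = sum((b - a) ** 2 / 12 for a, b in intervals) / k
    # Between: the bracket has three identical terms, so one third of it
    # is simply the squared deviation of the midpoint from the mean.
    between = sum(((a + b) / 2 - zbar) ** 2 for a, b in intervals) / k
    return total, within, between

heights = [(120, 180), (158, 160)]
total, within, between = interval_variation(heights)
print(total, within, between)
print(abs(total - (within + between)) < 1e-9)  # True
```

The within part is the variance of a uniform distribution on each interval, which is where the factor 1/12 comes from.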
Similarity between Variables (interval-valued data)
(Billard and Diday (2006))
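The empirical covariance function between $Z_i$ and $Z_j$ is
\[ \mathrm{Cov}(Z_i, Z_j) = \frac{1}{4k} \sum_{c=1}^{k} (a_{ci} + b_{ci})(a_{cj} + b_{cj}) - \frac{1}{4k^2} \left[ \sum_{c=1}^{k} (a_{ci} + b_{ci}) \right] \left[ \sum_{c=1}^{k} (a_{cj} + b_{cj}) \right] \]
The empirical correlation coefficient between $Z_i$ and $Z_j$ is
\[ r(Z_i, Z_j) = \frac{\mathrm{Cov}(Z_i, Z_j)}{S_{Z_i} S_{Z_j}} \]
where $S_{Z_i}$ and $S_{Z_j}$ are the sample standard deviations of $Z_i$ and $Z_j$.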
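A small sketch of the covariance and correlation for two interval-valued variables, assuming the interval variance defined in the descriptive-statistics slides. The data are the height and weight intervals of ω1 and ω2 as read off the worked veterinary example; with only two concepts the numbers are purely illustrative, and all function names are hypothetical.

```python
import math

def interval_mean(iv):
    return sum(a + b for a, b in iv) / (2 * len(iv))

def interval_std(iv):
    """Square root of the interval sample variance (centered form)."""
    m = interval_mean(iv)
    var = sum((a - m) ** 2 + (a - m) * (b - m) + (b - m) ** 2
              for a, b in iv) / (3 * len(iv))
    return math.sqrt(var)

def interval_cov(zi, zj):
    """Empirical covariance between two interval-valued variables."""
    k = len(zi)
    cross = sum((ai + bi) * (aj + bj)
                for (ai, bi), (aj, bj) in zip(zi, zj)) / (4 * k)
    sum_i = sum(a + b for a, b in zi)
    sum_j = sum(a + b for a, b in zj)
    return cross - sum_i * sum_j / (4 * k * k)

def interval_corr(zi, zj):
    return interval_cov(zi, zj) / (interval_std(zi) * interval_std(zj))

heights = [(120, 180), (158, 160)]
weights = [(222.2, 354), (322, 355)]
cov = interval_cov(heights, weights)
r = interval_corr(heights, weights)
print(round(cov, 4))  # 113.4
print(r)
```

Because this covariance depends only on the interval midpoints while each standard deviation also includes the within-interval spread, the resulting coefficient stays inside [-1, 1].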
Distance between concepts
Definition 7.6: The Cartesian join A⊕B between two sets A and B is their
componentwise union, A⊕B = (A1⊕B1, . . . , Ap⊕Bp).
Definition 7.7: The Cartesian meet A⊗B between two sets A and B is their
componentwise intersection, A⊗B = (A1⊗B1, . . . , Ap⊗Bp).
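Distance between concepts (Multi-valued)
The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991)
$D_{1j}(\omega_1, \omega_2) = \big|\, |\omega_{1j}| - |\omega_{2j}| \,\big| / k_j$ (relative sizes)
$D_{2j}(\omega_1, \omega_2) = \big( |\omega_{1j}| + |\omega_{2j}| - 2\,|\omega_{1j} \otimes \omega_{2j}| \big) / k_j$ (relative content)
where |·| is the number of values in a set and $k_j = |\omega_{1j} \oplus \omega_{2j}|$.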
Distance between concepts (Multi-valued)
Example: Color and Habitat of Birds (Table 7.2)
Y1 = Color, Y2 = Habitat
For Y1: D11(ω1, ω2) = |2 − 1|/2 = 1/2
D21(ω1, ω2) = (2 + 1 − 2×1)/2 = 1/2
For Y2: D12(ω1, ω2) = |2 − 1|/2 = 1/2
D22(ω1, ω2) = (2 + 1 − 2×1)/2 = 1/2
The Gowda-Diday dissimilarity (p = 2)
D(ω1, ω2) = (1/2 + 1/2) + (1/2 + 1/2) = 2
Normalized (adjusted for scale) with weights 1/|Yj|, where |Y1| = 3 and |Y2| = 2:
D(ω1, ω2) = (1/2 + 1/2)/3 + (1/2 + 1/2)/2 = 5/6
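The example can be reproduced with ordinary Python sets. Table 7.2 is not shown in the slides, so the category names below are hypothetical; they are chosen only so that the set sizes match the arithmetic above (two values against one, with one in common, and domain sizes 3 and 2).

```python
def gowda_diday_multi(x, y, domain_sizes=None):
    """Gowda-Diday dissimilarity for multi-valued observations.

    x, y: lists of non-empty sets, one set per variable.
    domain_sizes: optional |Y_j| per variable; if given, each variable's
    contribution is divided by |Y_j| (the normalized version).
    """
    total = 0.0
    for j, (s1, s2) in enumerate(zip(x, y)):
        k = len(s1 | s2)                                  # size of the join
        d1 = abs(len(s1) - len(s2)) / k                   # relative sizes
        d2 = (len(s1) + len(s2) - 2 * len(s1 & s2)) / k   # relative content
        weight = 1.0 / domain_sizes[j] if domain_sizes else 1.0
        total += weight * (d1 + d2)
    return total

# Hypothetical categories (color, habitat) for two concepts.
w1 = [{"red", "black"}, {"urban", "rural"}]
w2 = [{"red"}, {"rural"}]

d_plain = gowda_diday_multi(w1, w2)
d_norm = gowda_diday_multi(w1, w2, domain_sizes=[3, 2])
print(d_plain)            # 2.0
print(round(d_norm, 4))   # 0.8333, i.e. 5/6
```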
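Distance between concepts (Multi-valued)
The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994)
$\phi_j(\omega_1, \omega_2) = |\omega_{1j} \oplus \omega_{2j}| - |\omega_{1j} \otimes \omega_{2j}| + \gamma \left( 2\,|\omega_{1j} \otimes \omega_{2j}| - |\omega_{1j}| - |\omega_{2j}| \right)$
For Y1: ϕ1(ω1, ω2) = 2 − 1 + γ(2×1 − 2 − 1)
= 1 − γ
For Y2: ϕ2(ω1, ω2) = 2 − 1 + γ(2×1 − 2 − 1)
= 1 − γ
Taking γ = 0.5, ϕ1 = ϕ2 = 0.5
Unweighted Minkowski distance
$D_q(\omega_1, \omega_2) = (0.5^q + 0.5^q)^{1/q}$
Weighted Minkowski distance (each ϕj divided by |Yj|, with |Y1| = 3 and |Y2| = 2)
$D_q(\omega_1, \omega_2) = \left( (0.5/3)^q + (0.5/2)^q \right)^{1/q}$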
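A minimal sketch of the Ichino-Yaguchi component for multi-valued data and the Minkowski combination used in the example. The sets are hypothetical stand-ins with the same sizes as in the slide (the table itself is not reproduced), and the `scales` argument mirrors the slide's weighted form, in which each ϕj is divided by 3 or 2 before being raised to the power q.

```python
def ichino_yaguchi_multi(s1, s2, gamma=0.5):
    """phi_j = |join| - |meet| + gamma * (2|meet| - |s1| - |s2|)."""
    join, meet = s1 | s2, s1 & s2
    return len(join) - len(meet) + gamma * (2 * len(meet) - len(s1) - len(s2))

def minkowski(phis, q, scales=None):
    """(sum of (phi_j / scale_j)^q)^(1/q); scales default to 1 (unweighted)."""
    if scales is None:
        scales = [1.0] * len(phis)
    return sum((phi / s) ** q for phi, s in zip(phis, scales)) ** (1.0 / q)

# Hypothetical categories: two values versus one, one in common.
w1 = [{"red", "black"}, {"urban", "rural"}]
w2 = [{"red"}, {"rural"}]

phis = [ichino_yaguchi_multi(a, b, gamma=0.5) for a, b in zip(w1, w2)]
d1_unweighted = minkowski(phis, q=1)
d1_weighted = minkowski(phis, q=1, scales=[3, 2])
d2_unweighted = minkowski(phis, q=2)
print(phis)                    # [0.5, 0.5]
print(d1_unweighted)           # 1.0
print(round(d1_weighted, 4))   # 0.4167
```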
Distance between concepts (Interval-valued)
Let Zi = (I1i, I2i, . . . , Iki)T be the interval data for the ith variable with k
concepts, where Ici = [aci, bci], c = 1, 2, . . . , k.
The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991)
\[ D(\omega_1, \omega_2) = \sum_{j=1}^{p} \left[ D_{j1}(\omega_1, \omega_2) + D_{j2}(\omega_1, \omega_2) + D_{j3}(\omega_1, \omega_2) \right] \]
For the variable Yj, with ω1 = [a1j, b1j] and ω2 = [a2j, b2j]:
$D_{j1}(\omega_1, \omega_2) = \left| (b_{1j} - a_{1j}) - (b_{2j} - a_{2j}) \right| / k_j$ (relative length)
$D_{j2}(\omega_1, \omega_2) = \left( b_{1j} - a_{1j} + b_{2j} - a_{2j} - 2 I_j \right) / k_j$ (relative content)
$D_{j3}(\omega_1, \omega_2) = \left| a_{1j} - a_{2j} \right| / |Y_j|$ (relative position)
where
$k_j = \left| \max(b_{1j}, b_{2j}) - \min(a_{1j}, a_{2j}) \right|$ is the length of the entire distance spanned by ω1 and ω2,
$I_j = \left| \max(a_{1j}, a_{2j}) - \min(b_{1j}, b_{2j}) \right|$ if the intervals overlap, and 0 otherwise, is the length of the intersection,
$|Y_j| = \max_c(b_{cj}) - \min_c(a_{cj})$ is the total length in Y covered by the observed values of Yj.
Distance between concepts (Interval-valued)
The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994)
$\phi_j(\omega_1, \omega_2) = |\omega_{1j} \oplus \omega_{2j}| - |\omega_{1j} \otimes \omega_{2j}| + \gamma \left( 2\,|\omega_{1j} \otimes \omega_{2j}| - |\omega_{1j}| - |\omega_{2j}| \right)$
$\omega_1 \oplus \omega_2 = [\min(a_1, a_2), \max(b_1, b_2)]$
$\omega_1 \otimes \omega_2 = [\max(a_1, a_2), \min(b_1, b_2)]$ (empty if no intersection)
The generalized Minkowski distance of order q ≥ 1 between two interval-valued
observations ξ(ω1) and ξ(ω2) is
\[ d_q(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} w_j^* \left[ \phi_j(\omega_1, \omega_2) \right]^q \right)^{1/q} \]
where ϕj(ω1, ω2) is the Ichino-Yaguchi distance and $w_j^*$ is a weight function
associated with variable Yj.
When q = 1 → City Block distance
When q = 2 → Euclidean distance
$\phi_j(\omega_1, \omega_2) = |a_{1j} - a_{2j}| + |b_{1j} - b_{2j}|$
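Distance between concepts (Interval-valued)
The Hausdorff Distance (Chavent and Lechevallier, 2002)
\[ d(\omega_1, \omega_2) = \sum_{j=1}^{p} \phi_j(\omega_1, \omega_2), \qquad \phi_j(\omega_1, \omega_2) = \max\left( |a_{1j} - a_{2j}|, |b_{1j} - b_{2j}| \right) \]
The Euclidean Hausdorff Distance
\[ d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) \right]^2 \right)^{1/2} \]
where ϕj(ω1, ω2) is the Hausdorff Distance.
The Normalized Euclidean Hausdorff Distance
\[ d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) / H_j \right]^2 \right)^{1/2} \]
where
\[ H_j^2 = \frac{1}{2k^2} \sum_{s=1}^{k} \sum_{t=1}^{k} \left[ \phi_j(\omega_s, \omega_t) \right]^2 \]
The Span Normalized Euclidean Hausdorff Distance
\[ d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) / |Y_j| \right]^2 \right)^{1/2} \]
where the span $|Y_j| = \max_c(b_{cj}) - \min_c(a_{cj})$.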
Distance between concepts (Interval-valued)
Example: Take the first 3 observations
only of veterinary data
Gowda-Diday dissimilarity
$D_{j1}(\omega_1, \omega_2) = \left| (b_{1j} - a_{1j}) - (b_{2j} - a_{2j}) \right| / k_j$
$D_{j2}(\omega_1, \omega_2) = \left( b_{1j} - a_{1j} + b_{2j} - a_{2j} - 2 I_j \right) / k_j$
$D_{j3}(\omega_1, \omega_2) = \left| a_{1j} - a_{2j} \right| / |Y_j|$
$k_1 = \left| \max(b_{11}, b_{21}) - \min(a_{11}, a_{21}) \right| = 180 - 120 = 60$
$k_2 = 355 - 222.2 = 132.8$
$I_1 = \left| \max(a_{11}, a_{21}) - \min(b_{11}, b_{21}) \right| = |158 - 160| = 2$
$I_2 = |322 - 354| = 32$
$|Y_1| = \max_c(b_{c1}) - \min_c(a_{c1}) = 185 - 120 = 65$
$|Y_2| = 355 - 117.2 = 237.8$
\[ D(\omega_1, \omega_2) = \sum_{j=1}^{2} \left[ D_{j1}(\omega_1, \omega_2) + D_{j2}(\omega_1, \omega_2) + D_{j3}(\omega_1, \omega_2) \right] \]
= [ |60 − 2|/60 + (60 + 2 − 2×2)/60 + |120 − 158|/65 ]   (Y1)
+ [ |131.8 − 33|/132.8 + (131.8 + 33 − 2×32)/132.8 + |222.2 − 322|/237.8 ]   (Y2)
= 4.44
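The worked value can be checked with a short function. This is a sketch under the definitions above, not the authors' code; the intervals for ω1 and ω2 are read off the arithmetic in the example (heights [120, 180] and [158, 160], weights [222.2, 354] and [322, 355]), and the spans 65 and 237.8 are taken as stated because the third observation is not listed in full.

```python
def gowda_diday_interval(x, y, spans):
    """Gowda-Diday dissimilarity between two interval-valued observations.

    x, y: lists of (a, b) intervals, one per variable.
    spans: |Y_j| per variable (max upper bound minus min lower bound
    over all concepts), supplied by the caller.
    Assumes the two intervals are not the same single point (k_j > 0).
    """
    total = 0.0
    for (a1, b1), (a2, b2), span in zip(x, y, spans):
        k = abs(max(b1, b2) - min(a1, a2))       # length spanned by both
        overlap = max(a1, a2) <= min(b1, b2)
        inter = abs(max(a1, a2) - min(b1, b2)) if overlap else 0.0
        d1 = abs((b1 - a1) - (b2 - a2)) / k      # relative length
        d2 = ((b1 - a1) + (b2 - a2) - 2 * inter) / k  # relative content
        d3 = abs(a1 - a2) / span                 # relative position
        total += d1 + d2 + d3
    return total

w1 = [(120, 180), (222.2, 354)]
w2 = [(158, 160), (322, 355)]
spans = [185 - 120, 355 - 117.2]

d = gowda_diday_interval(w1, w2, spans)
print(round(d, 2))  # 4.44
```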
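Distance between concepts (Interval-valued)
The Ichino-Yaguchi dissimilarity
$\phi_j(\omega_1, \omega_2) = |\omega_{1j} \oplus \omega_{2j}| - |\omega_{1j} \otimes \omega_{2j}| + \gamma \left( 2\,|\omega_{1j} \otimes \omega_{2j}| - |\omega_{1j}| - |\omega_{2j}| \right)$
$\omega_1 \oplus \omega_2 = [\min(a_1, a_2), \max(b_1, b_2)]$
$\omega_1 \otimes \omega_2 = [\max(a_1, a_2), \min(b_1, b_2)]$ (empty if no intersection)
ϕ1(ω1, ω2) = |180 − 120| − |160 − 158| + γ(2|160 − 158| − |180 − 120| − |160 − 158|)
= 58 + γ(−58)
ϕ2(ω1, ω2) = |355 − 222.2| − |354 − 322| + γ(2|354 − 322| − |354 − 222.2| − |355 − 322|)
= 100.8 + γ(−100.8)
The generalized Minkowski distance
\[ d_q(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} w_j^* \left[ \phi_j(\omega_1, \omega_2) \right]^q \right)^{1/q} \]
When q = 1 → City Block distance
When q = 2 → Euclidean distance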
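A sketch of the interval-valued Ichino-Yaguchi component and the generalized Minkowski combination, using the ω1 and ω2 intervals from the example. The value γ = 0.5 and the unit weights are assumptions made here for illustration (the example leaves γ symbolic), and the helper names are hypothetical.

```python
def length(interval):
    """Length of an interval; an empty meet (None) has length 0."""
    return 0.0 if interval is None else interval[1] - interval[0]

def join(x, y):
    return (min(x[0], y[0]), max(x[1], y[1]))

def meet(x, y):
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return (lo, hi) if lo <= hi else None   # None when there is no intersection

def ichino_yaguchi(x, y, gamma):
    m = length(meet(x, y))
    return length(join(x, y)) - m + gamma * (2 * m - length(x) - length(y))

def generalized_minkowski(phis, q, weights=None):
    """(sum of w_j * phi_j^q)^(1/q); weights default to 1 (an assumption)."""
    if weights is None:
        weights = [1.0] * len(phis)
    return sum(w * phi ** q for w, phi in zip(weights, phis)) ** (1.0 / q)

w1 = [(120, 180), (222.2, 354)]
w2 = [(158, 160), (322, 355)]

gamma = 0.5  # assumed value for illustration
phis = [ichino_yaguchi(x, y, gamma) for x, y in zip(w1, w2)]
city_block = generalized_minkowski(phis, q=1)
euclidean = generalized_minkowski(phis, q=2)
print([round(p, 4) for p in phis])  # [29.0, 50.4]
print(round(city_block, 4))         # 79.4
```

With γ = 0 the components reduce to 58 and 100.8, the γ-free parts of the expressions in the example.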
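Distance between concepts (Interval-valued)
The Hausdorff Distance
\[ d(\omega_1, \omega_2) = \sum_{j=1}^{p} \phi_j(\omega_1, \omega_2), \qquad \phi_j(\omega_1, \omega_2) = \max\left( |a_{1j} - a_{2j}|, |b_{1j} - b_{2j}| \right) \]
ϕ1(ω1, ω2) = max(|120 − 158|, |180 − 160|) = 38
ϕ2(ω1, ω2) = max(|222.2 − 322|, |354 − 355|) = 99.8
d(ω1, ω2) = 38 + 99.8 = 137.8
The Euclidean Hausdorff Distance
\[ d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) \right]^2 \right)^{1/2} = (38^2 + 99.8^2)^{1/2} = 106.79 \]
The Normalized Euclidean Hausdorff Distance
\[ d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) / H_j \right]^2 \right)^{1/2} = 2.633 \]
where
\[ H_j^2 = \frac{1}{2k^2} \sum_{s=1}^{k} \sum_{t=1}^{k} \left[ \phi_j(\omega_s, \omega_t) \right]^2 \]
\[ H_1^2 = \frac{1}{2 \times 3^2} \left[ 38^2 + 55^2 + 27^2 \right] = 288.78, \qquad H_2^2 = 5150.39 \]
The Span Normalized Euclidean Hausdorff Distance
\[ d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) / |Y_j| \right]^2 \right)^{1/2} = 0.720 \]
where $|Y_j| = \max_c(b_{cj}) - \min_c(a_{cj})$:
|Y1| = 185 − 120 = 65
|Y2| = 355 − 117.2 = 237.8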
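The four Hausdorff variants differ only in how each component is scaled, which the sketch below makes explicit. It is an illustration rather than a definitive implementation: the intervals for ω1 and ω2 are read off the example's arithmetic, while the values H1² = 288.78, H2² = 5150.39 and the spans 65 and 237.8 are taken from the slide as given, since the third observation is not listed in full.

```python
import math

def hausdorff_component(x, y):
    """phi_j = max(|a1 - a2|, |b1 - b2|) for intervals x = (a1, b1), y = (a2, b2)."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

def hausdorff(x, y):
    """Sum of the componentwise Hausdorff distances."""
    return sum(hausdorff_component(u, v) for u, v in zip(x, y))

def euclidean_hausdorff(x, y, scales=None):
    """Square root of the sum of (phi_j / scale_j)^2.

    scales=None gives the plain Euclidean Hausdorff distance; passing H_j
    or |Y_j| gives the normalized or span-normalized version.
    """
    if scales is None:
        scales = [1.0] * len(x)
    return math.sqrt(sum((hausdorff_component(u, v) / s) ** 2
                         for u, v, s in zip(x, y, scales)))

w1 = [(120, 180), (222.2, 354)]
w2 = [(158, 160), (322, 355)]

H = [math.sqrt(288.78), math.sqrt(5150.39)]   # from the slide's H_j^2 values
spans = [185 - 120, 355 - 117.2]              # |Y_1|, |Y_2|

d_sum = hausdorff(w1, w2)
d_euc = euclidean_hausdorff(w1, w2)
d_norm = euclidean_hausdorff(w1, w2, scales=H)
d_span = euclidean_hausdorff(w1, w2, scales=spans)
print(round(d_sum, 1))   # 137.8
print(round(d_euc, 2))   # 106.79
print(round(d_norm, 3))  # 2.633
print(round(d_span, 3))  # 0.72
```

Note that the slide's arithmetic for H1² adds each unordered pair of concepts once, so a literal double sum over all ordered pairs (s, t) would give twice that value.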
Distance between concepts (groups) of interval-valued data
Comparison of between-concept distance measures
Interval-valued symbolic data analysis
• Books
(Bock and Diday (2000), Billard and Diday (2003,
2006), and Diday and Noirhomme-Fraiture (2008))
• PCA
(Chouakria, Cazes, and Diday (2000); Palumbo and
Lauro (2003); Gioia and Lauro (2006); Hamada,
Minami, and Mizuta (2008))
• Clustering analysis
(Brito (2002); Souza and de
Carvalho (2004); Chavent et al. (2006); Bock (2008))
• Discriminant analysis (Lauro, Verde, and Palumbo (2000);
Duarte Silva and Brito (2006))
• MDS (Groenen et al. (2006); Minami and Mizuta (2008))
• Regression (Billard and Diday (2000); de Carvalho et al.
(2004))
Visualization Tools for Symbolic Data (Analysis)
Symbolic Data Analysis Software
• SODAS (2003)
Free, from two European consortia
• SYR (2008)
More professional, from the SYROKKO company
www.syrokko.com