1
Fuzzy cluster validity indices
Presenter: Wang Qina
Date: 2013-06-21
2
• Datasets
▫ Iris
▫ Wine
▫ Glass
▫ Vowel
• Validity indices
▫ Entropy
▫ Purity
▫ Xie-Beni
• Results
3
• Iris
• Wine
• Glass
• Vowel
4
1. Iris
• The Iris dataset uses measurements of iris flowers as its data. It contains 150 samples divided into 3 classes of 50 samples each, and each sample has 4 attributes. It is a very commonly used test and training set in data mining and classification.
• Creator: R.A. Fisher
▫ Fisher's paper is a classic in the field and is
referenced frequently to this day. (See Duda &
Hart, for example.)
 Fisher, R.A., "The use of multiple measurements in taxonomic problems", Annals of Eugenics, Vol. 7, Part II, pp. 179-188, 1936.
 Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
 http://archive.ics.uci.edu/ml/
5
• Attribute information:
▫ 1. sepal length in cm
▫ 2. sepal width in cm
▫ 3. petal length in cm
▫ 4. petal width in cm
▫ class:
▫ -- Iris Setosa
▫ -- Iris Versicolour
▫ -- Iris Virginica
6
• Summary Statistics:

  Attribute      Min   Max   Mean   SD     Class correlation
  Sepal length   4.3   7.9   5.84   0.83    0.7826
  Sepal width    2.0   4.4   3.05   0.43   -0.4194
  Petal length   1.0   6.9   3.76   1.76    0.9490 (high!)
  Petal width    0.1   2.5   1.20   0.76    0.9565 (high!)
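• The table above can be reproduced from the raw data. Below is a minimal Python sketch, assuming scikit-learn's bundled copy of the Iris data (which is known to differ from the UCI file in a couple of samples, so the figures may deviate slightly); the class correlation is taken as the Pearson correlation between each attribute and the numeric class label.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris data (150 samples, 4 attributes, 3 classes).
iris = load_iris()
X, y = iris.data, iris.target

for name, column in zip(iris.feature_names, X.T):
    # Pearson correlation between the attribute and the numeric class label.
    class_corr = np.corrcoef(column, y)[0, 1]
    print(f"{name:20s} min={column.min():.1f} max={column.max():.1f} "
          f"mean={column.mean():.2f} sd={column.std(ddof=1):.2f} "
          f"corr={class_corr:.4f}")
```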
7
• Class Distribution: 33.3% for each of 3 classes.
[Figure: "Iris Data" scatter plot of SL (sepal length) against PW (petal width), with the classes setosa, versicolour and virginica plotted separately.]
8
2. Wine
• These data are the results of a chemical analysis
of wines grown in the same region in Italy but
derived from three different cultivars.
• The analysis determined the quantities of 13
constituents found in each of the three types of
wines. (178 samples)
9
• The attributes are (donated by Riccardo Leardi):
▫ 1) Alcohol
▫ 2) Malic acid
▫ 3) Ash
▫ 4) Alcalinity of ash
▫ 5) Magnesium
▫ 6) Total phenols
▫ 7) Flavanoids
▫ 8) Nonflavanoid phenols
▫ 9) Proanthocyanins
▫ 10) Color intensity
▫ 11) Hue
▫ 12) OD280/OD315 of diluted wines
▫ 13) Proline
10
• Past Usage:
• (1) S. Aeberhard, D. Coomans and O. de Vel,
Comparison of classifiers in high dimensional
settings, Tech. Rep. no. 92-02, (1992), Dept. of
Computer Science and Dept. of Mathematics and
Statistics, James Cook University of North
Queensland. (Also submitted to Technometrics).
11
• The data was used, together with many other datasets, for comparing various classifiers. The classes are separable, though only RDA has achieved 100% correct classification. (RDA: 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
• (All results were obtained with the leave-one-out technique.) In a classification context, this is a well-posed problem with "well behaved" class structures: a good data set for a first test of a new classifier, but not very challenging.
12
• (2) S. Aeberhard, D. Coomans and O. de Vel, "The classification performance of RDA", Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Journal of Chemometrics).
• Here, the data was used to illustrate the superior performance of a new appreciation function with RDA.
13
• Attribute information:
▫ All attributes are continuous. No statistics are available, but it is suggested to standardize the variables for certain uses (e.g. for use with classifiers which are NOT scale invariant); a small standardization sketch follows the class distribution below.
• Class Distribution: number of instances per class
(178)
▫ class 1: 59
▫ class 2: 71
▫ class 3: 48
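• To act on the standardization suggestion above, here is a minimal sketch using scikit-learn's bundled copy of the Wine data; 1-NN only mirrors one of the classifiers cited earlier, and the exact leave-one-out accuracy may differ slightly from the figures quoted there.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the Wine data (178 samples, 13 attributes, 3 cultivars).
X, y = load_wine(return_X_y=True)

# z-transform every attribute, then classify with 1-NN,
# evaluated with the leave-one-out technique as in the cited reports.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
accuracy = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
print(f"1-NN leave-one-out accuracy on z-transformed data: {accuracy:.3f}")
```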
14
3. Glass Identification Data Set
• From USA Forensic Science Service; 6 types of
glass; defined in terms of their oxide content (i.e.
Na, Fe, K, etc)
• Creator: B. German
• Description: 214 instances, 9 attributes, 7 classes
15
• Past Usage:
▫ -- Rule Induction in Forensic Science
▫ -- Ian W. Evett and Ernest J. Spiehler
▫ -- Central Research Establishment Home Office
Forensic Science Service Aldermaston, Reading,
Berkshire RG7 4PN
▫ -- Unknown technical note number (sorry, not
listed here)
▫ -- General Results: nearest neighbor held its own
with respect to the rule-based system
16
• The study of classification of types of glass was
motivated by criminological investigation. At
the scene of the crime, the glass left can be used
as evidence...if it is correctly identified!
17
• Attribute information:
1. RI: refractive index
2. Na: Sodium (unit measurement: weight percent
in corresponding oxide, as are attributes 3-9)
3. Mg: Magnesium
4. Al: Aluminum
5. Si: Silicon
6. K: Potassium
7. Ca: Calcium
8. Ba: Barium
9. Fe: Iron
18
• Type of glass: (class attribute)
▫ -- 1 building_windows_float_processed
▫ -- 2 building_windows_non_float_processed
▫ -- 3 vehicle_windows_float_processed
▫ -- 4 vehicle_windows_non_float_processed (none
in this database)
▫ -- 5 containers
▫ -- 6 tableware
▫ -- 7 headlamps
19
• Summary statistics:

  Attribute   Min      Max      Mean      SD       Correlation with class
  RI          1.5112   1.5339   1.5184    0.003    -0.1642
  Na          10.73    17.38    13.4079   0.8166    0.503
  Mg          0        4.49     2.6845    1.4424   -0.7447
  Al          0.29     3.5      1.4449    0.4993    0.5988
  Si          69.81    75.41    72.6509   0.7745    0.1515
  K           0        6.21     0.4971    0.6522   -0.01
  Ca          5.43     16.19    8.957     1.4232    0.0007
  Ba          0        3.15     0.175     0.4972    0.5751
  Fe          0        0.51     0.057     0.0974   -0.1879
20
• Class Distribution: (out of 214 total instances)
▫ -- 163 Window glass (building windows and vehicle
windows)
 -- 87 float processed
 -- 70 building windows
 -- 17 vehicle windows
 -- 76 non-float processed
 -- 76 building windows
 -- 0 vehicle windows
▫ -- 51 Non-window glass
 -- 13 containers
 -- 9 tableware
 -- 29 headlamps
21
4. Japanese Vowels Data Set
• This dataset records 640 time series of 12 LPC
cepstrum coefficients taken from nine male
speakers.
• Creator: Mineichi Kudo, Jun Toyama, Masaru
Shimbo
22
• Number of Instances (Utterances)
• Training: 270 (30 utterances by 9 speakers. See
file 'size_ae.train'.)
• Testing: 370 (24-88 utterances by the same 9 speakers, recorded on different occasions. See file 'size_ae.test'.)
23
• Each speaker is a set of consecutive blocks.
• ae.train: there are 30 blocks for each speaker.
▫ speaker 1 : blocks 1-30
▫ speaker 2 : blocks 31-60
▫ so on up to speaker 9.
• ae.test: speakers 1 to 9 have the corresponding
number of blocks: 31, 35, 88, 44, 29, 24, 40, 50,
29.
▫ Thus, blocks 1-31 represent speaker 1 (31
utterances of /ae/), blocks 32-66 represent
speaker 2 (35 utterances of /ae/), and so on.
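• A small sketch for reading the utterance blocks, assuming the ae.train / ae.test text format described in the UCI documentation: each line holds the 12 LPC cepstrum coefficients of one frame, and consecutive blocks are separated by a blank line.

```python
from typing import List
import numpy as np

def read_blocks(path: str) -> List[np.ndarray]:
    """Read one of the ae.* files into a list of (frames x 12) arrays,
    one array per utterance block, assuming blocks are separated by blank lines."""
    blocks, current = [], []
    with open(path) as f:
        for line in f:
            if line.strip():
                current.append([float(v) for v in line.split()])
            elif current:
                blocks.append(np.array(current))
                current = []
    if current:
        blocks.append(np.array(current))
    return blocks

# Example (hypothetical path): blocks 1-30 of ae.train belong to speaker 1,
# blocks 31-60 to speaker 2, and so on.
# train_blocks = read_blocks("ae.train")
```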
24
• Past Usage
▫ M. Kudo, J. Toyama and M. Shimbo. (1999).
"Multidimensional Curve Classification Using
Passing-Through Regions". Pattern Recognition
Letters, Vol. 20, No. 11--13, pages 1103--1111.
• Number of Attributes
▫ 12 real values
25
• Entropy
• Purity
• Xie-Beni
26
1. Entropy
• Historically, entropy was first introduced by Clausius, who was famed for his gift for framing physical concepts.
• For any closed reversible cycle, $\oint \frac{dQ}{T} = 0$, where $dQ$ is the heat flowing into the system and $T$ is the thermodynamic temperature. The change of $S$ from one state $O$ to another state $A$ is then defined as
  $$S_A - S_O = \int_O^A \frac{dQ}{T}$$
  taken along a reversible process.
27
• For ease of calculation one need not follow the path actually traversed; one may instead imagine a reversible process connecting the initial and final states. In such a process the heat Q is completely converted into work W:
  $$\Delta S = \int \frac{dQ}{T} = \frac{1}{T}\int dQ = \frac{Q}{T} = \frac{W}{T}, \qquad
  \frac{W}{T} = \frac{1}{T}\int_{V_1}^{V_2} p\,dV = nR\,\log_e\frac{V_2}{V_1} = Nk\,\log_e\frac{V_2}{V_1}$$
  The ideal-gas equation of state $pV = nRT = NkT$ is used in the calculation.
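• For example, for an isothermal doubling of the volume of an ideal gas ($V_2 = 2V_1$), the formula above gives
  $$\Delta S = nR\,\log_e\frac{V_2}{V_1} = nR\,\ln 2 \approx 0.693\,nR.$$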
28
"An immortal monument"
• The Boltzmann relation gives entropy its microscopic interpretation:
  $$S = k\,\log W$$
• Here $k$ is the Boltzmann constant, $W$ is the number of microscopic states (complexions) corresponding to a given macroscopic state, and log denotes the logarithm, more precisely the natural logarithm $\log_e$ or $\ln$.
• The Boltzmann relation ties the macroscopic quantity $S$ to the number of microscopic states $W$, building a bridge between the macroscopic and the microscopic; it both explains the physical meaning of $W$ and gives the entropy function its statistical (microscopic) interpretation.
29
Indispensable: information
• Information can ride on many different carriers; this is one of its important characteristics. For example, humans transmit information through language, symbols and so on, while information inside a living organism is carried by electrochemical changes transmitted through the nervous system.
• During transmission, information is inevitably degraded by noise or decoding errors; the ideal transmission is faithful, i.e. it delivers the information unchanged. On the other hand, another important characteristic of information is that it is not used up when it is used, and it can moreover be copied and disseminated.
30
A perfect match: information and entropy
• First, consider P possibilities of equal probability. For example, a Morse signal has P = 2 and a Latin letter has P = 27. Once one of the P possibilities has been selected, we have gained information; the larger P is, the larger the amount of information obtained by the choice. The information I is therefore defined as
  $$I = K\,\log_e P,$$
  where K is a proportionality constant.
31
• Since the probabilities of mutually independent choices multiply, the corresponding amounts of information under this definition add. If an amount of information is the result of a chain of n mutually independent choices, each made between 0 and 1, then the total is P = 2^n, so
  $$I = K\,\log_e P = nK\,\log_e 2.$$
• If I is identified with n, then
  $$K = \frac{1}{\log_e 2} = \log_2 e.$$
32
• The unit of information fixed in this way is the bit, universally used in computer science; if instead K is set equal to the Boltzmann constant k, the amount of information is measured in units of entropy.
• Information corresponds to a negative contribution to the total entropy of a physical system: information = decrease of the entropy S = increase of the negentropy N.
• The negentropy principle of information: Shannon's entropy differs from the standard definition of thermodynamic entropy by a sign, so in the irreversible process of transmission, Shannon information only decreases (because of noise interference and transmission errors) and never increases, exhibiting the character of negentropy.
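• As a quick worked check of this definition, the Latin-letter case P = 27 mentioned above gives
  $$I = K\,\log_e P = \log_2 P = \log_2 27 \approx 4.75\ \text{bits}.$$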
33
Cluster validity functions: entropy formulas
• Shannon entropy
• Bezdek's partition entropy
• Entropy results
34
• The Shannon information entropy is an important concept in statistical theory. For a probability distribution
  $$p = (p_1, p_2, \ldots, p_n), \qquad \sum_{i=1}^{n} p_i = 1,$$
• the Shannon information entropy is defined as
  $$H = -\sum_{i=1}^{n} p_i \ln p_i.$$
• Basic results about the Shannon information entropy are:
  1) $0 \le H \le \ln n$;
  2) $H = \ln n \iff p = \left(\tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)$.
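• A minimal Python sketch of this definition (the function name is illustrative); it also checks the results above numerically:

```python
import numpy as np

def shannon_entropy(p) -> float:
    """Shannon information entropy H = -sum_i p_i ln p_i (zero probabilities ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# H reaches its maximum ln(n) only for the uniform distribution.
print(shannon_entropy([1/3, 1/3, 1/3]), np.log(3))  # both ~1.0986
print(shannon_entropy([0.5, 0.25, 0.25]))           # ~1.0397 < ln(3)
```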
35
• Bezdek defined the partition entropy of a fuzzy partition by directly imitating the Shannon entropy formula:
  $$H(U; c) = -\frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{n} \mu_{ij}\,\ln\mu_{ij}.$$
• The experimental results are scored with the clustering entropy
  $$Ent_j = -\sum_{i=1}^{c}\frac{n_{ij}}{n_j}\ln\!\left(\frac{n_{ij}}{n_j}\right), \qquad Entropy = \sum_{j=1}^{k}\frac{n_j}{n}\,Ent_j,$$
  where the actual dataset contains c classes and the current algorithm has partitioned it into k clusters.
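• A small sketch of both quantities as written above; U is assumed to be a c x n fuzzy membership matrix whose columns sum to 1, and the clustering entropy assumes hard assignments given as non-negative integer label arrays.

```python
import numpy as np

def partition_entropy(U: np.ndarray) -> float:
    """Bezdek's partition entropy H(U; c) = -(1/n) sum_ij u_ij ln u_ij
    for a c x n fuzzy membership matrix U."""
    n = U.shape[1]
    u = U[U > 0]                          # ignore zero memberships
    return float(-np.sum(u * np.log(u)) / n)

def clustering_entropy(true_labels: np.ndarray, cluster_labels: np.ndarray) -> float:
    """Entropy of a hard clustering against the true classes:
    Ent_j = -sum_i (n_ij / n_j) ln(n_ij / n_j), Entropy = sum_j (n_j / n) Ent_j."""
    n = len(true_labels)
    total = 0.0
    for j in np.unique(cluster_labels):
        members = true_labels[cluster_labels == j]
        n_j = len(members)
        p = np.bincount(members) / n_j    # class proportions inside cluster j
        p = p[p > 0]
        total += (n_j / n) * float(-np.sum(p * np.log(p)))
    return total
```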
36
2. Purity
• $purity(D_i) = \max_j \Pr_i(c_j)$
• The set of classes in dataset D is $C = (c_1, c_2, \ldots, c_k)$.
• The clustering algorithm also produces k clusters, which partition the dataset D into k pairwise disjoint sets $D_1, D_2, \ldots, D_k$.
• $\Pr_i(c_j)$ is the proportion of the data points in cluster i (i.e. in $D_i$) that belong to class $c_j$; higher values are better.
• The total purity of the whole clustering result is
  $$purity_{total}(D) = \sum_{i=1}^{k}\frac{|D_i|}{|D|}\,purity(D_i).$$
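• A corresponding sketch for hard assignments (class and cluster labels given as non-negative integer arrays); a perfect clustering has total purity 1.

```python
import numpy as np

def total_purity(true_labels: np.ndarray, cluster_labels: np.ndarray) -> float:
    """purity_total(D) = sum_i |D_i|/|D| * max_j Pr_i(c_j)."""
    n = len(true_labels)
    total = 0.0
    for i in np.unique(cluster_labels):
        members = true_labels[cluster_labels == i]
        # Pr_i(c_j): proportion of the points in cluster i that belong to class c_j.
        proportions = np.bincount(members) / len(members)
        total += (len(members) / n) * proportions.max()
    return float(total)
```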
37
3. Xie-Beni
$$S = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{2}\,\lVert V_i - X_j\rVert^{2}}{n\,\min_{i\neq j}\lVert V_i - V_j\rVert^{2}}$$
• Definition 1: $d_{ij} = \mu_{ij}\,\lVert X_j - V_i\rVert$, the fuzzy deviation of $X_j$ from class $i$.
• Definition 2: $d_{\min} = \min_{i\neq j}\lVert V_i - V_j\rVert$ is called the separation of the fuzzy c-partition, where $d_{\min}$ is the minimum distance between cluster centroids.
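• A minimal sketch of the index as defined above (X is assumed to be the n x d data matrix, V the c x d matrix of cluster centroids, and U the c x n fuzzy membership matrix); lower values indicate a more compact and better separated partition.

```python
import numpy as np

def xie_beni(X: np.ndarray, V: np.ndarray, U: np.ndarray) -> float:
    """Xie-Beni index S: fuzzy within-cluster compactness over separation."""
    n = X.shape[0]
    # Numerator: sum_i sum_j u_ij^2 * ||V_i - X_j||^2
    sq_dists = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)    # c x n
    compactness = float(np.sum((U ** 2) * sq_dists))
    # Denominator: n * min_{i != j} ||V_i - V_j||^2
    centroid_sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(centroid_sq, np.inf)                            # exclude i == j
    return compactness / (n * float(centroid_sq.min()))
```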