Dynamic clustering of histograms using Wasserstein metric

Antonio Irpino¹, Rosanna Verde¹, and Yves Lechevallier²

¹ Facoltà di studi politici e per l'alta formazione europea e mediterranea, Seconda Università degli studi di Napoli, Caserta, Italy. irpino@unina.it, rosanna.verde@unina2.it
² INRIA Rocquencourt, France. lechevallier@inria.fr
Summary. In this paper we present a new distance, based on the Wasserstein metric, for clustering a set of data described by distributions with finite continuous support (histograms). The proposed distance allows us to define a measure of the inertia of data with respect to a barycenter that satisfies the Huygens theorem. This measure is therefore proposed as an allocation function in the dynamic clustering process, since it allows us to optimize the criterion of minimum within-class inertia with respect to the class barycenters. An application to real data illustrates the procedure.
Key words: clustering, histogram data, Mallows distance, inertia decomposition
1 Introduction
In many real applications, data are collected and/or represented by frequency distributions. If Y is a numerical and continuous variable, many distinct values $y_i$ can be observed. In such cases, the values are usually grouped into a smaller number H of consecutive and disjoint bins $I_h$ (groups, classes, intervals, etc.). The frequency distribution of the variable Y is obtained by counting the number $n_h$ of data values falling in each $I_h$. The histogram is then the typical graphical representation of the variable Y.
The interest in analyzing data expressed by frequency distributions, or histograms, is apparent in many fields of research. In particular, we may refer to the treatment of experimental data that are collected over a range of values, where the measurement instrument gives only approximate (or rounded) values. An example is given by sensors for air pollution control located in different zones of an urban area. The distributions of the levels of air pollutants measured across a day allow us to compare, and then to group into homogeneous clusters, the different monitored zones.
In a different context of analysis, histograms are the key to understanding digital
images. A digital image is basically a mosaic of square tiles or ”pixels” of uniform
color that are so tiny that the composite image appears uniform and smooth. Instead
of sorting them by color, they can be sorted into 256 levels of brightness from black
(value 0) to white (value 255) with 254 gray levels in between. The height of each
vertical ”bar” tells you how many pixels there are for that particular brightness level.
In the present paper, we propose to analyze data expressed by distributions as
well as “histograms” of values. The classification of this kind of data can be useful
to discover typologies of phenomena on the basis of the similarity of the frequency
distributions.
Dynamic Clustering (DC) ( [Did71], [DS76]) is proposed as a suitable method
to partition a set of data represented by frequency distributions. We recall that DC
needs to define a proximity function, to assign the individuals to the clusters, and
to choose a way to represent the clusters by means of a description that optimizes
a representation function. Further, the representation of a cluster, called prototype,
is consistent with the description of the clustered elements: i.e., if the data to be
clustered are distributions, then the prototype is also a distribution.
According to the nature of the data, we suggest using a distance based on the
Wasserstein metric [GS02]. In section 2, we outline the general schema of DC.
In section 3, after recalling the definition of histogram data, we present an extension of the Wasserstein distance in order to compare two histogram descriptions.
We also prove that it is possible to define an inertia measure among data that satisfies the Huygens theorem of decomposition of inertia, considering the prototypes as
barycenters. In section 4, we present some results on a climatic dataset. Section 5
reports some concluding remarks.
2 Dynamic clustering algorithm
A proximity measure δ is a non-negative function defined on each pair of elements of the space of descriptions of E, where the closer the individuals are, the lower the value assumed by δ. Let E be a set of n data characterized by p continuous variables $Y_j$ (j = 1, . . . , p). The dynamic clustering algorithm looks for the partition $P \in \mathbb{P}_K$ of E into K classes, among all the possible partitions $\mathbb{P}_K$, and for the vector $L \in \mathbb{L}_K$ of K prototypes representing the classes in P, such that the following fitting criterion ∆ between L and P is minimized:
$$\Delta(P^*, L^*) = \min\{\Delta(P, L) \mid P \in \mathbb{P}_K,\ L \in \mathbb{L}_K\}. \qquad (1)$$
Such a criterion is defined as the sum of dissimilarity or distance measures δ(yi , Gk )
of fitting between each element yi belonging to a class Ck ∈ P and the class representation Gk ∈ L:
$$\Delta(P, L) = \sum_{k=1}^{K} \sum_{y_i \in C_k} \delta(y_i, G_k).$$
A prototype $G_k$ associated with a class $C_k$ is an element of the space of descriptions of E, and it can be represented, in this context, as a histogram. The algorithm is
initialized by generating K random clusters or, alternatively, K random prototypes.
Generally the criterion ∆(P, L) is based on an additive distance on the p descriptors.
A similar approach has been proposed by [CDL03] in a different context of
analysis.
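As an illustration, the allocation/representation loop just described can be sketched in Python as follows. The function and parameter names are ours, not from the paper; `distance` and `prototype` are placeholders for any proximity function δ and any consistent prototype rule (here, the Wasserstein-based distance and the barycenter of Section 3).

```python
import random

def dynamic_clustering(data, k, distance, prototype, max_iter=100, seed=None):
    """Minimal sketch of the dynamic clustering (DC) iteration."""
    rng = random.Random(seed)
    # Initialization: k random prototypes drawn from the data.
    prototypes = rng.sample(data, k)
    assignment = [None] * len(data)
    for _ in range(max_iter):
        # Allocation step: assign each element to its closest prototype.
        new_assignment = [min(range(k), key=lambda c: distance(y, prototypes[c]))
                          for y in data]
        if new_assignment == assignment:   # partition is stable -> stop
            break
        assignment = new_assignment
        # Representation step: recompute the prototype of each non-empty cluster.
        for c in range(k):
            members = [y for y, a in zip(data, assignment) if a == c]
            if members:
                prototypes[c] = prototype(members)
    return assignment, prototypes
```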
3 Wasserstein metric for histogram data
Let Y be a continuous variable defined on a finite support $S = [\underline{y}; \overline{y}]$, where $\underline{y}$ and $\overline{y}$ are the minimum and maximum values of the domain of Y. The variable Y is supposed partitioned into a set of contiguous intervals (bins) $\{I_1, \ldots, I_h, \ldots, I_H\}$, where $I_h = [\underline{y}_h; \overline{y}_h)$. Given N observations on the variable Y, each semi-open interval $I_h$ is associated with a random variable equal to $\Psi(I_h) = \sum_{u=1}^{N} \Psi_{y_u}(I_h)$, where $\Psi_{y_u}(I_h) = 1$ if $y_u \in I_h$ and 0 otherwise. Thus, it is possible to associate with $I_h$ an empirical frequency $\pi_h = \Psi(I_h)/N$.
A histogram of Y is then the graphical representation in which each pair (Ih , πh )
(for h = 1, . . . , H) is represented by a vertical bar, with base interval Ih along the
horizontal axis and the area proportional to πh . Consider E as a set of n empirical
distributions Y(i) (i = 1, . . . , n).
In the case of a histogram description, it is possible to assume that $S(i) = [\underline{y}_i; \overline{y}_i]$, where $\underline{y}_i, \overline{y}_i \in \Re$. Consider a set of uniformly dense intervals $I_{hi} = [\underline{y}_{hi}; \overline{y}_{hi})$ such that:

i. $I_{li} \cap I_{mi} = \emptyset$, $l \neq m$;

ii. $\bigcup_{s=1,\ldots,n_i} I_{si} = [\underline{y}_i; \overline{y}_i]$;

then the support can also be written as $S(i) = \{I_{1i}, \ldots, I_{ui}, \ldots, I_{n_i i}\}$. In the present paper,
we denote with ψi (y) the (empirical) density function associated with the description
of i and with Ψi (y) its distribution function. It is possible to define the description
of Y(i) as:
$$Y(i) = \left\{ (I_{ui}, \pi_{ui}) \;\middle|\; \forall I_{ui} \in S(i);\ \pi_{ui} = \int_{I_{ui}} \psi_i(y)\,dy \geq 0 \right\} \quad \text{where} \quad \int_{S(i)} \psi_i(y)\,dy = 1.$$
If F and G are the distribution functions of µ and ν respectively, the Kantorovich
metric is defined by
$$d_W(\mu, \nu) := \int_{-\infty}^{+\infty} |F(x) - G(x)|\, dx = \int_{0}^{1} \left| F^{-1}(t) - G^{-1}(t) \right| dt.$$
In particular, we focus our attention on the following distance:
$$d_M(Y(i), Y(j)) := \sqrt{\int_{0}^{1} \left( \Psi_i^{-1}(w) - \Psi_j^{-1}(w) \right)^2 dw}. \qquad (2)$$
It is the well-known Mallows [Mal72] distance in L2, derived from the Wasserstein
metric. The main computational drawbacks are related to the invertibility of the
distribution functions. In the following, we show how it can be exactly and efficiently
computed when data are represented by histograms.
Given a histogram description of Y(i) by means of Hi weighted intervals as
follows:
$$Y(i) = \{ (I_{1i}, \pi_{1i}), \ldots, (I_{ui}, \pi_{ui}), \ldots, (I_{H_i i}, \pi_{H_i i}) \},$$
we define the following quantities wli in order to represent the cumulative weights
associated with the elementary intervals of Y (i):
$$w_{li} = \begin{cases} 0 & l = 0 \\ \sum_{h=1}^{l} \pi_{hi} & l = 1, \ldots, H_i \end{cases}. \qquad (3)$$
Using (3), and assuming a uniform density within each $I_h$, we may describe the empirical distribution function as:

$$\Psi_i(y) = w_{(l-1)i} + \left( y - \underline{y}_{li} \right) \frac{w_{li} - w_{(l-1)i}}{\overline{y}_{li} - \underline{y}_{li}} \quad \text{if } \underline{y}_{li} \leq y \leq \overline{y}_{li}.$$
Then, the inverse distribution function is a piecewise function defined as follows:
$$\Psi_i^{-1}(t) = \underline{y}_{li} + \frac{t - w_{(l-1)i}}{w_{li} - w_{(l-1)i}} \left( \overline{y}_{li} - \underline{y}_{li} \right), \quad w_{(l-1)i} \leq t < w_{li}.$$
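These two functions lend themselves to a direct implementation. The following Python sketch (with illustrative names, not taken from the paper) computes the cumulative weights of equation (3), the piecewise linear distribution function and its inverse for a histogram stored as a list of (lower bound, upper bound, weight) triples, assuming a uniform density inside each bin and strictly positive bin weights.

```python
from bisect import bisect_right

def cumulative_weights(hist):
    """w_0 = 0, w_l = sum of the first l bin weights (equation (3))."""
    w, total = [0.0], 0.0
    for _, _, p in hist:
        total += p
        w.append(total)
    return w

def cdf(hist, y):
    """Psi_i(y): linear interpolation inside the bin containing y."""
    w = cumulative_weights(hist)
    for l, (lo, hi, _) in enumerate(hist, start=1):
        if lo <= y <= hi:
            return w[l - 1] + (y - lo) * (w[l] - w[l - 1]) / (hi - lo)
    return 0.0 if y < hist[0][0] else 1.0

def quantile(hist, t):
    """Psi_i^{-1}(t): the piecewise linear inverse, for 0 <= t <= 1."""
    w = cumulative_weights(hist)
    # bin index l such that w_{l-1} <= t < w_l (clamped at the last bin for t = 1)
    l = min(max(bisect_right(w, t), 1), len(hist))
    lo, hi, _ = hist[l - 1]
    return lo + (t - w[l - 1]) * (hi - lo) / (w[l] - w[l - 1])
```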
To compute the distance between two histogram descriptions Y(i) and Y(j) we
need to identify a set of uniformly dense intervals to compare on the basis of the
two inverse distribution functions. Let us have the set of the cumulated weights of
the two distributions:
$$w = \left( w_{0i}, \ldots, w_{ui}, \ldots, w_{H_i i}, w_{0j}, \ldots, w_{vj}, \ldots, w_{H_j j} \right).$$

To compute the distance, we need to sort w, discarding repetitions. The sorted values can be represented by the vector
$$w = [w_0, \ldots, w_l, \ldots, w_m] \quad \text{where} \quad w_0 = 0,\ w_m = 1,\ \max(H_i, H_j) \leq m \leq H_i + H_j - 1.$$
With the same vector, it is possible to associate a vector of m weights π = [πl ] where
πl = wl − wl−1 . Using this quantity, it is possible to prove that the computation of
the squared distance between the two histogram descriptions is:
$$d_M^2(Y(i), Y(j)) := \sum_{l=1}^{m} \int_{w_{l-1}}^{w_l} \left( \Psi_i^{-1}(t) - \Psi_j^{-1}(t) \right)^2 dt. \qquad (4)$$
Each couple (wl−1 , wl ) allows us to identify two uniformly dense intervals, one for i
and one for j, having respectively the following bounds:
$$I_{li} = [\Psi_i^{-1}(w_{l-1}); \Psi_i^{-1}(w_l)] \quad \text{and} \quad I_{lj} = [\Psi_j^{-1}(w_{l-1}); \Psi_j^{-1}(w_l)].$$
For each interval, it is possible to compute the centers and the radii as follows:
$$c_{li} = \left( \Psi_i^{-1}(w_l) + \Psi_i^{-1}(w_{l-1}) \right)/2, \qquad r_{li} = \left( \Psi_i^{-1}(w_l) - \Psi_i^{-1}(w_{l-1}) \right)/2.$$
Because the intervals are uniformly dense, we may express each of them as a function of its center and radius as $c + r(2t - 1)$ for $0 \leq t \leq 1$. It is then possible to rewrite equation (4) as:

$$d_M^2(Y(i), Y(j)) := \sum_{l=1}^{m} \pi_l \int_{0}^{1} \left[ (c_{li} + r_{li}(2t - 1)) - (c_{lj} + r_{lj}(2t - 1)) \right]^2 dt$$

$$d_M^2(Y(i), Y(j)) := \sum_{l=1}^{m} \pi_l \left[ (c_{li} - c_{lj})^2 + \frac{1}{3} (r_{li} - r_{lj})^2 \right]. \qquad (5)$$
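Equation (5) suggests a simple exact algorithm: merge the cumulative weights of the two histograms, match the resulting quantile intervals, and accumulate the closed-form terms. A possible sketch, reusing the hypothetical helpers above:

```python
def squared_mallows_distance(hist_i, hist_j):
    """Exact computation of d_M^2 in equation (5) for two histograms."""
    # 1. Merge and sort the cumulative weights of both histograms, dropping repetitions.
    #    (In practice, near-equal floating-point weights may need a tolerance.)
    w = sorted(set(cumulative_weights(hist_i)) | set(cumulative_weights(hist_j)))
    d2 = 0.0
    for l in range(1, len(w)):
        pi_l = w[l] - w[l - 1]
        # 2. Bounds of the two uniformly dense intervals matched to (w_{l-1}, w_l).
        ai, bi = quantile(hist_i, w[l - 1]), quantile(hist_i, w[l])
        aj, bj = quantile(hist_j, w[l - 1]), quantile(hist_j, w[l])
        # 3. Centers and radii of the matched intervals.
        ci, ri = (ai + bi) / 2.0, (bi - ai) / 2.0
        cj, rj = (aj + bj) / 2.0, (bj - aj) / 2.0
        # 4. Closed-form contribution of equation (5).
        d2 += pi_l * ((ci - cj) ** 2 + (ri - rj) ** 2 / 3.0)
    return d2
```

For example, `squared_mallows_distance([(0, 10, 0.5), (10, 20, 0.5)], [(5, 15, 1.0)])` compares two histograms defined on different bins, which is precisely the situation the distance is designed for.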
It is easy to show an extension of the proposed distance to the multivariate case, i.e. where $Y(i) = \{Y_1(i), \ldots, Y_p(i)\}$. Under the hypothesis that the p variables are independent, it is possible to express the multivariate version of $d_M^2(Y(i), Y(j))$ as follows:

$$d_M^2(Y(i), Y(j)) := \sum_{k=1}^{p} \sum_{l=1}^{m_k} \pi_l^{(k)} \left[ \left( c_{li}^{(k)} - c_{lj}^{(k)} \right)^2 + \frac{1}{3} \left( r_{li}^{(k)} - r_{lj}^{(k)} \right)^2 \right]. \qquad (6)$$
Equation (6) shows that $d_M$ is a Euclidean distance. Since the Huygens theorem is applicable, the dynamic clustering algorithm converges.
3.1 Using dM for the dynamic clustering algorithm
Given a set of n histogram data, it is possible to define its “barycenter”, which is a histogram itself (the so-called prototype). According to the optimization of a representation criterion, the prototypal histogram Y(b) can be computed by minimizing the following sum-of-distances function:
$$f(Y(b) \mid Y(1), \ldots, Y(n)) = f(c_{1b}, r_{1b}, \ldots, c_{mb}, r_{mb}) = \sum_{i=1}^{n} d_M^2(Y(i), Y(b)) = \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_j \left[ (c_{ji} - c_{jb})^2 + \frac{1}{3} (r_{ji} - r_{jb})^2 \right].$$
Once m (and hence also the vector of weights $\pi$) is fixed equal to the cardinality of the elementary intervals of the union of the supports of the Y(i)'s, the support of Y(b) can be expressed as a vector of m pairs $(c_{jb}, r_{jb})$. This function reaches a minimum when the following first order conditions are satisfied:
$$\begin{cases} \dfrac{\partial f}{\partial c_{jb}} = -2 \pi_j \displaystyle\sum_{i=1}^{n} (c_{ji} - c_{jb}) = 0 \\[2ex] \dfrac{\partial f}{\partial r_{jb}} = -\dfrac{2}{3} \pi_j \displaystyle\sum_{i=1}^{n} (r_{ji} - r_{jb}) = 0 \end{cases}$$
for each j = 1, . . . , m. It is easy to prove that this function attains its minimum when:

$$c_{jb} = n^{-1} \sum_{i=1}^{n} c_{ji}, \qquad r_{jb} = n^{-1} \sum_{i=1}^{n} r_{ji}.$$
The barycenter (prototype) of the n histogram data is then the following:
$$Y(b) = \{ ([c_{1b} - r_{1b}; c_{1b} + r_{1b}], \pi_1); \ldots; ([c_{jb} - r_{jb}; c_{jb} + r_{jb}], \pi_j); \ldots; ([c_{mb} - r_{mb}; c_{mb} + r_{mb}], \pi_m) \}. \qquad (7)$$
The identification of a barycenter permits us to show a second property of the
proposed distance. We prove that it is possible to express a measure of inertia of
data using d2M . The total inertia with respect to the barycenter Y(b) of a set of n
histogram data is given by the following quantity:
$$TI = \sum_{i=1}^{n} d_M^2(Y(i), Y(b)).$$
The total inertia can be decomposed into within- and between-cluster inertia, according to the Huygens theorem. Let us consider a partition of E into K clusters. For each cluster $C_k$, k = 1, . . . , K, a histogram barycenter denoted by $Y(b_k)$ is computed by a local optimization of the sum-of-distances function above. Minimizing the following function:
$$f(Y(b) \mid Y(b_1), \ldots, Y(b_K)) = f(c_{1b_1}, r_{1b_1}, \ldots, c_{mb_K}, r_{mb_K}) = \sum_{k=1}^{K} \frac{|C_k|}{n} d_M^2(Y(b_k), Y(b)) = \sum_{k=1}^{K} \frac{|C_k|}{n} \sum_{j=1}^{m} \pi_j \left[ (c_{jb_k} - c_{jb})^2 + \frac{1}{3} (r_{jb_k} - r_{jb})^2 \right], \qquad (8)$$
where $|C_k|$ is the cardinality of cluster $C_k$, it is possible to prove that the barycenter problem above and (8) have the same solution for Y(b). This result permits us to obtain a decomposition of the total inertia³ as follows:
$$TI = WI + BI = \sum_{k=1}^{K} \sum_{i \in C_k} d_M^2(Y(i), Y(b_k)) + \sum_{k=1}^{K} |C_k|\, d_M^2(Y(b_k), Y(b)). \qquad (9)$$
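The decomposition (9) can be checked numerically with a small sketch built on the previous hypothetical helpers; up to floating-point error, TI equals WI + BI for any partition:

```python
def inertia_decomposition(clusters):
    """`clusters` is a list of lists of histograms (one inner list per cluster C_k)."""
    all_hists = [h for c in clusters for h in c]
    b = barycenter(all_hists)                       # global barycenter Y(b)
    TI = sum(squared_mallows_distance(h, b) for h in all_hists)
    WI = BI = 0.0
    for c in clusters:
        bk = barycenter(c)                          # cluster barycenter Y(b_k)
        WI += sum(squared_mallows_distance(h, bk) for h in c)
        BI += len(c) * squared_mallows_distance(bk, b)
    return TI, WI, BI
```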
4 Application
In this section, we present some results of an analysis performed on a climatic
dataset. The original dataset drd964x.tmpst.txt⁴ contains the sequential “Time
Biased Corrected” state climatic division monthly Average Temperatures recorded
in the 48 states of the US from 1895 to 2004 (Hawaii and Alaska are not present in
the dataset).
The analysis consisted of the following three steps:
1. We have represented the data as the distributions of the monthly temperatures of each of the 48 states.
2. We have performed 100 runs of the dynamic clustering algorithm for each k = 2, . . . , 10.
3. For each solution, in order to evaluate the quality of the results, we have computed a quality of partition index (QPI) as the proportion of between-cluster inertia over the total inertia: QPI = BI/TI.
Table 1 reports the elementary statistics for the 900 runs of the algorithm. To detect a good choice for the number of clusters, we computed the Calinski and Harabasz index [CH74], defined as:

$$CH(k) = \frac{BI(k)/(k-1)}{WI(k)/(n-k)},$$

where k denotes the number of clusters, and BI(k) and WI(k) denote the between- and within-cluster inertia. The optimal number of clusters is identified as the value of k that maximizes CH(k). The results suggest that an optimal choice for k is 5. In Fig. 1, the 48 states are grouped into 5 clusters.
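Both quality measures used here follow directly from the inertia decomposition; a minimal sketch, assuming the `inertia_decomposition` helper above:

```python
def quality_indices(clusters):
    """QPI = BI/TI and the Calinski-Harabasz index CH(k) for a partition into k >= 2 clusters."""
    TI, WI, BI = inertia_decomposition(clusters)
    n = sum(len(c) for c in clusters)   # total number of histograms
    k = len(clusters)                   # number of clusters
    qpi = BI / TI
    ch = (BI / (k - 1)) / (WI / (n - k))
    return qpi, ch
```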
³ For the sake of brevity, we do not report here the proof.
⁴ Freely available at the National Climatic Data Center website of the US: http://www1.ncdc.noaa.gov/pub/data/cirs/
Table 1. Dynamic clustering main results: the first five columns describe elementary statistics of the QPI, the last one the best Calinski-Harabasz index

  k   QPI min   QPI max   QPI mean   QPI median   QPI std dev   Best CH
  2   0.6482    0.6555    0.6527     0.6527       0.0014        44.07
  3   0.8069    0.8190    0.8125     0.8124       0.0029        70.94
  4   0.8449    0.8666    0.8528     0.8522       0.0044        72.02
  5   0.8494    0.8924    0.8663     0.8646       0.0080        77.65
  6   0.8665    0.9086    0.8847     0.8811       0.0115        75.05
  7   0.8577    0.9144    0.9000     0.9044       0.0126        69.63
  8   0.8835    0.9233    0.9121     0.9135       0.0070        70.70
  9   0.9028    0.9350    0.9174     0.9168       0.0048        64.05
 10   0.8917    0.9360    0.9208     0.9202       0.0061        60.81
Fig. 1. The 48 states grouped into 5 clusters
5 Conclusions
In this paper we have presented a new distance for clustering data represented by histograms. The main interesting property of the proposed distance is that it can be computed even when the two distributions have different bins, whereas commonly used distances or dissimilarities, such as the L2 norm or the Kullback-Leibler divergence, compare two distributions defined on identical bins. This means, for example, that it is possible to compare two descriptions of the same phenomenon measured by two sensors that use different scales (a US meter scaled in inches versus an Italian one scaled in centimeters). Kohonen's SOM algorithm can also be used, because $d_M$ is a Euclidean distance and thus differentiable. The same distance has also been proposed by [IV05] in the context of interval data analysis.
References
[CH74] Calinski, T. and Harabasz, J. (1974): A dendrite method for cluster analysis, Communications in Statistics, 3, 1–27.
[CDL03] Chavent, M., De Carvalho, F.A.T., Lechevallier, Y., and Verde, R. (2003): Trois nouvelles méthodes de classification automatique des données symboliques de type intervalle, Revue de Statistique Appliquée, LI, 4, 5–29.
[DS76] Diday, E. and Simon, J.C. (1976): Clustering analysis. In: Fu, K.S. (Ed.), Digital Pattern Recognition, 47–94, Springer Verlag, Heidelberg.
[Did71] Diday, E. (1971): La méthode des nuées dynamiques, Revue de Statistique Appliquée, 19, 2, 19–34.
[GS02] Gibbs, A.L. and Su, F.E. (2002): On choosing and bounding probability
metrics, International Statistical Review, 70, 419.
[IV05] Irpino, A. and Verde, R. (2005): A new distance for symbolic data clustering, CLADAG 2005, Book of Short Papers, MUP, 393–396.
[Mal72] Mallows, C.L. (1972): A note on asymptotic joint normality, Annals of Mathematical Statistics, 43(2), 508–515.