High dimensional data clustering
Charles Bouveyron (1,2), Stéphane Girard (1), and Cordelia Schmid (2)
(1) LMC-IMAG, BP 53, Université Grenoble 1, 38041 Grenoble Cedex 9, France
    charles.bouveyron@imag.fr, stephane.girard@imag.fr
(2) INRIA Rhône-Alpes, Projet Lear, 655 av. de l'Europe, 38334 Saint-Ismier Cedex, France
    cordelia.schmid@inrialpes.fr
Summary. Clustering in high-dimensional spaces is a recurrent problem in many
domains, for example in object recognition. High-dimensional data usually live in
different low-dimensional subspaces hidden in the original space. This paper presents
a clustering approach which estimates the specific subspace and the intrinsic dimension of each class. Our approach adapts the Gaussian mixture model framework to
high-dimensional data and estimates the parameters which best fit the data. We obtain a robust clustering method called High-Dimensional Data Clustering (HDDC).
We apply HDDC to locate objects in natural images in a probabilistic framework.
Experiments on a recently proposed database demonstrate the effectiveness of our
clustering method for category localization.
Key words: Model-based clustering, high-dimensional data, dimension reduction, parsimonious models
1 Introduction
In many scientific domains, the measured observations are high-dimensional. For example, visual descriptors used in object recognition are often high-dimensional and
this penalizes classification methods and consequently recognition. Popular clustering methods are based on the Gaussian mixture model and show a disappointing
behavior when the size of the training dataset is too small compared to the number
of parameters to estimate. To avoid overfitting, it is therefore necessary to find a
balance between the number of parameters to estimate and the generality of the
model. In this paper we propose a Gaussian mixture model which determines the
specific subspace in which each class is located and therefore limits the number of
parameters to estimate. The Expectation-Maximization (EM) algorithm [DLR77] is
used for parameter estimation and the intrinsic dimension of each class is determined
automatically with the scree test of Cattell. This allows us to derive a robust clustering method in high-dimensional spaces, called High-Dimensional Data Clustering
(HDDC). In order to further limit the number of parameters, it is possible to make
additional assumptions on the model. We can for example assume that classes are
spherical in their subspaces or fix some parameters to be common between classes.
We evaluate HDDC on a recently proposed visual recognition dataset [DDQ06]. We
compare HDDC to standard clustering methods and to state-of-the-art results.
We show that our approach outperforms existing results for object localization.
This paper is organized as follows. Section 2 presents the state of the art on
clustering of high-dimensional data. In Section 3, we describe our parameterization
of the Gaussian mixture model. Section 4 presents our clustering method, i.e. the
estimation of the parameters and of the intrinsic dimensions. Experimental results
for our clustering method are given in Section 5.
2 Related work on high-dimensional clustering
Many methods use global dimensionality reduction and then apply a standard clustering method. Dimension reduction techniques are either based on feature extraction
or feature selection. Feature extraction builds new variables which carry a large part
of the global information. The best-known method is Principal Component Analysis
(PCA) which is a linear technique. Recently, many non-linear methods have been
proposed, such as Kernel PCA and non-linear PCA. In contrast, feature selection
finds an appropriate subset of the original variables to represent the data. Global
dimension reduction is often advantageous in terms of performance, but loses information which could be discriminant, i.e. clusters are often hidden in different
subspaces of the original feature space and a global approach cannot capture this.
It is also possible to use a parsimonious model [FR02] which reduces the number
of parameters to estimate. It is for example possible to fix some parameters to
be common between classes. These methods do not solve the problem of high dimensionality because clusters are usually hidden in different subspaces and many
dimensions are irrelevant. Recent methods determine the subspaces for each cluster.
Many subspace clustering methods use heuristic search techniques to find the subspaces. They are usually based on grid search methods and find dense clusterable
subspaces [PHL04]. The approach "mixtures of Probabilistic Principal Component Analyzers" [TB99] proposes a latent variable model and derives an EM-based method
to cluster high-dimensional data. Bocci et al. [BVV06] propose a similar method to cluster dissimilarity data. In this paper, we introduce a unified approach for class-specific subspace clustering which includes these two methods and allows additional
regularizations.
3 Gaussian mixture models for high-dimensional data
Clustering divides a given dataset $\{x_1, \ldots, x_n\}$ of $n$ data points into $k$ homogeneous groups. Popular clustering techniques use Gaussian Mixture Models (GMM), which assume that each class is represented by a Gaussian probability density. Data $\{x_1, \ldots, x_n\} \in \mathbb{R}^p$ are then modeled with the density $f(x, \theta) = \sum_{i=1}^{k} \pi_i \phi(x, \theta_i)$, where $\phi$ is a multivariate normal density with parameters $\theta_i = \{\mu_i, \Sigma_i\}$ and $\pi_i$ are mixing proportions. This model estimates full covariance matrices and therefore the number of parameters is very large in high dimensions. However, due to the empty space phenomenon we can assume that high-dimensional data live in subspaces with
a dimensionality lower than the dimensionality of the original space. We therefore propose to work in low-dimensional class-specific subspaces in order to adapt classification to high-dimensional data and to limit the number of parameters to estimate.

Fig. 1. The class-specific subspace Ei: the projection Pi(x) of an observation x on Ei, the distance d(x, Ei) between x and Ei, and the distance d(µi, Pi(x)) between µi and Pi(x).
3.1 The family of Gaussian mixture models
Recall that the class conditional densities are Gaussian N(µi, Σi) with means µi and covariance matrices Σi, i = 1, ..., k. Let Qi be the orthogonal matrix of eigenvectors of Σi; then $\Delta_i = Q_i^t \Sigma_i Q_i$ is a diagonal matrix containing the eigenvalues of Σi. We further assume that ∆i is divided into two blocks:
$$\Delta_i = \operatorname{diag}\big(\underbrace{a_{i1}, \ldots, a_{id_i}}_{d_i},\; \underbrace{b_i, \ldots, b_i}_{p - d_i}\big),$$
where $a_{ij} > b_i$, $\forall j = 1, \ldots, d_i$. The class-specific subspace $E_i$ is generated by the $d_i$ first eigenvectors corresponding to the eigenvalues $a_{ij}$, with $\mu_i \in E_i$. Outside this subspace, the variance is modeled by the single parameter $b_i$. Let $P_i(x) = \tilde{Q}_i \tilde{Q}_i^{t}(x - \mu_i) + \mu_i$ be the projection of $x$ on $E_i$, where $\tilde{Q}_i$ is made of the $d_i$ first columns of $Q_i$ supplemented by zeros. Figure 1 summarizes these notations.
In the following, the mixture model presented above is referred to as [aij bi Qi di]. By fixing some parameters to be common within or between classes,
we obtain a family of models which correspond to different regularizations. For example, if we fix the first di eigenvalues to be common within each class, we obtain
the more restricted model [ai bi Qi di ]. The model [ai bi Qi di ] is often robust and gives
satisfying results, i.e. the assumption that each matrix ∆i has only two different
eigenvalues is in many cases an efficient way to regularize the estimation of ∆i . In
this paper, we focus on the models [aij bi Qi di ], [aij bQi di ], [ai bi Qi di ], [ai bQi di ] and
[abQi di ].
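To make this parameterization concrete, the short sketch below (our own helper, assuming numpy; it is not code from the paper) rebuilds a class covariance matrix $\Sigma_i = Q_i \Delta_i Q_i^t$ from the parameters (aij, bi, Qi, di) of the model [aij bi Qi di]:

```python
import numpy as np

def covariance_from_parameters(Q, a, b, d):
    """Rebuild Sigma_i = Q_i Delta_i Q_i^t for the model [a_ij b_i Q_i d_i].
    Q: (p, p) orthogonal matrix of eigenvectors of Sigma_i,
    a: (d,) array of the eigenvalues a_ij, b: scalar noise variance b_i,
    d: intrinsic dimension d_i (hypothetical helper names)."""
    p = Q.shape[0]
    delta = np.concatenate([a, np.full(p - d, b)])   # diag(a_i1, ..., a_idi, b_i, ..., b_i)
    return Q @ np.diag(delta) @ Q.T
```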
3.2 The decision rule
Classification assigns an observation $x \in \mathbb{R}^p$ with unknown class membership to one of $k$ classes $C_1, \ldots, C_k$ known a priori. The optimal decision rule, called the Bayes decision rule, assigns the observation $x$ to the class with the maximum posterior probability $P(x \in C_i \mid x) = \pi_i \phi(x, \theta_i) / \sum_{l=1}^{k} \pi_l \phi(x, \theta_l)$. Maximizing the posterior probability is equivalent to minimizing $-2\log(\pi_i \phi(x, \theta_i))$. For the model [aij bi Qi di], this results in the decision rule $\delta^{+}$ which assigns $x$ to the class minimizing the following cost function $K_i(x)$:
$$K_i(x) = \|\mu_i - P_i(x)\|^2_{\Lambda_i} + \frac{1}{b_i}\|x - P_i(x)\|^2 + \sum_{j=1}^{d_i} \log(a_{ij}) + (p - d_i)\log(b_i) - 2\log(\pi_i),$$
where $\|\cdot\|_{\Lambda_i}$ is the Mahalanobis distance associated with the matrix $\Lambda_i = \tilde{Q}_i \Delta_i \tilde{Q}_i^{t}$. The posterior probability can therefore be rewritten as
$$P(x \in C_i \mid x) = 1 \Big/ \sum_{l=1}^{k} \exp\Big(\tfrac{1}{2}\big(K_i(x) - K_l(x)\big)\Big).$$
It measures the probability that $x$ belongs to $C_i$ and allows us to identify dubiously classified points.
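As an illustration, here is a minimal numpy sketch of the cost Ki(x) and of the resulting posterior probabilities; the function names are ours, and the Mahalanobis term is computed, under our reading of Λi, as the coordinates of x − µi in Ei weighted by 1/aij:

```python
import numpy as np

def cost_K(x, pi_i, mu, Q, a, b, d):
    """Cost K_i(x) of the rule delta+ for the model [a_ij b_i Q_i d_i]."""
    p = x.shape[0]
    y = Q[:, :d].T @ (x - mu)             # coordinates of x - mu_i in E_i
    proj = mu + Q[:, :d] @ y              # P_i(x), projection of x on E_i
    mahal = np.sum(y**2 / a)              # ||mu_i - P_i(x)||^2_{Lambda_i}
    resid = np.sum((x - proj)**2) / b     # ||x - P_i(x)||^2 / b_i
    return mahal + resid + np.sum(np.log(a)) + (p - d) * np.log(b) - 2.0 * np.log(pi_i)

def posterior_probabilities(x, classes):
    """P(x in C_i | x) = 1 / sum_l exp((K_i(x) - K_l(x)) / 2), via a stable softmax.
    classes: list of tuples (pi_i, mu_i, Q_i, a_i, b_i, d_i)."""
    K = np.array([cost_K(x, *c) for c in classes])
    w = np.exp(-(K - K.min()) / 2.0)      # shifting by K.min() cancels in the ratio
    return w / w.sum()
```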
We can observe that this new decision rule is mainly based on two distances: the
distance between the projection of x on Ei and the mean of the class; and the distance
between the observation and the subspace Ei . This rule assigns a new observation
to the class for which it is close to the subspace and for which its projection on the
class subspace is close to the mean of the class. If we consider the model [ai bi Qi di ],
the variances ai and bi balance the importance of both distances. For example, if the
data are very noisy, i.e. bi is large, it is natural to balance the distance kx − Pi (x)k2
by $1/b_i$ in order to take into account the large variance in $E_i^{\perp}$.
Note that the decision rule $\delta^{+}$ of our models uses only the projection on
Ei and we only have to estimate a di -dimensional subspace. Thus, our models are
significantly more parsimonious than the general GMM. For example, if we consider
100-dimensional data, made of 4 classes and with common intrinsic dimensions di
equal to 10, the model [ai bi Qi di ] requires the estimation of 4 015 parameters whereas
the full Gaussian mixture model estimates 20 303 parameters.
4 High Dimensional Data Clustering
In this section we derive the EM-based clustering framework for the model [aij bi Qi di] and its sub-models. In the following, the new clustering approach is referred to as High-Dimensional Data Clustering (HDDC). Due to lack of space, we do not present proofs of the following results; they can be found in [BGS06].
4.1 The clustering method HDDC
Unsupervised classification organizes data into homogeneous groups using only the observed values of the p explanatory variables. Usually, the parameters are estimated by the EM algorithm, which iteratively alternates E and M steps. With the parameterization presented in the previous section, the EM algorithm for estimating the parameters θ = {πi, µi, Σi, aij, bi, Qi, di} can be written as follows:
– E step: this step computes at iteration q the conditional posterior probabilities $t_{ij}^{(q)} = P(x_j \in C_i \mid x_j)$ according to the relation:
$$t_{ij}^{(q)} = 1 \Big/ \sum_{l=1}^{k} \exp\Big(\tfrac{1}{2}\big(K_i^{(q-1)}(x_j) - K_l^{(q-1)}(x_j)\big)\Big),$$
where $K_i$ is defined in Paragraph 3.2.
– M step: this step maximizes at iteration q the conditional likelihood. Proportions, means and covariance matrices of the mixture are estimated by:
$$\hat{\pi}_i^{(q)} = \frac{n_i^{(q)}}{n}, \qquad \hat{\mu}_i^{(q)} = \frac{1}{n_i^{(q)}} \sum_{j=1}^{n} t_{ij}^{(q)} x_j, \qquad n_i^{(q)} = \sum_{j=1}^{n} t_{ij}^{(q)},$$
$$\hat{\Sigma}_i^{(q)} = \frac{1}{n_i^{(q)}} \sum_{j=1}^{n} t_{ij}^{(q)} (x_j - \hat{\mu}_i^{(q)})(x_j - \hat{\mu}_i^{(q)})^t.$$
The estimation of HDDC parameters is detailed in the following subsection.
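A compact sketch of the two steps follows (our own function names, numpy assumed): the E step takes the cost matrix K[j, i] = Ki(xj) computed as in the sketch of Paragraph 3.2, and the M step returns the moment estimators above; the subspace parameters aij, bi and Qi are then obtained from Σ̂i as described in the next subsection.

```python
import numpy as np

def e_step(K):
    """E step: t_ij = 1 / sum_l exp((K_i(x_j) - K_l(x_j)) / 2) from the cost
    matrix K of shape (n, k), computed as a numerically stable softmax."""
    W = np.exp(-(K - K.min(axis=1, keepdims=True)) / 2.0)
    return W / W.sum(axis=1, keepdims=True)

def m_step_moments(X, T):
    """M step: proportions pi_i, means mu_i and empirical covariances Sigma_i
    from the data X (n, p) and the posterior probabilities T (n, k)."""
    n, _ = X.shape
    nk = T.sum(axis=0)                       # n_i
    pis = nk / n                             # pi_i
    mus = (T.T @ X) / nk[:, None]            # mu_i
    Sigmas = []
    for i in range(T.shape[1]):
        Xc = X - mus[i]
        Sigmas.append((T[:, i, None] * Xc).T @ Xc / nk[i])
    return pis, mus, Sigmas
```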
4.2 Estimation of HDDC parameters
Assuming for the moment that the parameters di are known and omitting the iteration index q for the sake of simplicity, we obtain the following closed-form estimators for the parameters of our models:
– Subspace Ei : the di first columns of Qi are estimated by the eigenvectors associated
with the di largest eigenvalues λij of Σ̂i .
– Model [aij bi Qi di]: the estimators of aij are the di largest eigenvalues λij of Σ̂i, and the estimator of bi is the mean of the (p − di) smallest eigenvalues of Σ̂i, which can be written as follows:
$$\hat{b}_i = \frac{1}{p - d_i}\left(\operatorname{Tr}(\hat{\Sigma}_i) - \sum_{j=1}^{d_i} \lambda_{ij}\right). \qquad (1)$$
– Model [ai bi Qi di]: the estimator of bi is given by (1) and the estimator of ai is the mean of the di largest eigenvalues of Σ̂i:
$$\hat{a}_i = \frac{1}{d_i} \sum_{j=1}^{d_i} \lambda_{ij}. \qquad (2)$$
– Model [ai b Qi di]: the estimator of ai is given by (2) and the estimator of b is:
$$\hat{b} = \frac{1}{np - \sum_{i=1}^{k} n_i d_i}\left(n \operatorname{Tr}(\hat{W}) - \sum_{i=1}^{k} n_i \sum_{j=1}^{d_i} \lambda_{ij}\right), \qquad (3)$$
where $\hat{W} = \sum_{i=1}^{k} \hat{\pi}_i \hat{\Sigma}_i$.
– Model [a b Qi di]: the estimator of b is given by (3) and the estimator of a is:
$$\hat{a} = \frac{1}{\sum_{i=1}^{k} n_i d_i} \sum_{i=1}^{k} n_i \sum_{j=1}^{d_i} \lambda_{ij}. \qquad (4)$$
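These closed-form estimators translate directly into code; the sketch below (our own names, numpy assumed) computes Qi, the aij and b̂i of Eq. (1) from Σ̂i, and the estimators (2)–(4) from the per-class covariances, proportions, sizes and intrinsic dimensions:

```python
import numpy as np

def spectral_parameters(Sigma, d):
    """Model [a_ij b_i Q_i d_i]: Q_i, a_ij (the d largest eigenvalues) and b_i, Eq. (1)."""
    p = Sigma.shape[0]
    eigval, eigvec = np.linalg.eigh(Sigma)
    order = np.argsort(eigval)[::-1]                 # descending eigenvalues
    lam, Q = eigval[order], eigvec[:, order]
    a = lam[:d]
    b = (np.trace(Sigma) - a.sum()) / (p - d)        # Eq. (1)
    return Q, a, b

def common_variance_estimators(Sigmas, pis, ns, ds):
    """Estimators (2), (3) and (4) for the sub-models [a_i b_i], [a_i b] and [a b]."""
    p = Sigmas[0].shape[0]
    lams = [np.sort(np.linalg.eigvalsh(S))[::-1] for S in Sigmas]
    a_i = np.array([lam[:d].mean() for lam, d in zip(lams, ds)])        # Eq. (2)
    W = sum(pi * S for pi, S in zip(pis, Sigmas))                       # W-hat
    top = sum(n_i * lam[:d].sum() for n_i, lam, d in zip(ns, lams, ds))
    n, D = sum(ns), sum(n_i * d for n_i, d in zip(ns, ds))
    b = (n * np.trace(W) - top) / (n * p - D)                           # Eq. (3)
    a = top / D                                                         # Eq. (4)
    return a_i, b, a
```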
4.3 Intrinsic dimension estimation
We also have to estimate the intrinsic dimension of each class. This is a difficult problem with no unique solution. Our approach is based on the eigenvalues of the class conditional covariance matrix Σ̂i of the class Ci. The jth eigenvalue of Σ̂i corresponds to the fraction of the full variance carried by the jth eigenvector of Σ̂i. We estimate the class-specific dimension di, i = 1, ..., k, with the scree test of Cattell [Cat66], an empirical method which analyzes the differences between successive eigenvalues in order to find a break in the scree. The selected dimension is the one for which the subsequent differences are smaller than a threshold. In our experiments, the threshold is chosen by cross-validation. We also compared with the probabilistic criterion BIC [Sch78], which gave very similar results.
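A minimal sketch of this dimension selection is given below; it is only one plausible reading of the scree test, and the normalization of the eigenvalue differences is our own choice:

```python
import numpy as np

def scree_test_dimension(eigenvalues, threshold):
    """Cattell's scree test: return the smallest dimension d such that all
    differences between subsequent eigenvalues fall below a threshold.
    eigenvalues: sorted in decreasing order; threshold: chosen by cross-validation."""
    lam = np.asarray(eigenvalues, dtype=float)
    diffs = -np.diff(lam)                    # lambda_j - lambda_{j+1} >= 0
    scaled = diffs / diffs.max()             # assumption: compare normalized differences
    for d in range(1, lam.size):
        if np.all(scaled[d:] < threshold):
            return d
    return lam.size - 1
```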
5 Experimental results
In this section, we use our clustering method HDDC to recognize and locate objects in natural images. Object category recognition is one of the most challenging
problems in computer vision. Recent methods use local image descriptors which are robust to occlusion, clutter and geometric transformations. Many of these approaches form clusters of local descriptors as an initial step; in most cases clustering
is achieved with k-means, diagonal or spherical GMM and EM estimation – with or
without PCA to reduce the dimension. Dorko and Schmid [DS04] select discriminant clusters based on the likelihood ratio and use the most discriminative ones for
recognition. Bag-of-keypoint methods [ZML05] represent an image by a histogram
of cluster labels and learn a Support Vector Machine classifier.
5.1 Protocol and data
We use an approach similar to Dorko and Schmid [DS04]. Local descriptors of dimension 128 are extracted from the training images (see [DS04] for details) and then organized into k groups by a clustering method (k = 200 in our experiments). We then compute the discriminative capacity of the class Ci for a given object category O through the posterior probability $R_i = P(C_i \in O \mid C_i)$. This probability is estimated by $R_i = \big[(\Psi^t \Psi)^{-1} \Psi^t \Phi\big]_i$, where $\Phi_j = P(x_j \in O \mid x_j)$ and $\Psi_{jl} = P(x_j \in C_l \mid x_j)$.
i
Learning can be either supervised or weakly supervised. In the supervised framework, the objects are segmented using bounding boxes and only the descriptors
located inside the bounding boxes are labeled as positive in the learning step. In the
weakly-supervised scenario, the object are not segmented and all descriptors from
images containing the object are labeled as positive. Note that in this case many
descriptors from the background are labeled as positive. In both cases, we consider
that P (xj ∈ O|xj ) = 1 if xj is positive and P (xj ∈ O|xj ) = 0 otherwise. For each
descriptor of a test image, the probability
that this point belongs to the object O
P
is then given by P (xj ∈ O|xj ) = ki=1 Ri P (xj ∈ Ci |xj ) where the posterior probability P (xj ∈ Ci |xj ) is obtained by the decision rule associated to the clustering
method (see Paragraph 3.2 for HDDC).
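The two estimation formulas of this protocol are short enough to sketch in numpy (function names are ours; a least-squares solver is used rather than an explicit matrix inverse):

```python
import numpy as np

def class_relevance(Psi, Phi):
    """R = (Psi^t Psi)^{-1} Psi^t Phi; R_i is its i-th component, the estimate
    of P(C_i in O | C_i). Psi[j, l] = P(x_j in C_l | x_j); Phi[j] = P(x_j in O | x_j)."""
    R, *_ = np.linalg.lstsq(Psi, Phi, rcond=None)
    return R

def object_probability(posteriors, R):
    """P(x_j in O | x_j) = sum_i R_i P(x_j in C_i | x_j) for each test descriptor.
    posteriors: (m, k) matrix of P(x_j in C_i | x_j) for the m descriptors."""
    return posteriors @ R
```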
We compare the HDDC clustering method to the following classical clustering methods: diagonal Gaussian mixture model, spherical Gaussian mixture model, and data reduction with PCA combined with a diagonal Gaussian mixture model. The diagonal GMM has a covariance matrix defined by Σi = diag(σi1, ..., σip) and the spherical GMM is characterized by Σi = σi Id. For all the models, the parameters were estimated via the EM algorithm. The EM estimation used the same k-means based initialization for both HDDC and the classical methods.

Learning     |           HDDC [* * Qi di]           |            GMM                  | Pascal
             | [aij bi]  [aij b]  [ai bi]  [ai b]   | PCA+diag.  Diagonal  Spherical  | Best of [DDQ06]
Supervised   |  0.172     0.181    0.183    0.175   |   0.177     0.161     0.150     |  0.112
Weakly-sup.  |  0.145     0.147    0.142    0.148   |   0.120     0.110     0.106     |    /

Table 1. Object localization on the database Pascal test2: mean of the average precision on the four object categories. Best results are highlighted.
The object category database used in our experiments is the Pascal
dataset [DDQ06] which contains four categories: motorbikes, bicycles, people and
cars. There are 684 training images and two test sets: test1 and test2. We evaluate
our method on the set test2, which is the more difficult of the two test sets and
contains 956 images. There are on average 250 descriptors per image. From a computational point of view, the localization step is very fast. For the learning step,
computing time mainly depends on the number of groups k and is on average equal to 2 hours on a recent computer. To locate an object in a test image, we compute for each descriptor the probability that it belongs to the object. We then predict the bounding box based on the arithmetic mean and the standard deviation of the descriptors. In order to compare our results with those of the Pascal Challenge [DDQ06], we used its evaluation criterion "average precision", which is the area under the precision-recall curve computed for the predicted bounding boxes (see [DDQ06] for further details).
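The paper does not detail how the mean and standard deviation of the descriptors are turned into a box; the sketch below is only one plausible reading, and the probability threshold and box width are our own assumptions:

```python
import numpy as np

def predict_bounding_box(positions, obj_probs, prob_min=0.5, width=2.0):
    """Predict a box from descriptors likely to belong to the object: keep those
    with P(x_j in O | x_j) above prob_min, then take mean +/- width * std of their
    image positions. positions: (m, 2) array of (x, y) descriptor locations."""
    kept = positions[obj_probs > prob_min]
    center, spread = kept.mean(axis=0), kept.std(axis=0)
    lower, upper = center - width * spread, center + width * spread
    return (lower[0], lower[1], upper[0], upper[1])   # (x_min, y_min, x_max, y_max)
```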
5.2 Object localization results
Table 1 presents localization results for the dataset Pascal test2 with supervised
and weakly-supervised training. First of all, we observe that HDDC performs better
than standard GMM within the probabilistic framework described in Section 5.1 and
particularly in the weakly-supervised framework. This indicates that our clustering
method identifies relevant clusters for each object category. In addition, HDDC
provides better localization results than the state-of-the-art methods reported in the Pascal Challenge [DDQ06]. Note that the difference between the results obtained in the supervised and the weakly-supervised frameworks is small. This means that HDDC efficiently identifies discriminative clusters for each object category even with weak supervision. Weakly-supervised results are promising as they avoid time-consuming manual annotation. Figure 2 shows examples of object localization on
test images with the model [ai bi Qi di ] of HDDC and supervised training.
Acknowledgments
This work was supported by the French department of research through the ACI
Masse de données (Movistar project).
Fig. 2. Object localization on the database Pascal test2: predicted bounding boxes with HDDC are in red and true bounding boxes are in yellow. Panels: (a) car, (b) motorbike, (c) bicycle.
References
[BVV06] Bocci, L., Vicari, D., Vichi, M.: A mixture model for the classification of
three-way proximity data. Computational Statistics and Data Analysis, 50, 1625–
1654 (2006).
[BGS06] Bouveyron, C., Girard, S., Schmid, C.: High-Dimensional Data Clustering.
Technical Report 1083M, LMC-IMAG, Université J. Fourier Grenoble 1 (2006).
[Cat66] Cattell, R.: The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245–276 (1966).
[DDQ06] D’Alche Buc, F., Dagan, I., Quinonero, J.: The 2005 Pascal visual object
classes challenge. Proceedings of the first PASCAL Challenges Workshop, Springer
(2006).
[DLR77] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1–38
(1977).
[DS04] Dorko, G., Schmid, C.: Object class recognition using discriminative local
features. Technical Report 5497, INRIA (2004).
[FR02] Fraley, C., Raftery, A.: Model-based clustering, discriminant analysis and
density estimation. Journal of American Statistical Association, 97, 611–631
(2002).
[PHL04] Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional
data: a review. SIGKDD Explor. Newsl. 6, 90–105 (2004).
[Sch78] Schwarz, G.: Estimating the dimension of a model. Annals of Statistics, 6,
461–464 (1978).
[TB99] Tipping, M., Bishop, C.: Mixtures of probabilistic principal component analysers. Neural Computation, 443–482 (1999).
[ZML05] Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and
kernels for classification of texture and object categories. Technical Report 5737,
INRIA (2005).