High dimensional data clustering

Charles Bouveyron (1,2), Stéphane Girard (1), and Cordelia Schmid (2)

(1) LMC-IMAG, BP 53, Université Grenoble 1, 38041 Grenoble Cedex 9, France
    charles.bouveyron@imag.fr, stephane.girard@imag.fr
(2) INRIA Rhône-Alpes, Projet Lear, 655 av. de l'Europe, 38334 Saint-Ismier Cedex, France
    cordelia.schmid@inrialpes.fr

Summary. Clustering in high-dimensional spaces is a recurrent problem in many domains, for example in object recognition. High-dimensional data usually live in different low-dimensional subspaces hidden in the original space. This paper presents a clustering approach which estimates the specific subspace and the intrinsic dimension of each class. Our approach adapts the Gaussian mixture model framework to high-dimensional data and estimates the parameters which best fit the data. We obtain a robust clustering method called High-Dimensional Data Clustering (HDDC). We apply HDDC to locate objects in natural images in a probabilistic framework. Experiments on a recently proposed database demonstrate the effectiveness of our clustering method for category localization.

Key words: Model-based clustering, high-dimensional data, dimension reduction, parsimonious models

1 Introduction

In many scientific domains, the measured observations are high-dimensional. For example, visual descriptors used in object recognition are often high-dimensional, which penalizes classification methods and consequently recognition. Popular clustering methods are based on the Gaussian mixture model and perform poorly when the size of the training dataset is too small compared to the number of parameters to estimate. To avoid overfitting, it is therefore necessary to find a balance between the number of parameters to estimate and the generality of the model. In this paper we propose a Gaussian mixture model which determines the specific subspace in which each class is located and therefore limits the number of parameters to estimate. The Expectation-Maximization (EM) algorithm [DLR77] is used for parameter estimation, and the intrinsic dimension of each class is determined automatically with the scree test of Cattell. This allows us to derive a robust clustering method in high-dimensional spaces, called High-Dimensional Data Clustering (HDDC). In order to further limit the number of parameters, it is possible to make additional assumptions on the model. We can for example assume that classes are spherical in their subspaces or fix some parameters to be common between classes.

We evaluate HDDC on a recently proposed visual recognition dataset [DDQ06]. We compare HDDC to standard clustering methods and to state-of-the-art results. We show that our approach outperforms existing results for object localization.

This paper is organized as follows. Section 2 presents the state of the art on clustering of high-dimensional data. In Section 3, we describe our parameterization of the Gaussian mixture model. Section 4 presents our clustering method, i.e. the estimation of the parameters and of the intrinsic dimensions. Experimental results for our clustering method are given in Section 5.

2 Related work on high-dimensional clustering

Many methods use global dimensionality reduction and then apply a standard clustering method. Dimension reduction techniques are either based on feature extraction or feature selection.
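For illustration, such a global pipeline (one subspace for the whole dataset, followed by a standard mixture) can be sketched as follows. This is a minimal scikit-learn example on a generic data matrix with arbitrary dimensions and number of clusters; it is the baseline approach rather than the method proposed in this paper:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Illustrative data: n points in a p-dimensional space (values chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))

# Global dimension reduction: a single subspace for the whole dataset.
X_reduced = PCA(n_components=20).fit_transform(X)

# Standard Gaussian mixture clustering on the reduced data.
gmm = GaussianMixture(n_components=4, covariance_type="full").fit(X_reduced)
labels = gmm.predict(X_reduced)

Such a pipeline relies on a single global subspace, which is precisely the limitation discussed in the remainder of this section.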
Feature extraction builds new variables which carry a large part of the global information. The best-known method is Principal Component Analysis (PCA), which is a linear technique. Recently, many non-linear methods have been proposed, such as Kernel PCA and non-linear PCA. In contrast, feature selection finds an appropriate subset of the original variables to represent the data. Global dimension reduction is often advantageous in terms of performance, but loses information which could be discriminant, i.e. clusters are often hidden in different subspaces of the original feature space and a global approach cannot capture this.

It is also possible to use a parsimonious model [FR02] which reduces the number of parameters to estimate. It is for example possible to fix some parameters to be common between classes. These methods do not solve the problem of high dimensionality because clusters are usually hidden in different subspaces and many dimensions are irrelevant.

Recent methods determine the subspaces for each cluster. Many subspace clustering methods use heuristic search techniques to find the subspaces. They are usually based on grid search methods and find dense clusterable subspaces [PHL04]. The approach "mixtures of probabilistic principal component analyzers" [TB99] proposes a latent variable model and derives an EM-based method to cluster high-dimensional data. Bocci et al. [BVV06] propose a similar method to cluster dissimilarity data. In this paper, we introduce a unified approach for class-specific subspace clustering which includes these two methods and allows additional regularizations.

3 Gaussian mixture models for high-dimensional data

Clustering divides a given dataset {x1, ..., xn} of n data points into k homogeneous groups. Popular clustering techniques use Gaussian Mixture Models (GMM), which assume that each class is represented by a Gaussian probability density. Data {x1, ..., xn} ∈ Rp are then modeled with the density f(x, θ) = ∑_{i=1}^k πi φ(x, θi), where φ is a multivariate normal density with parameters θi = {µi, Σi} and the πi are mixing proportions. This model estimates full covariance matrices and therefore the number of parameters is very large in high dimensions. However, due to the empty space phenomenon we can assume that high-dimensional data live in subspaces with a dimensionality lower than the dimensionality of the original space. We therefore propose to work in low-dimensional class-specific subspaces in order to adapt classification to high-dimensional data and to limit the number of parameters to estimate.

Fig. 1. The class-specific subspace Ei.

3.1 The family of Gaussian mixture models

Recall that the class conditional densities are Gaussian N(µi, Σi) with means µi and covariance matrices Σi, i = 1, ..., k. Let Qi be the orthogonal matrix of eigenvectors of Σi; then ∆i = Qi^t Σi Qi is a diagonal matrix containing the eigenvalues of Σi. We further assume that ∆i is divided into two blocks:

∆i = diag(ai1, ..., aidi, bi, ..., bi),

where the first di diagonal entries are ai1, ..., aidi and the remaining (p − di) entries are all equal to bi, with aij > bi for all j = 1, ..., di. The class-specific subspace Ei is generated by the di first eigenvectors corresponding to the eigenvalues aij, with µi ∈ Ei. Outside this subspace, the variance is modeled by the single parameter bi.
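As an illustration of this parameterization, the covariance matrix of one class can be assembled as follows; this is a minimal NumPy sketch with arbitrary values for p, di, aij and bi, not code from the paper:

import numpy as np

p, d_i = 10, 3                       # ambient and intrinsic dimensions (arbitrary values)
a_ij = np.array([5.0, 3.0, 2.0])     # the d_i largest eigenvalues, each a_ij > b_i
b_i = 0.5                            # common variance outside the subspace E_i

# A random orthogonal matrix Q_i whose columns play the role of the eigenvectors of Sigma_i.
Q_i, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(p, p)))

# Delta_i = diag(a_i1, ..., a_id_i, b_i, ..., b_i)
Delta_i = np.diag(np.concatenate([a_ij, np.full(p - d_i, b_i)]))

# Class covariance matrix Sigma_i = Q_i Delta_i Q_i^t
Sigma_i = Q_i @ Delta_i @ Q_i.T

Only the di first columns of Qi, the aij, bi and di need to be stored, which is far fewer than the p(p + 1)/2 entries of a full covariance matrix.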
Let Pi(x) = Q̃i Q̃i^t (x − µi) + µi be the projection of x on Ei, where Q̃i is made of the di first columns of Qi supplemented by zero columns. Figure 1 summarizes these notations. The mixture model presented above will be referred to in the following as [aij bi Qi di]. By fixing some parameters to be common within or between classes, we obtain a family of models which correspond to different regularizations. For example, if we fix the first di eigenvalues to be common within each class, we obtain the more restricted model [ai bi Qi di]. The model [ai bi Qi di] is often robust and gives satisfying results, i.e. the assumption that each matrix ∆i has only two different eigenvalues is in many cases an efficient way to regularize the estimation of ∆i. In this paper, we focus on the models [aij bi Qi di], [aij b Qi di], [ai bi Qi di], [ai b Qi di] and [a b Qi di].

3.2 The decision rule

Classification assigns an observation x ∈ Rp with unknown class membership to one of k classes C1, ..., Ck known a priori. The optimal decision rule, called the Bayes decision rule, assigns the observation x to the class with maximum posterior probability P(x ∈ Ci|x) = πi φ(x, θi) / ∑_{l=1}^k πl φ(x, θl). Maximizing the posterior probability is equivalent to minimizing −2 log(πi φ(x, θi)). For the model [aij bi Qi di], this results in the decision rule δ+ which assigns x to the class minimizing the following cost function Ki(x):

Ki(x) = ||µi − Pi(x)||²_Λi + (1/bi) ||x − Pi(x)||² + ∑_{j=1}^{di} log(aij) + (p − di) log(bi) − 2 log(πi),

where ||.||_Λi is the Mahalanobis distance associated with the matrix Λi = Q̃i ∆i Q̃i^t. The posterior probability can therefore be rewritten as follows:

P(x ∈ Ci|x) = 1 / ∑_{l=1}^k exp( (Ki(x) − Kl(x)) / 2 ).

It measures the probability that x belongs to Ci and makes it possible to identify dubiously classified points. We can observe that this new decision rule is mainly based on two distances: the distance between the projection of x on Ei and the mean of the class, and the distance between the observation and the subspace Ei. This rule assigns a new observation to the class for which it is close to the subspace and for which its projection on the class subspace is close to the mean of the class. If we consider the model [ai bi Qi di], the variances ai and bi balance the importance of both distances. For example, if the data are very noisy, i.e. bi is large, it is natural to weight the distance ||x − Pi(x)||² by 1/bi in order to take into account the large variance in Ei⊥. Note that the decision rule δ+ of our models uses only the projection on Ei, so we only have to estimate a di-dimensional subspace. Thus, our models are significantly more parsimonious than the general GMM. For example, if we consider 100-dimensional data, made of 4 classes and with common intrinsic dimensions di equal to 10, the model [ai bi Qi di] requires the estimation of 4 015 parameters whereas the full Gaussian mixture model estimates 20 303 parameters.

4 High Dimensional Data Clustering

In this section we derive the EM-based clustering framework for the model [aij bi Qi di] and its sub-models. The new clustering approach is referred to in the following as High-Dimensional Data Clustering (HDDC). Due to lack of space, we do not present the proofs of the following results; they can be found in [BGS06].
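To make this decision rule concrete before turning to estimation, the cost function Ki and the associated posterior probabilities can be sketched as follows; this is a NumPy illustration with our own variable names, assuming the parameters of each class have already been estimated:

import numpy as np

def cost_K(x, mu, Q_tilde, a, b, d, pi):
    """Cost K_i(x) of the model [aij bi Qi di] for one class.
    Q_tilde: (p, d) matrix of the d first eigenvectors of Sigma_i,
    a: (d,) vector of eigenvalues a_ij, b: variance outside E_i, pi: mixing proportion."""
    p = x.shape[0]
    proj = Q_tilde @ (Q_tilde.T @ (x - mu)) + mu        # P_i(x): projection of x on E_i
    maha = np.sum((Q_tilde.T @ (proj - mu)) ** 2 / a)   # ||mu_i - P_i(x)||^2 in the metric Lambda_i
    resid = np.sum((x - proj) ** 2) / b                 # ||x - P_i(x)||^2 / b_i
    return maha + resid + np.sum(np.log(a)) + (p - d) * np.log(b) - 2.0 * np.log(pi)

def posteriors(x, classes):
    """P(x in C_i | x) = 1 / sum_l exp((K_i(x) - K_l(x)) / 2), for every class i.
    classes: list of parameter tuples (mu, Q_tilde, a, b, d, pi)."""
    K = np.array([cost_K(x, *c) for c in classes])
    return 1.0 / np.exp(0.5 * (K[:, None] - K[None, :])).sum(axis=1)

The class returned by δ+ is then simply the index minimizing Ki(x), which is also the index maximizing the posterior probability.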
4.1 The clustering method HDDC

Unsupervised classification organizes data in homogeneous groups using only the observed values of the p explanatory variables. Usually, the parameters are estimated by the EM algorithm, which iteratively repeats the E and M steps. If we use the parameterization presented in the previous section, the EM algorithm for estimating the parameters θ = {πi, µi, Σi, aij, bi, Qi, di} can be written as follows:

– E step: this step computes at iteration q the conditional posterior probabilities tij^(q) = P(xj ∈ Ci|xj) according to the relation:

tij^(q) = 1 / ∑_{l=1}^k exp( (Ki^(q−1)(xj) − Kl^(q−1)(xj)) / 2 ),

where Ki is defined in Paragraph 3.2.

– M step: this step maximizes at iteration q the conditional likelihood. Proportions, means and covariance matrices of the mixture are estimated by:

π̂i^(q) = ni^(q) / n,   µ̂i^(q) = (1 / ni^(q)) ∑_{j=1}^n tij^(q) xj,   with ni^(q) = ∑_{j=1}^n tij^(q),

Σ̂i^(q) = (1 / ni^(q)) ∑_{j=1}^n tij^(q) (xj − µ̂i^(q)) (xj − µ̂i^(q))^t.

The estimation of the HDDC parameters is detailed in the following subsection.

4.2 Estimation of HDDC parameters

Assuming for the moment that the parameters di are known and omitting the index q of the iteration for the sake of simplicity, we obtain the following closed form estimators for the parameters of our models:

– Subspace Ei: the di first columns of Qi are estimated by the eigenvectors associated with the di largest eigenvalues λij of Σ̂i.

– Model [aij bi Qi di]: the estimators of the aij are the di largest eigenvalues λij of Σ̂i and the estimator of bi is the mean of the (p − di) smallest eigenvalues of Σ̂i, which can be written as follows:

b̂i = (1 / (p − di)) ( Tr(Σ̂i) − ∑_{j=1}^{di} λij ).   (1)

– Model [ai bi Qi di]: the estimator of bi is given by (1) and the estimator of ai is the mean of the di largest eigenvalues of Σ̂i:

âi = (1 / di) ∑_{j=1}^{di} λij.   (2)

– Model [ai b Qi di]: the estimator of ai is given by (2) and the estimator of b is:

b̂ = (1 / (np − ∑_{i=1}^k ni di)) ( n Tr(Ŵ) − ∑_{i=1}^k ni ∑_{j=1}^{di} λij ),   (3)

where Ŵ = ∑_{i=1}^k π̂i Σ̂i.

– Model [a b Qi di]: the estimator of b is given by (3) and the estimator of a is:

â = (1 / ∑_{i=1}^k ni di) ∑_{i=1}^k ni ∑_{j=1}^{di} λij.   (4)

4.3 Intrinsic dimension estimation

We also have to estimate the intrinsic dimension of each class. This is a difficult problem for which there is no unique solution. Our approach is based on the eigenvalues of the class conditional covariance matrix Σ̂i of the class Ci. The jth eigenvalue of Σ̂i corresponds to the fraction of the full variance carried by the jth eigenvector of Σ̂i. We estimate the class-specific dimension di, i = 1, ..., k, with the empirical scree test of Cattell [Cat66], which analyzes the differences between consecutive eigenvalues in order to find a break in the scree. The selected dimension is the one for which the subsequent differences are smaller than a threshold. In our experiments, the threshold is chosen by cross-validation. We also compared this choice with the probabilistic criterion BIC [Sch78], which gave very similar results.
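A minimal sketch of this dimension selection (our own illustration of the scree-test idea, with our own function name; as stated above, the threshold is chosen by cross-validation in practice):

import numpy as np

def cattell_scree(eigenvalues, threshold):
    """Scree test of Cattell: select the dimension d such that all differences
    between consecutive eigenvalues located after the d-th one are below a threshold."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # eigenvalues in decreasing order
    diffs = -np.diff(lam)                                       # diffs[j] = lam[j] - lam[j+1] >= 0
    for d in range(1, lam.size):
        if np.all(diffs[d:] < threshold):                       # differences after the d-th eigenvalue
            return d
    return 1                                                    # degenerate case (a single eigenvalue)

For instance, with the eigenvalues (5.1, 4.7, 2.0, 0.3, 0.25, 0.22) of some Σ̂i and a threshold of 0.1, the break in the scree is detected after the third eigenvalue and the function returns di = 3.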
5 Experimental results

In this section, we use our clustering method HDDC to recognize and locate objects in natural images. Object category recognition is one of the most challenging problems in computer vision. Recent methods use local image descriptors which are robust to occlusion, clutter and geometric transformations. Many of these approaches form clusters of local descriptors as an initial step; in most cases clustering is achieved with k-means or with diagonal or spherical GMMs estimated by EM, with or without PCA to reduce the dimension. Dorko and Schmid [DS04] select discriminant clusters based on the likelihood ratio and use the most discriminative ones for recognition. Bag-of-keypoint methods [ZML05] represent an image by a histogram of cluster labels and learn a Support Vector Machine classifier.

5.1 Protocol and data

We use an approach similar to that of Dorko and Schmid [DS04]. Local descriptors of dimension 128 are extracted from the training images (see [DS04] for details) and are then organized into k groups by a clustering method (k = 200 in our experiments). We then compute the discriminative capacity of the class Ci for a given object category O through the posterior probability Ri = P(Ci ∈ O|Ci). This probability is estimated by Ri = [(Ψ^t Ψ)^{-1} Ψ^t Φ]_i, where Φj = P(xj ∈ O|xj) and Ψjl = P(xj ∈ Cl|xj).

Learning can be either supervised or weakly supervised. In the supervised framework, the objects are segmented using bounding boxes and only the descriptors located inside the bounding boxes are labeled as positive in the learning step. In the weakly-supervised scenario, the objects are not segmented and all descriptors from images containing the object are labeled as positive. Note that in this case many descriptors from the background are labeled as positive. In both cases, we consider that P(xj ∈ O|xj) = 1 if xj is positive and P(xj ∈ O|xj) = 0 otherwise. For each descriptor of a test image, the probability that this point belongs to the object O is then given by P(xj ∈ O|xj) = ∑_{i=1}^k Ri P(xj ∈ Ci|xj), where the posterior probability P(xj ∈ Ci|xj) is obtained by the decision rule associated with the clustering method (see Paragraph 3.2 for HDDC).

We compare the HDDC clustering method to the following classical clustering methods: diagonal Gaussian mixture model, spherical Gaussian mixture model, and data reduction with PCA combined with a diagonal Gaussian mixture model. The diagonal GMM has a covariance matrix defined by Σi = diag(σi1, ..., σip) and the spherical GMM is characterized by Σi = σi Id. For all the models the parameters were estimated via the EM algorithm. The EM estimation used the same initialization, based on k-means, for both HDDC and the classical methods.

                  HDDC [∗ ∗ Qi di]                       GMM                          Pascal
Learning      [aij bi]  [aij b]  [ai bi]  [ai b]   PCA+diag.  Diagonal  Spherical  Best of [DDQ06]
Supervised     0.172     0.181    0.183    0.142     0.177      0.161     0.150        0.112
Weakly-sup.    0.145     0.147    0.175    0.148     0.120      0.110     0.106          /

Table 1. Object localization on the database Pascal test2: mean of the average precision on the four object categories.

The object category database used in our experiments is the Pascal dataset [DDQ06], which contains four categories: motorbikes, bicycles, people and cars. There are 684 training images and two test sets: test1 and test2. We evaluate our method on the set test2, which is the more difficult of the two test sets and contains 956 images. There are on average 250 descriptors per image. From a computational point of view, the localization step is very fast. For the learning step, computing time mainly depends on the number of groups k and is equal on average to 2 hours on a recent computer. To locate an object in a test image, we compute for each descriptor the probability that it belongs to the object.
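A minimal sketch of this per-descriptor scoring (our own illustration with our own function names; the cluster posteriors and the relevances Ri are assumed to be computed as described above):

import numpy as np

def estimate_R(Psi, Phi):
    """Least-squares estimate R = (Psi^t Psi)^{-1} Psi^t Phi of the cluster relevances,
    with Psi[j, l] = P(x_j in C_l | x_j) and Phi[j] = P(x_j in O | x_j) on the training set."""
    R, *_ = np.linalg.lstsq(Psi, Phi, rcond=None)
    return R

def object_probability(posteriors, R):
    """P(x_j in O | x_j) = sum_i R_i P(x_j in C_i | x_j) for every test descriptor x_j.
    posteriors: (n_descriptors, k) matrix of P(x_j in C_i | x_j), R: (k,) vector."""
    return posteriors @ np.asarray(R)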
We then predict the bounding box based on the arithmetic mean and the standard deviation of the descriptors. In order to compare our results with those of the Pascal Challenge [DDQ06], we used its evaluation criterion "average precision", which is the area under the precision-recall curve computed for the predicted bounding boxes (see [DDQ06] for further details).

5.2 Object localization results

Table 1 presents localization results for the dataset Pascal test2 with supervised and weakly-supervised training. First of all, we observe that HDDC performs better than the standard GMMs within the probabilistic framework described in Section 5.1, particularly in the weakly-supervised framework. This indicates that our clustering method identifies relevant clusters for each object category. In addition, HDDC provides better localization results than the state-of-the-art methods reported in the Pascal Challenge [DDQ06]. Note that the difference between the results obtained in the supervised and in the weakly-supervised framework is not very large. This means that HDDC efficiently identifies discriminative clusters of each object category even with weak supervision. Weakly-supervised results are promising as they avoid time-consuming manual annotation. Figure 2 shows examples of object localization on test images with the model [ai bi Qi di] of HDDC and supervised training.

Fig. 2. Object localization on the database Pascal test2: predicted bounding boxes with HDDC are in red and true bounding boxes are in yellow. (a) car, (b) motorbike, (c) bicycle.

Acknowledgments

This work was supported by the French Ministry of Research through the ACI Masse de données (Movistar project).

References

[BVV06] Bocci, L., Vicari, D., Vichi, M.: A mixture model for the classification of three-way proximity data. Computational Statistics and Data Analysis, 50, 1625–1654 (2006).
[BGS06] Bouveyron, C., Girard, S., Schmid, C.: High-Dimensional Data Clustering. Technical Report 1083M, LMC-IMAG, Université J. Fourier Grenoble 1 (2006).
[Cat66] Cattell, R.: The scree test for the number of factors. Multivariate Behavioral Research, 1, 245–276 (1966).
[DDQ06] d'Alché-Buc, F., Dagan, I., Quiñonero, J.: The 2005 Pascal visual object classes challenge. Proceedings of the first PASCAL Challenges Workshop, Springer (2006).
[DLR77] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38 (1977).
[DS04] Dorko, G., Schmid, C.: Object class recognition using discriminative local features. Technical Report 5497, INRIA (2004).
[FR02] Fraley, C., Raftery, A.: Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association, 97, 611–631 (2002).
[PHL04] Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. SIGKDD Explorations Newsletter, 6, 90–105 (2004).
[Sch78] Schwarz, G.: Estimating the dimension of a model. Annals of Statistics, 6, 461–464 (1978).
[TB99] Tipping, M., Bishop, C.: Mixtures of probabilistic principal component analysers. Neural Computation, 11, 443–482 (1999).
[ZML05] Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories. Technical Report 5737, INRIA (2005).