A New technique for Similarity Searches in Image Databases using PCA Independent Study Submitted by: Venugopal Rajagopal 1. Introduction In recent years there had been significant growth in research into methods for organizing and retrieving similar images in the database. Some of the techniques include text labeling and Content Based Image Retrieval (CBIR). In text labeling each image is indexed by there keywords. The keywords are chosen in a way that best represents the image that is stored, so reducing the search to merely a keyword search which is done using some string matching techniques. One particular disadvantage of this approach is its supervisory nature. In CBIR images are retrieved automatically based on the color, texture and shape features of the image [2]. There are a lot of commercial software’s which implement CBIR techniques and its feasibility is still a hot research area topic. In this paper we will try to explain a technique for similarity search which is based on CBIR and uses Principal Component Analysis to store and retrieve images from the database. The rest of this paper is organized as follows: Section 2 gives a brief introduction for Image Similarity; Section 3 describes methodology behind our technique; Section 4 analyzes the results from the experiments; Section 5 summarizes our conclusions and future work. 2. Image Similarity The aim of the Image database system is to assist the human user to retrieve images. Features of the images (color, texture etc.,) are stored in order to facilitate searching, as the features representing each images is so huge some kind of feature reduction technique has to be used which can represent the same image in the low dimensional feature space. One most widely used technique is the Principal Component analysis which we will explain later as a separate subsection because it is the core to our technique. The Image database system takes in an image as the query extracts the features and computes the distance between the query and the stored images and returns those images that are “close” [1]. The Euclidean distance between points in a multidimensional feature space is often used. In short, the image features and distance function are selected in a way that the resultant distance is the measure of image similarity. Our technique uses Hausdorff distance function which will be explained in a separate sub section. 2.1 Principal Component Analysis (PCA) Principal component analysis is used for dimensionality reduction in image databases. The following explanation is adopted from [3]. PCA reduces the redundancy contained within the data by creating a new series of components in which the axes of the new coordinate systems point in the direction of decreasing variance. The mean of the original data is the origin of the transformed system with the transformed axes of each component mutually orthogonal. To begin the transformation, the covariance matrix , or the correlation matrix of the original data is found. Using the covariance matrix, the eigenvalues where , are obtained from , n is the total number of original images, and is an identity matrix. The eigenvalues are equal to the variance of each corresponding component image. The eigenvectors , define the axes of the components and are obtained from The principal components are then given as where is the digital number matrix of the original data, and transformation matrix given by is the PCA images thus generated are uncorrelated and ordered by decreasing variance. The correlation matrix of the transformed data is a diagonal matrix of which the elements are comprised of the eigenvalues. The transformed data points are linear combinations of their original data values weighted by the eigenvectors. Each column of the eigenvector matrix represents the weights applied to the pixels of the corresponding bands to build each respective principal component. The percent of the total variance in each of the components is given by The first component image has the maximum signal-to-noise ratio and the largest percentage of the total variance. Each subsequent component contains the maximum variance for any axes orthogonal to the previous component. 2.2 Hausdorff distance The following explanation of Hausdorff distance is adopted from [4]. Given two finite point sets A = {a1,…,ap} and B = {b1,…,bq}, the Hausdorff distance is defined as H(A, B) = max ( D(A,B), D(B,A) ) where and “d” is some underlying norm on the points of A and B (e.g., the L2 or Euclidean norm). The function D (A,B) is called the directed Hausdorff distance from A to B. It identifies the point a belonging to A that is farthest from any point of B, and measures the distance from a to its nearest neighbor in B (using some given norm). That is D (A,B) in effect ranks each point of A based on its distance to the nearest point of B, and then uses the largest ranked such point as the distance (the most mismatched point of A). Intuitively, if D (A,B) = d, then each point of A must be within distance d of some point of B, and there also is some point of A that is exactly distance d from the nearest point of B (the most mismatched point). The Hausdorff distance H (A,B), is the maximum of D (A,B) and D (B,A). Thus it measures the degree of mismatch between two sets, by measuring the distance of the point of A that is farthest from any point of B and vice versa. Intuitively, if the Hausdorff distance is d, then every point of A must be within a distance d of some point of B and vice versa. Thus the notion of resemblance encoded by this distance is that each member of A be near some member of B and vice versa. Though it is well known that Hausdorff distance is a metric over the set of all closed, bounded sets here in this paper we use this function to compute the distances between two finite points sets. 3. Methodology The idea is to extract the relevant features that best describes the image and map the original high dimensional feature space into a low dimensional feature space for storing. Given an image A of size H and W (height H and width W), we extract the luminosity component of the image and divide it into n sub-images each of size h and w (height h and width w) where h<H and w<W. We project each of the sub-images using PCA as explained in section 2.1 and extract the first three principal components of each of the sub-images. This results in the projection of each of the sub-images into a three dimensional feature space and thus reducing the complexity. For example, given an image of size 160 x 160 (height 160 and width 160) it is divided into sub-images of size 8 x 8 (height 8 and width 8) resulting in 400 sub images. Each sub-image projected into a three dimensional feature space using PCA and thus reduces the dimensionality of the whole image to 1200 from 25600 feature points. When a query image is given as the input its dimensionality is reduced in the same way as explained above and the distances between the features are computed using Hausdorff distance function as explained in section 2.2 and those images that are close are returned. 4. Results The experiments were run on a Pentium 4 2.4 GHz, 512 M RAM with Matlab as the software tool. Experiments were conducted by comparing images between themselves, by partitioning them into classes, instead of giving one image as a query input. a) f1 b) f2 c) f3 d) f4 e) f5 f) f6 g) f7 h) f8 Fig: 1 Figure 1 shows the images that were used for conducting the experiments. The image dataset was partitioned into three classes C1, C2 and C3. Class C1 contains images f1 and f2, Class C2 contains images f3 and f4 and Class C3 contains images f5, f6, f7 and f8. Each image was divided into sub-images of size 8 x 8, and first three principal components of each sub-images was retained. The Hausdorff distance matrix obtained after comparing the images is shown in Figure 2. f1 f1 f2 f3 f4 f5 f6 f7 f8 0 1.4046 6.5256 7.3577 8.4823 8.7002 7.5478 6.4462 f2 1.4046 0 6.433 7.2645 8.387 9.6207 7.5474 6.351 f3 6.5256 6.433 0 1.3262 2.0721 3.4855 1.6439 1.5292 f4 7.3577 7.2645 1.3262 0 1.5719 3.4214 2.1472 1.9252 f5 8.4823 8.387 2.0721 1.5719 0 2.589 1.1272 2.0361 f6 8.7002 9.6207 3.4855 3.4214 2.589 0 2.2812 3.2968 f7 7.5478 7.5474 1.6439 2.1472 1.1272 2.2812 0 1.2572 f8 6.4462 6.351 1.5292 1.9252 2.0361 3.2968 1.2572 0 Fig: 2 Based on the distance matrix from Figure 2, the results obtained were shown in Figure 3. The second closest to fig: 1 is fig 2 The second closest to fig: 2 is fig 1 The second closest to fig: 3 is fig 4 The second closest to fig: 4 is fig 3 The second closest to fig: 5 is fig 7 The second closest to fig: 6 is fig 7 The second closest to fig: 7 is fig 5 The second closest to fig: 8 is fig 7 Fig: 3 Based on the above results we can see that the misclassification rate is zero, no images belonging to a different class got misclassified. 5. Conclusions and Future work A new technique to find similar images from an image database is explained in this paper. The technique is fairly simple and straight forward, given an image it is divided into sub-images and projects each of the sub-images into a three dimensional feature space using PCA and stored as features in the database. Hausdorff distance function is used to compute the distances between the images in the reduced feature space for similarity searches. Experimental results illustrate the superiority of our technique. Due to limited time constraint we couldn’t conduct more detailed experiments. Future work would be to conduct experiments on a huge image database, using modified PCA and Hausdorff distance algorithms (complexity linear) and do a detailed analysis. References [1] D. Squire, T. Pun, “A Comparison of Human and Machine Assessments of Image Similarity for the Organization of Image Databases” [2] J. Eakins, M. Graham, “Content-based Image Retrieval: A report to the JISC Technology Applications Programme” [3] http://doppler.unl.edu/~bcorner/pca.html [4] D. P. Huttenlocher, G. A. Klanderman, W. J. Rucklidge, “Comparing Images Using the Hausdorff Distance”