An LSH-Based Compact Representation of Images for Similarity Comparison

Emily Huang
JIW, Fall 2003
Advisor: Moses Charikar

Abstract

Computations involving large and complex data sets are expensive when the data are represented in raw, uncompressed form. This work presents a compact representation for one such data type, images, that preserves a useful distance measure between objects of this type. The representation is based on a locality sensitive hashing (LSH) scheme applied to vector representations of the images; as a result, each image is associated with a 1000-bit signature. This representation is not a compression scheme, as no information about the image itself is preserved, but rather a signature on image data for the purpose of cheap similarity comparison. The method is then evaluated for its effectiveness in gauging image similarity in a pool of test images.

Introduction

Digital images are used widely in electronic media, but there is as yet no inexpensive method for similarity comparison. Image data is typically stored in a space-expensive form, and similarity search in a large pool can take an inordinate amount of time. A solution to this problem has applications in any field where quick comparisons of images are desirable. Database search (web search, for example) is the natural application, but others arise in computer vision. Image search is, of course, not the only application of the general methodology: any large or complex data type (audio or text, for example) can be similarly represented as a compact signature, and the potential applications are correspondingly more far-reaching.

In this methodology, images are represented as vectors, and the angle of separation between two representative vectors corresponds to the similarity between the images. Locality sensitive hashing is performed on the vectors by comparing them against a set of random hyperplanes in the space of the image vectors. Under this LSH scheme, the probability that two image vectors are split by (fall on opposite sides of) a randomly chosen hyperplane is θ/π, where θ is the angle between the vectors. Thus, by generating a large number of hyperplanes, we can accurately estimate, with high probability, the angle of separation between two image vectors.

"A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for two objects x, y, Pr_{h∈F}[h(x) = h(y)] = sim(x, y). Here sim(x, y) is some similarity function defined on the collection of objects." [1]

In the work presented, images are represented as RAW files. In such a file, each pixel of an image is represented in three descriptive channels, and the data is stored without compression. Raw images are converted to representative vectors in a high-dimensional space using average values of the channels. A set of 1000 random hyperplanes is generated, each represented by its normal vector. An image vector is determined to lie on either the "positive" or "negative" side of each hyperplane, and is assigned bit 1 or 0 for that hyperplane accordingly. Each image undergoes this process to be assigned a 1000-bit signature.
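To make the hyperplane scheme concrete, the following is a minimal sketch in Python with NumPy. The code and names are illustrative, not the paper's implementation: hyperplane normals are drawn from a standard normal distribution, each signature bit records the sign of a dot product, and the angle between two vectors is estimated from the fraction of differing bits.

```python
import numpy as np

def signature(vec, planes):
    """One bit per hyperplane: True if vec falls on the + side of the
    hyperplane (positive dot product with its normal vector)."""
    return planes @ vec > 0

rng = np.random.default_rng(0)
dim, n_bits = 128, 1000

# Normals of random hyperplanes through the origin. Drawing each normal
# from a standard normal distribution makes its direction uniform on the
# sphere, which is what gives Pr[u, v split] = theta/pi exactly.
planes = rng.standard_normal((n_bits, dim))

u = rng.standard_normal(dim)
v = rng.standard_normal(dim)

# The fraction of differing signature bits estimates theta/pi.
d = np.mean(signature(u, planes) != signature(v, planes))
theta = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(f"true angle: {theta:.3f} rad, LSH estimate: {d * np.pi:.3f} rad")
```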
Previous Work

A previous approach to the problem of determining similarity in image databases was introduced by Rubner, Tomasi, and Guibas [2]. The Earth Mover's Distance (EMD) has been shown to be an effective measure of similarity for colors and textures in images. Unlike the more natural (and less effective) method of representing images in multidimensional space and using norms to define the distance between two points, EMD calculates the amount of effort necessary to "move" from one distribution in a feature space to another (a feature space is a set of properties, such as coordinate values or average values). EMD is effective, but algorithms for performing similarity search using EMD are inefficient on large data sets; in particular, performing the similarity search requires manipulating the unwieldy data sets themselves. Indyk and Thaper [3] propose an embedding of EMD into Euclidean space combined with LSH, which improves the performance of EMD-based search by orders of magnitude. The distance computation supported by that method has running time linear in the size of the image.

Approach

LSH was first introduced by Indyk and Motwani [4] as a general technique for similarity search. This work builds on Charikar's LSH for vectors [5]: images are represented as high-dimensional vectors, and an LSH is used to obtain small signatures to represent them. This approach is an approximation of EMD, using average values across sectors of the image as the features upon which distance is calculated. The distance comparison supported by this method has running time linear in the size of the signature rather than the size of the image.

Because it approximates EMD, this approach is expected to produce signatures that are effective at identifying mathematically "close" images. In searching for the image most similar to a reference image in a large pool of images, a search on the signatures created by this approach should reliably return the images at closest distance to the reference with respect to the chosen features. This property does not mean that the returned images are necessarily what human perception would deem "closest," as perception may take additional features into account. These results are discussed, along with analysis, later.

Methodology

Raw Images

The image data used in this work were shots taken with a digital camera. The JPEG-format images were first converted to RAW format in CIE Lab color mode. (CIE is the Commission Internationale de l'Eclairage, or International Commission on Illumination; the current Lab standard was developed in 1976.) This format stores the pixel data as an interleaved stream of bytes: three bytes per pixel, one representing the value (0-255, inclusive) of each channel. The Lab color space was chosen over RGB because (1) it is device independent, and (2) it is designed so that the measurable distance between two points in the color space corresponds to the perceptual distance between the colors. A simple calculation suffices for the conversion from RGB; a sketch follows at the end of this subsection.

The Lab color space also captures some interesting features of perception. Of the three channels, Luminance, a (red-green), and b (blue-yellow), Luminance is the most important. An example of the dominance of the Luminance channel in human perception is given in Figures 1 and 2.

Figure 1. Clockwise, from top left: Lab, L, b, and a channels of an image for which Luminance dominance is evident. Note that the a and b channels, which are mostly gray, provide almost no information about the composite image.

Figure 2. Clockwise, from top left: Lab, L, b, and a channels of an image for which the a and b channels provide more information about the whole image, but L is still dominant.

Most of the information processed in interpreting an image stems from the Luminance channel, with auxiliary information from the color channels. It is important that an effective similarity measure capture this property. Furthermore, by weighting the Luminance channel more heavily in the resulting signature, we may obtain results that more closely mirror the eye's perception of "similar" images. This is reflected in the results of the methodology, discussed in a later section.
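As noted above, a simple calculation converts RGB to Lab. The sketch below shows the standard sRGB to CIE XYZ to Lab conversion for a single pixel; the sRGB gamma, transform matrix, and D65 white point are conventional values assumed here, and nothing in this sketch is taken from the paper's actual conversion pipeline (which may simply have used an image editor's mode conversion).

```python
import numpy as np

# sRGB -> XYZ (D65) linear transform; standard matrix values.
M = np.array([[0.4124, 0.3576, 0.1805],
              [0.2126, 0.7152, 0.0722],
              [0.0193, 0.1192, 0.9505]])
WHITE = np.array([0.9505, 1.0, 1.089])  # D65 reference white

def f(t):
    # CIE 1976 cube-root nonlinearity, with a linear toe for small t.
    return np.where(t > (6/29)**3, np.cbrt(t), t / (3 * (6/29)**2) + 4/29)

def rgb_to_lab(rgb_bytes):
    srgb = np.asarray(rgb_bytes) / 255.0
    # Undo the sRGB gamma to get linear light.
    lin = np.where(srgb > 0.04045, ((srgb + 0.055) / 1.055) ** 2.4, srgb / 12.92)
    x, y, z = f(M @ lin / WHITE)
    # L in [0, 100]; a and b are signed opponent-color coordinates.
    return 116 * y - 16, 500 * (x - y), 200 * (y - z)

print(rgb_to_lab([255, 0, 0]))  # pure red: roughly (53.2, 80.2, 67.2)
```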
Image Vectors

Once an image has been obtained as a stream of raw Lab bytes, it must be converted to a representative vector in a high-dimensional space. This is the stage at which the feature space of the images (the set of properties used for comparison) is selected. Abstractly, the features selected in this implementation are the average values of the image in the L, a, and b channels over recursively smaller sections of the image. The vector is constructed by concatenating vectors of the average values of the image over smaller and smaller divisions, as shown in Figure 3.

Figure 3. A vector is written by concatenating (v0), (v1, v2, v3, v4), (v5, v6, v7, ..., v19, v20), ..., where vi is the average value of the image in section i.

In this implementation, image vectors represent the average-value feature of an image down to 9 levels of recursion, which is pixel level in a 512 by 512 image. (512 x 512 was chosen as a reasonable size for web viewing.) To prevent the lowest levels of features (i.e., pixel level) from dominating the vector, coordinates on lower levels are weighted more lightly: all coordinates of the vector at each level are multiplied by a weighting factor. Level 0, the average of all values over the entire image, has a weighting factor of 1, and the weighting factor is halved at each successive level of recursion so that the total weight of each level always remains 1. (Each level k contains four times as many coordinates as level k-1, so halving the factor keeps each level's total squared weight, which is what matters for the angle computation, constant at 1.)
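A sketch of this construction follows (our code, with a hypothetical build_vector operating on one 512 x 512 channel stored as a NumPy array; the paper's implementation is not available). Each level of the recursion contributes block averages, scaled by a halving weight.

```python
import numpy as np

def build_vector(channel, levels=10):
    """Concatenate weighted block averages of a square power-of-two channel.

    Level 0 is the whole-image average; level k splits the image into
    4^k blocks. The weight halves per level, so each level contributes
    the same total squared weight to the vector.
    """
    n = channel.shape[0]               # e.g., 512; assumes a square image
    parts = []
    for level in range(levels):        # levels 0..9 for a 512x512 image
        blocks = 2 ** level            # blocks per side at this level
        size = n // blocks             # pixels per block side
        # Average each size x size block via a reshape.
        avg = channel.reshape(blocks, size, blocks, size).mean(axis=(1, 3))
        parts.append(avg.ravel() * 0.5 ** level)
    return np.concatenate(parts)

channel = np.random.default_rng(1).integers(0, 256, (512, 512)).astype(float)
vec = build_vector(channel)
print(vec.shape)  # (349525,) = 1 + 4 + 16 + ... + 4^9
```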
Image Signatures

A set of random hyperplanes is generated for comparison with the image vectors. The hyperplanes are stored as the vectors to which they are orthogonal (their normal vectors). Each hyperplane contributes one bit to the eventual image signature: a given image vector is assigned a bit, 1 or 0, depending on which "side" of the hyperplane it resides on in the bisected space. This polarity is calculated by taking the dot product of the image vector with the normal vector. If the dot product is positive, the image vector is assigned to the + side of the hyperplane and receives bit 1 for this comparison; if the dot product is negative, the image vector is assigned to the - side and receives bit 0. As a result, two image vectors on the + side of a hyperplane are both assigned bit 1 for that hyperplane, and likewise bit 0 on the - side. Thus, the probability that two image vectors will be on opposite sides of a randomly chosen hyperplane can be approximated by the fraction of bits in their signatures that differ.

Figure 4. Image vectors u and v have angle θ between them. u and v are on opposite "sides" of hyperplane a, but on the same "side" of hyperplane b.

To prevent memory insufficiency, all coordinates of the hyperplanes in the first 8 levels are stored in memory, while coordinates past 8 levels of recursion are generated on the fly from a common seed while the hyperplane is being compared to each image vector. This implementation decision keeps the time consumed in computing image signatures low without storing the very large full hyperplane vectors. Three signatures are calculated and stored for each image, one for each Lab channel.

Image Diff

To determine the signature-distance measure of the similarity of two images, the percentage of bits differing in their signatures, or diff, is counted. A separate diff is obtained for each channel (L, a, and b), and the weighted signature distance is calculated as the weighted average of the L, a, and b diffs.
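A sketch of the diff computation (our code; the 0.5/0.25/0.25 channel weights are those used in the experiments below): the per-channel diff is the fraction of differing signature bits, and the weighted distance averages the three channels. Multiplying a diff by π also recovers an estimate of the angle between the underlying image vectors.

```python
import numpy as np

def diff(sig_a, sig_b):
    """Fraction of differing bits between two boolean signatures."""
    return np.mean(sig_a != sig_b)

def weighted_distance(sigs_a, sigs_b, weights=(0.5, 0.25, 0.25)):
    """Weighted average of per-channel diffs; sigs_* map channel -> bits."""
    return sum(w * diff(sigs_a[c], sigs_b[c])
               for c, w in zip("Lab", weights))

rng = np.random.default_rng(2)
a = {c: rng.random(1000) < 0.5 for c in "Lab"}  # stand-in 1000-bit signatures
b = {c: rng.random(1000) < 0.5 for c in "Lab"}
print(f"weighted diff: {weighted_distance(a, b):.2%}")
```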
Results

Image "Similarity"

How well do the image signatures generated by this method provide a basis for judging image similarity? Quite well in certain cases; not well at all in others. From a pool of 500 images, several reference images were chosen, and signature distances to the entire pool were computed. We would like the images at the smallest signature distances to be the most similar images to the reference image in the pool. In some cases, this test gives meaningful results, as in Figure 5.

Figure 5. Reference image 1 and the top 10 matching images from the image pool, using 1000-bit signatures. The L channel has been weighted at 0.5, a at 0.25, and b at 0.25. Average diff of pool: 8.55%. The top 10 images represent diffs of 4.68-6.08%, with average diff 5.44%.

Note that the images in Figure 5 all follow a similar Luminance pattern and, where one exists, a similar hue pattern. Reference 1 has a strong Luminance pattern, and thus the top-matched results generally follow that pattern. Note also that the average diff of the pool in the L channel alone is 22.32%, while the average diffs of the pool in the a and b channels are 1.82% and 2.54%, respectively. In general, this pool exhibits the Luminance dominance discussed in the Methodology section.

In Figure 6, reference 2 does not have a strong Luminance pattern, but has a much stronger hue pattern in the red-green channel. When the channels are weighted the same as in Figure 5, the closest image displays the same hue pattern, and all others among the top nearest neighbors demonstrate at least an attempt at the same hue pattern.

Figure 6. Reference image 2 and the top 10 matching images from the image pool, using 1000-bit signatures. The L channel has been weighted at 0.5, a at 0.25, and b at 0.25. Average diff of pool: 10.30%. The top 10 images represent diffs of 6.55-8.50%, with average diff 7.92%.

Effect of Signature Length

Diffing reference image 1 from Figure 5 against the same pool of images with signature length 100 instead of 1000 produces the top ten matches shown in Figure 7. Signature length has a large effect on the reliability of diffing: each bit is an independent trial whose probability of differing is θ/π, so a longer signature makes the measured fraction of splitting hyperplanes concentrate more tightly around that probability (the standard error of the estimate falls as 1/sqrt(n) for an n-bit signature). Note that the intersection between the two sets of matches (Figures 5 and 7) is only 2 images. At signature length 500, the intersection grows to 8 images, with 100% of the intersection shuffled (i.e., returned in a different order than with 1000-bit signatures). At signature length 750, the intersection remains 8 images, with 75% of the intersection shuffled.

Figure 7. Reference image 1 and the top 10 matching images from the image pool, using 100-bit signatures. The L channel has been weighted at 0.5, a at 0.25, and b at 0.25. Average diff of pool: 12.13%. The top 10 images represent diffs of 7.0-11.0%, with average diff 9.69%.

Some images do not fare well under this representation. Figure 8 gives an example of an image for which the rest of the pool is too closely clustered around the average signature distance for diffing to be meaningful. In this case, the nearest images do not display any obviously similar patterns to the reference image. (Although the same two people, Aileen (left) and Wen (right), do appear in the same position in 3 of the top 10 images!)

Figure 8. Reference image 3 and the top 10 matching images from the image pool, using 1000-bit signatures. The L channel has been weighted at 0.5, a at 0.25, and b at 0.25. Average distance of pool: 13.4%. The eleventh image, bottom, is of interest because it is perceptually similar to the reference image but falls at a diff of 12.42% from the reference.

Discussion

The LSH-based representation of images is successful at determining image similarity for certain types of images relative to an image pool. It has potential for use in database searching, particularly in applications where the images have a long lifespan (so that signatures rarely need to be recalculated).

This methodology takes a number of "leaps of faith," or abstractions from the true measure of distance between images, to achieve compactness of representation. First, we are not sure whether average values of the Lab channels across sections of the image are the right set of features for differentiating between images. Second, exactly how well an image vector represents the features that distinguish an image from other images is unknown. Third, exactly how well the angle between two image vectors represents the actual distance between the corresponding images is unknown. The experiments on a pool of images in this work indicate that the approach is promising, and that all of the above-mentioned leaps were taken in good faith.

A number of follow-up experiments can be done. For example, since EMD is recognized as a "good" measure of image distance, comparing the effectiveness of this methodology against EMD will be an important test. The approach can perhaps also be improved by choosing a different set of features from which to build the image vector, and by thresholding, rather than just weighting, the different channels in the signature once it is obtained. In general, this work takes an entirely different approach to the notion of image similarity than, for example, an edge-detection method. The differences between this and other methods of image similarity search, along with the different factors that affect the eye's perception of image similarity, have already become apparent in the above work.

Conclusion

The LSH-based compact representation of images studied here was shown to be an effective measure of image similarity in certain situations, but quite meaningless (in terms of visual interpretation) in others. In particular, an image that is roughly uniformly distant from all other images in a pool does not produce meaningful signature distances, while an image whose distances to the rest of the pool are well distributed does. The method of signing images for easy search and comparison is promising, given the right choice of image features, signature length, and weighting method.
References

[1] M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. Proceedings of the 34th ACM Symposium on Theory of Computing (STOC), 2002.
[2] Y. Rubner, C. Tomasi, and L. Guibas. A Metric for Distributions with Applications to Image Databases. Proceedings of the IEEE International Conference on Computer Vision, 1998, pp. 59-66.
[3] P. Indyk and N. Thaper. Fast Image Retrieval via Embeddings.
[4] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Proceedings of the 30th ACM Symposium on Theory of Computing (STOC), 1998, pp. 604-613.
[5] M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. Proceedings of the 34th ACM Symposium on Theory of Computing (STOC), 2002.