An LSH-based Compact Representation of Images for Similarity Comparison
Emily Huang
JIW, Fall 2003
Advisor: Moses Charikar
Abstract
Computations involving large and complex data sets are expensive when those data sets are
represented in raw, uncompressed form. This work presents a representation for one such data
type, images, that is compact and preserves a useful distance measure between objects of the
type. The representation depends on a locality sensitive hash (LSH)1 applied to vector
representations of the image objects; as a result, each image is associated with a 1000-bit
signature. This representation is not a compression scheme, as no information about the image
itself is preserved, but rather a signature on image data for the purpose of cheap similarity
comparison. The method is then evaluated for its effectiveness in gauging image similarity in a
pool of test images.
Introduction
Digital images are used widely in electronic media, but there is as yet no inexpensive method
for comparing their similarity. Image data is typically stored in a space-expensive form, and
similarity search in a large pool can take an inordinate amount of time. A solution to this
problem has applications in any field where quick comparisons of these objects are desirable.
Database search (web search, for example) is the natural application, but others arise in
computer vision as well.
Image search is, of course, not the only application of the general methodology. Any large or
complex data type (audio or text, for example) can be similarly represented as a compact
signature, and the resulting range of potential applications is far more wide-reaching.
In this methodology, images are represented as vectors, and the angle of separation between two
representative vectors corresponds to the similarity between the images. Locality sensitive
hashing is performed on the vectors by comparing them to a random set of hyperplanes in the
space of the image vectors. Under this LSH scheme, the probability that any two image vectors
are split by (fall on opposite sides of) a given hyperplane is equal to θ/π, where θ is the angle
between the vectors. Thus, by generating a large number of hyperplanes, we can estimate the
angle of separation between two image vectors with high accuracy.
1 “A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of
objects, such that for two objects x, y,
Pr_{h ∈ F}[h(x) = h(y)] = sim(x, y).
Here sim(x, y) is some similarity function defined on the collection of objects.” [1]
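As a concrete illustration (not from the original implementation; all names here are illustrative), the following Python sketch generates random hyperplanes, computes sign-bit signatures for two vectors, and recovers an estimate of the angle between them from the fraction of differing bits:

```python
import numpy as np

rng = np.random.default_rng(0)

def signature(x, hyperplanes):
    # one bit per hyperplane: 1 if x lies on the "positive" side, else 0
    return (hyperplanes @ x > 0).astype(np.uint8)

dim, n_bits = 100, 1000
planes = rng.standard_normal((n_bits, dim))   # normal vectors of random hyperplanes

u = rng.standard_normal(dim)
v = u + 0.5 * rng.standard_normal(dim)        # a vector correlated with u

# fraction of differing bits estimates Pr[split] = theta / pi
frac_diff = np.mean(signature(u, planes) != signature(v, planes))
est_theta = np.pi * frac_diff

true_theta = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(f"estimated {est_theta:.3f} rad, true {true_theta:.3f} rad")
```

With 1000 hyperplanes the estimate typically lands within a few hundredths of a radian of the true angle, which is the property the signatures exploit.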
In the work presented, images are represented as RAW files. In such a file, each pixel of an
image is represented by three descriptive channels, and the data is stored without compression.
Raw images are converted to representative vectors in a high-dimensional space using average
values of the channels. A set of 1000 random hyperplanes is generated, each represented by its
normal vector. An image vector is determined to be on either the “positive” or “negative”
side of each hyperplane, and is assigned bit 1 or 0 for that hyperplane accordingly. Each image
undergoes this process to be assigned a 1000-bit signature.
Previous Work
A previous approach to the problem of determining similarity in image databases was introduced
by Rubner, Tomasi, and Guibas [2]. The Earth Mover’s Distance (EMD) has been shown to be an
effective measure of similarity for colors and textures in images. Unlike the more natural (and
less effective) method of representing images as points in a multidimensional space and using
norms to define the distance between two points, EMD calculates the amount of effort necessary
to “move” from one distribution in a feature space to another (a feature space is a set of measured
properties, such as coordinate values or average values). EMD is effective, but algorithms for
performing similarity search using EMD are inefficient on large data sets. In particular,
performing the similarity search requires dealing with the unwieldy, large data sets themselves. Indyk and Thaper
[3] propose an embedding of EMD into Euclidean space combined with LSH, which improves
the performance of EMD by orders of magnitude. The distance computation supported by that
method has running time linear in the size of the image.
Approach
LSH was first introduced by Indyk and Motwani [4] as a general technique for similarity search.
This work expands on the work on an LSH for vectors by Charikar [5]: specifically, it represents
images as high-dimensional vectors and uses an LSH to obtain small signatures to represent
them. This approach approximates the EMD, using average values across sections of the
image as the features upon which distance is calculated. The distance comparison supported by
this method has a running time that is linear in the size of the signature.
Because it approximates EMD, this approach is expected to produce signatures that are highly
effective at identifying mathematically “close” images. In searching for the image
most similar to a reference image in a large pool of images, a search on the signatures created
using this approach should reliably return the images at closest distance to the reference with
respect to the chosen features. This does not mean that the returned images are necessarily what
human perception would deem “closest,” as perception may take into account measurements on
additional features. These results are discussed and analyzed later.
Methodology
Raw Images
The image data used in this approach were shots taken with a digital camera. The JPEG-format
images were first converted to RAW format in CIE Lab color mode.2 This format stores the
pixel data as an interleaved stream of bytes: three bytes per pixel, one representing the value
(0-255, inclusive) of each channel. The Lab color space was chosen over RGB because 1) it is
device independent and 2) it is designed to have the property that the measurable distance
between two points in the color space corresponds to the perceptual distance between the colors.
A simple calculation suffices for the conversion.
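The paper does not spell the calculation out; a hedged sketch of the standard sRGB-to-CIE-Lab computation (assuming a D65 white point, with the three channels finally mapped into the 0-255 byte range of the RAW stream) might look like the following. The offset-128 byte encoding of a and b is an assumption, not taken from the paper.

```python
import numpy as np

# sRGB -> linear RGB -> XYZ (D65) matrix, then XYZ -> CIE 1976 Lab
M = np.array([[0.4124, 0.3576, 0.1805],
              [0.2126, 0.7152, 0.0722],
              [0.0193, 0.1192, 0.9505]])
WHITE = np.array([0.9505, 1.0000, 1.0890])   # D65 reference white

def srgb_to_lab_bytes(rgb):
    """rgb: floats in [0, 1]. Returns (L, a, b) as one byte per channel."""
    rgb = np.asarray(rgb, dtype=float)
    # undo the sRGB gamma curve
    lin = np.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)
    xyz = M @ lin / WHITE
    # CIE f(t): cube root above the linear-segment threshold
    f = np.where(xyz > (6 / 29) ** 3,
                 np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    L = 116 * f[1] - 16              # lightness, 0..100
    a = 500 * (f[0] - f[1])          # red-green axis
    b = 200 * (f[1] - f[2])          # blue-yellow axis
    # map into 0-255 bytes (a/b offset by 128: an assumed encoding)
    return (int(round(L * 255 / 100)), int(round(a + 128)), int(round(b + 128)))

print(srgb_to_lab_bytes([1.0, 0.0, 0.0]))   # a saturated red, roughly (136, 208, 195)
```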
The Lab color space also captures some interesting features of perception. Of the three channels,
Luminance, a (red-green), and b (blue-yellow), Luminance is most important. An example of the
dominance of the Luminance channel in human perception is given in Figures 1 and 2.
Figure 1. Clockwise, from top left: Lab, L, b, and a channels of an image for which Luminance
dominance is evident. Note that the a and b channels, which are mostly gray, provide almost no
information about the composite image.
2 CIE, or Commission Internationale de l’Eclairage, the International Commission on Illumination. The current
Lab standard was developed in 1976.
Figure 2. Clockwise, from top left: Lab, L, b, and a channels of an image for which the a and b
channels provide more information about the whole image, but L is still dominant.
Most of the information processed in interpreting an image stems from the Luminance channel,
with auxiliary information from the color channels. It is important that an effective similarity
measure capture this property. Furthermore, by weighting the Luminance channel more heavily
in the resulting signature, we may obtain results that more closely mirror the eye’s perception of
“similar” images. This will be reflected in the results of performing our methodology, and
discussed in a later section.
Image Vectors
Once an image has been obtained as a stream of raw Lab bytes, it must be converted to a
representative vector in a high-dimensional space. This is the stage in which the feature space
of the images, the set of properties used for comparison, is selected. Abstractly, the features
selected in this implementation are the average values of the image in the L, a, and b channels in
recursively smaller sections of the image. The vector is constructed from a concatenation of
vectors representing the average values of the whole image over smaller and smaller divisions, as
shown in Figure 3.
Figure 3. A vector is written by concatenating (v0), (v1, v2, v3, v4), (v5, v6, v7, … v19, v20), …
where vi is the average value of the image in section i.
In this implementation, image vectors represent the average-value feature of an image down to 9
levels of recursion, or pixel level in a 512 by 512 image.3 In order to prevent the lowest levels of
features (i.e., pixel level) from dominating the vector, coordinates on lower levels are weighted
more lightly: all coordinates of the vector at each level are multiplied by a weighting factor.
Level 0, the average of all values over the entire image, has a weighting factor of 1, and the
weighting factor for each successive level of recursion is divided by 2. Since level l contains 4^l
coordinates, halving the factor keeps the total squared weight of each level at 1.
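A minimal sketch of this construction for a single channel follows (illustrative names, assuming a 512 x 512 single-channel image and levels 0 through 9):

```python
import numpy as np

def image_vector(channel, levels=10):
    """Concatenate per-section averages for levels 0..levels-1.

    Level l splits the image into a 2^l x 2^l grid; each section's average
    is one coordinate, weighted by 2**-l.  Level l has 4^l coordinates, so
    each level contributes total squared weight 4^l * (2**-l)**2 = 1.
    """
    n = channel.shape[0]
    parts = []
    for level in range(levels):
        k = 2 ** level                       # sections per side
        s = n // k                           # pixels per section side
        # average over each s x s block of the image
        avg = channel.reshape(k, s, k, s).mean(axis=(1, 3))
        parts.append(avg.ravel() * 2.0 ** -level)
    return np.concatenate(parts)

img = np.random.default_rng(0).integers(0, 256, (512, 512)).astype(float)
v = image_vector(img)
print(v.shape)   # (349525,) = 1 + 4 + 16 + ... + 4^9 coordinates
```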
Image Signatures
A set of random hyperplanes is generated for comparison with the image vectors. Each
hyperplane is stored as the vector orthogonal to it (its normal).
Each of these hyperplanes contributes one bit to the eventual image signature. A given image
vector is assigned a bit, 1 or 0, depending on which “side” of the hyperplane it resides on in the
bisected space. This polarity is calculated by taking the dot product of the image vector with the
normal vector: if the dot product is positive, the image vector is assigned to the + side of the
hyperplane and receives a bit 1 for this comparison; if the dot product is negative, the image
vector is assigned to the − side of the hyperplane and receives a bit 0. The
result is that two image vectors on the same side of a hyperplane are assigned the same
bit for that hyperplane, and two vectors on opposite sides are assigned different bits. Thus, the
probability that two image vectors will be split by a randomly chosen hyperplane can be
approximated by the fraction of bits in their signatures that differ.
3 512 x 512 was chosen as a reasonable size for web viewing.
Figure 4. Image vectors u and v have angle θ between them. u and v are on opposite “sides” of
hyperplane a, but on the same “side” of hyperplane b.
To prevent memory insufficiency issues, all coordinates of the hyperplanes in the first 8 levels
are stored in memory, while coordinates past 8 levels of recursion are generated on the fly from a
common seed while the hyperplane is being compared to each image vector. This implementation
decision limits the time consumed in computing image signatures while keeping memory usage
manageable.
Three signatures are calculated and stored for each image, one for each Lab channel.
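A hedged sketch of the signature computation for one channel follows (illustrative names; regenerating each hyperplane deterministically from a seed stands in for the paper’s partial-storage scheme, trading time for memory exactly as described above):

```python
import numpy as np

N_BITS = 1000

def channel_signature(image_vec, seed=12345):
    """One signature bit per hyperplane: the sign of the dot product."""
    bits = np.empty(N_BITS, dtype=np.uint8)
    for i in range(N_BITS):
        # regenerate hyperplane i deterministically instead of storing it
        rng = np.random.default_rng((seed, i))
        normal = rng.standard_normal(image_vec.shape[0])
        bits[i] = 1 if normal @ image_vec > 0 else 0
    return np.packbits(bits)                 # 1000 bits -> 125 bytes

sig = channel_signature(np.random.default_rng(0).standard_normal(349525))
print(len(sig), "bytes")                     # 125 bytes per channel
```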
Image Diff
To determine the signature distance, the measure of the similarity of two images, the percentage
of bits differing in their signatures, or diff, is counted. A separate diff is obtained for each
channel (L, a, and b), and the weighted signature distance is calculated as the weighted average
of the L, a, and b diffs.
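For instance, a minimal sketch of this distance on packed signatures (using the channel weights from the experiments below, L = 0.5, a = 0.25, b = 0.25):

```python
import numpy as np

WEIGHTS = {"L": 0.5, "a": 0.25, "b": 0.25}

def diff(sig1, sig2):
    """Fraction of differing bits between two packed signatures."""
    x = np.bitwise_xor(sig1, sig2)
    return np.unpackbits(x).sum() / (8 * len(sig1))

def weighted_distance(sigs1, sigs2):
    """sigs1, sigs2: dicts mapping channel name -> packed signature."""
    return sum(w * diff(sigs1[c], sigs2[c]) for c, w in WEIGHTS.items())
```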
Results
Image “Similarity”
How well do the image signatures generated by this method provide a basis for judging image
similarity? Quite well in certain cases; not well at all in others. From a pool of 500 images,
several reference images were chosen, and signature distances to the entire pool were computed.
We would like the images with the smallest signature distances to be the most similar images to
the reference image in the pool. In some cases, this test gives meaningful results, as in Figure 5.
(reference 1)
Figure 5. A reference image and the top 10 matching images from the image pool, using 1000-bit
signatures. The L channel has been weighted at 0.5, a at 0.25, and b at 0.25. Average diff of pool:
8.55%. The top 10 images represent diffs of 4.68-6.08%, with average diff 5.44%.
Note that the images in Figure 5 all follow a similar Luminance pattern, and, when it exists, a
similar hue pattern. Reference 1 has a strong Luminance pattern, and thus the top-matched
results generally follow that pattern. Note, also, that the average diff of the pool in the L channel
alone is 22.32%, while the average diffs of the pool in the a and b channels are 1.82% and
2.54%, respectively. In general, this pool exhibits the Luminance dominance discussed in the
Methodology section.
In Figure 6, reference 2 does not have a strong Luminance pattern, but it has a much stronger hue
pattern in the red-green channel. When the channels are weighted the same as in Figure 5, the
closest image displays the same hue pattern, and all others among the top nearest neighbors
at least approximate the same hue pattern.
(reference 2)
Figure 6. A reference image and the top 10 matching images from the image pool, using 1000-bit
signatures. The L channel has been weighted at 0.5, a at 0.25, and b at 0.25. Average diff of pool:
10.30%. The top 10 images represent diffs of 6.55-8.50%, with average diff 7.92%.
Effect of Signature Length
Diffing reference image 1 from Figure 5 against the same pool of images with signature length
100 instead of 1000 produces the top ten matches shown in Figure 7. Signature length has a
large effect on the effectiveness of diffing images: a longer signature increases the number of
trials, so that the measured fraction of hyperplanes that split any two vectors more closely
reflects the probability that the vectors will be split by a random hyperplane. Note that the
intersection between the two sets of matches (Figures 5 and 7) is only 2 images. At a signature
length of 500, the intersection grows to 8 images, with 100% of the intersection shuffled in order.
When the signature length is 750, the intersection remains 8 images, with 75% of the intersection
shuffled.
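This effect can be quantified with a back-of-the-envelope calculation (not from the paper): the fraction of differing bits is the mean of n Bernoulli trials with success probability p = θ/π, so its standard error shrinks as 1/√n:

```python
import math

p = 0.0855   # e.g., the pool's average diff for reference 1 (Figure 5)
for n in (100, 500, 750, 1000):
    se = math.sqrt(p * (1 - p) / n)
    print(f"n = {n:4d}: standard error = {se:.4f} ({100 * se:.2f} % points)")
```

At n = 100 the standard error (about 2.8 percentage points) is comparable to the spread separating the top matches, which is consistent with the heavy reshuffling observed at short signature lengths.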
(reference 1)
Figure 7. A reference image and the top 10 matching images from the image pool, using 100-bit
signatures. The L channel has been weighted at 0.5, a at 0.25, and b at 0.25. Average diff of pool:
12.13%. The top 10 images represent diffs of 7.0-11.0%, with average diff 9.69%.
Some images do not fare well when subjected to this representation. Figure 8 gives an example
of an image for which the rest of the pool is clustered too closely around the average signature
distance for diffing to be meaningful. In this case, the nearest images do not display any obvious
patterns similar to the reference image.4
4 Although the same two people, Aileen (left) and Wen (right), do appear in the same position in 3 of the top 10
images!
(reference 3)
Figure 8. A reference image and the top 10 matching images from the image pool, using 1000-bit
signatures. The L channel has been weighted at 0.5, a at 0.25, and b at 0.25. Average distance of
pool: 13.4%. The eleventh image, bottom, is of interest because it is perceptually similar to the
reference image but falls at a diff of 12.42% from the reference.
Discussion
The LSH-based representation of images is successful at determining image similarity for certain
types of images relative to an image pool. It has potential for use in database searching,
particularly in applications where the images have a long lifespan (so that the rate at which
signatures must be recalculated is low).
This methodology makes a number of “leaps of faith,” or abstractions from the true measure of
distance between images, to achieve compactness of representation. First, it is not certain that
average values of the Lab channels across sections of the image are the right set of features to
choose for differentiating between images. Second, exactly how well an image vector represents
the features of an image that distinguish it from other images is unknown. Third, exactly how
well the angle between two image vectors represents the actual distance between the
corresponding images is unknown.
The experiments on a pool of images in this work verify that the approach is promising, and that
all of the above-mentioned leaps were taken in good faith. A number of follow-up experiments
can be done; for example, since EMD is recognized as a “good” measure of image distance,
comparing the effectiveness of this methodology against EMD would be an important test.
The approach can perhaps also be improved by choosing a different set of features from which to
build an image vector, or by thresholding, instead of just weighting, the different channels in the
signature once it is obtained. In general, this work takes an entirely different approach to the
notion of image similarity than, for example, an edge-detection method. The differences between
this and other methods of image similarity search, along with the different factors that affect the
eye’s perception of image similarity, have already become apparent in the work above.
Conclusion
The LSH-based compact representation of images studied here was shown to be an effective
measure of image similarity in certain situations, but quite meaningless (in terms of visual
interpretation) in others. In particular, images that are uniformly different from all other images
in a pool do not produce meaningful signature distances, while images with well-distributed
distances from the rest of the pool do. The method of signing images for easy search and
comparison is promising, given the right choice of image features, signature length, and
weighting method.
References
[1] M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. Proceedings of
the 34th ACM Symposium on Theory of Computing, 2002.
[2] Y. Rubner, C. Tomasi, and L. Guibas. A Metric for Distributions with Applications to Image
Databases. Proceedings of the IEEE International Conference on Computer Vision, 1998, pp. 59-66.
[3] P. Indyk and N. Thaper. Fast Image Retrieval via Embeddings.
[4] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality. Proceedings of the 30th ACM Symposium on Theory of Computing, 1998, pp. 604-613.
[5] M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. Proceedings of
the 34th ACM Symposium on Theory of Computing, 2002.