
IMAGE RECOGNITION USING EDIT DISTANCE
Josh Jelin, Data Matching Methods and Their Uses, October 18th, 2013
INTRODUCTION
One of the most effective techniques for determining similarity between strings is edit distance. Since images are
in many ways two-dimensional strings, can we develop effective image recognition software using the concepts of
edit distance? Initial results, unfortunately, suggest that edit distance doesn't offer substantial accuracy
improvements over simpler techniques.
METHODS
THE DATA
27 images were selected to form the basis of our data set. Of these, 17 were photographs, one was a simple
drawing, and 9 were blotches of color of differing sizes. Some photographs were intentionally selected for their
similarity to one another. Each image was converted to a JPEG, then cropped and shrunk to 10x10 pixels in
order to keep software runtimes small. After that, each image was saved three times: once with no compression
(JPEG quality 12), once with moderate compression (JPEG quality 6), and once with significant compression (JPEG quality 1).
This compression was intended to add varying levels of noise to the data. Finally, a solid white and a solid black
image were added to our data set, bringing the total to 83 images.
Figure 1: Example images (original image; shrunk to 10x10 pixels; maximum compression)
FOUR MEASURES OF IMAGE DIFFERENCES
We looked at four measures to identify potential differences between images: average color distance, percentage
of matching pixels, copy distance, and weighted copy distance. The first two are intended as "simple"
measures of the difference between images, and the second two are intended as "edit-distance-like" measures
of differences.
Average color distance is simply the mean Euclidean distance between corresponding pixels. On a computer, each
pixel is recorded as a color with three values: red, green, and blue (RGB). These values can be treated as XYZ
coordinates to get a decent idea of the differences, or distances, between colors. White (1,1,1) and black (0,0,0)
have the largest distance; in this model, all distances have been scaled down so that the distance from white to
black is 1. Each pixel is compared to its counterpart in the other image (see figure 2), the distance between each
pair of pixels is recorded in a matrix of distances, and the average color distance is the mean value of this matrix.
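The original code is not reproduced in this paper, but the calculation can be sketched in a few lines of Python with NumPy. The function names below are ours, and we assume each image arrives as a 10x10x3 array of RGB values scaled to [0, 1]:

```python
import numpy as np

def color_distance_matrix(img1, img2):
    """Per-pixel Euclidean RGB distance, scaled so the white-to-black distance is 1.

    img1, img2: arrays of shape (height, width, 3) with RGB values in [0, 1].
    """
    diff = img1.astype(float) - img2.astype(float)
    # White (1,1,1) and black (0,0,0) are sqrt(3) apart in raw RGB space,
    # so dividing by sqrt(3) rescales that distance to 1.
    return np.sqrt((diff ** 2).sum(axis=-1)) / np.sqrt(3)

def average_color_distance(img1, img2):
    """Mean value of the per-pixel distance matrix."""
    return color_distance_matrix(img1, img2).mean()
```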
Percentage of matching pixels works similarly. The distance matrix is calculated in exactly the same way as for
average color distance. Then an alpha is chosen: all pixel pairs with a color distance smaller than alpha are called a
match, and all pixel pairs with a color distance larger than alpha are called a non-match. Calculating an optimized
alpha value is computationally expensive, so it is out of scope for this paper. However, a good alpha was
experimentally determined to be 0.125 by visual inspection of ROC curves. This means that pixel pairs whose color
distance is less than 1/8th of the distance between white and black are considered to be the same color, i.e. a match.
The percentage of matching pixels is the percentage of pixel pairs in the distance matrix categorized as a match.
Figure 2: Comparing two images, pixel by pixel. In this example, a picture is being compared to a more
compressed version of itself. The top highlighted point, (2,2), has hardly been discolored by the compression
process, so the distance between the two points is almost 0. The other highlighted point, (5,7), has been more
affected by the compression process and has a distance of roughly 0.2. Our alpha = 0.125, so the top pixel is a
match and the bottom is a non-match.
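Building on the hypothetical distance-matrix helper sketched above, the match percentage reduces to a single thresholding step:

```python
ALPHA = 0.125  # match threshold chosen by visual inspection of ROC curves

def percent_matching_pixels(img1, img2, alpha=ALPHA):
    """Fraction of pixel pairs whose scaled color distance falls below alpha."""
    distances = color_distance_matrix(img1, img2)
    return (distances < alpha).mean()
```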
Copy distance is the first of our two edit-distance-like measures. Here's how it works (a code sketch follows the list):
1. The list of non-matching pixels is grabbed from one of the images being compared. We’ll call that image
“image 1.”
2. We take this list of pixels and try to find them in the other image, “image 2.”
3. Each pixel that is found in image 2 scores 1 point. Each pixel not found scores 0 points.
4. The points are summed for the final copy distance.
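A minimal sketch of this procedure, assuming that "finding" a pixel in image 2 means locating any pixel whose color lies within alpha of the non-matching pixel (the exact matching rule is an assumption on our part), and reusing the helpers defined earlier:

```python
def copy_distance(img1, img2, alpha=ALPHA):
    """Count how many of image 1's non-matching pixels can be 'found' in image 2,
    i.e. have at least one pixel in image 2 within alpha in color distance."""
    distances = color_distance_matrix(img1, img2)
    pixels2 = img2.reshape(-1, 3).astype(float)        # every pixel color in image 2
    score = 0
    for row, col in np.argwhere(distances >= alpha):    # non-matching pixels only
        target = img1[row, col].astype(float)
        color_d = np.sqrt(((pixels2 - target) ** 2).sum(axis=1)) / np.sqrt(3)
        if (color_d < alpha).any():                     # a copy-able pixel exists
            score += 1                                  # found: 1 point; not found: 0 points
    return score
```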
This methodology has two obvious flaws. First, it isn't easily reversible: comparing image 1 to image 2 need not
give the same result as comparing image 2 to image 1, and no good solution to this issue was determined as of the
writing of this paper. Second, it seems like there should be some penalty for a copiable pixel being further away in
the image than, say, an adjacent pixel.
Weighted copy distance is similar to copy distance, but it takes the distance between matching pixels (in terms of
their locations within the image) into account. If a pixel matches its partner, then the weight for that pixel is 0. If a
pixel is a mismatch but no copy-able pixels exist within the image, the weight is set to 1. For non-matching pixels
with an available copy, the Euclidean distance between the pixel of interest and each potential copy is calculated;
the shortest distance is then selected and converted to a (0,1) scale. Refer to figure 3 for an example. Once the
weight has been determined for each individual pixel, all these weights are summed to create the weighted copy
distance between images. Intuitively, that means a large weighted copy distance indicates high dissimilarity
between images.
Figure 3: Weighted copy distance example. In this example, we've identified three potential copies for the pixel
in the lower right. The Euclidean distances between this pixel and the candidates are 1, 2, and √5. The shortest
distance is 1, therefore the weight for this pixel is set to 1/√200, where √200 is the diagonal of this 10x10 image.
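Combining the pieces, a hedged sketch of weighted copy distance (again assuming within-alpha color matching and the helpers defined above) might look like this:

```python
def weighted_copy_distance(img1, img2, alpha=ALPHA):
    """Sum of per-pixel weights: 0 for a match, 1 for a mismatch with no copy-able
    pixel, otherwise the nearest copy's spatial distance scaled by the image diagonal."""
    distances = color_distance_matrix(img1, img2)
    height, width = distances.shape
    diagonal = np.sqrt(height ** 2 + width ** 2)              # sqrt(200) for a 10x10 image
    rows, cols = np.indices((height, width))
    coords2 = np.stack([rows.ravel(), cols.ravel()], axis=1)  # (row, col) of every pixel
    pixels2 = img2.reshape(-1, 3).astype(float)

    total = 0.0
    for row, col in np.argwhere(distances >= alpha):          # matching pixels contribute 0
        target = img1[row, col].astype(float)
        color_d = np.sqrt(((pixels2 - target) ** 2).sum(axis=1)) / np.sqrt(3)
        copies = coords2[color_d < alpha]                     # candidate copy locations in image 2
        if len(copies) == 0:
            total += 1.0                                      # no copy available anywhere
        else:
            spatial_d = np.sqrt(((copies - [row, col]) ** 2).sum(axis=1))
            total += spatial_d.min() / diagonal               # nearest copy, scaled to (0, 1)
    return total
```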
COMPARISONS
As stated previously, our data set consisted essentially of three versions of each of 27 images, plus a solid white
and a solid black image. The effectiveness of our models was measured by their ability to deduplicate this list and
identify which triplets of images belonged together. This means we had 83 choose 2, or 3403, comparisons.
BLOCKING
Edit distance measures are computationally expensive to calculate, so we must block out unlikely matches. By
creating a decision tree based on the average color distance and percentage of matching pixels criteria, we
observe that pairs of images with an average color distance > 0.08 are true matches less than 0.01% of the time.
3256 of our 3403 comparisons (95.7%) fall into this category. By utilizing transitivity, we could eliminate all false
negatives created through blocking (see appendix 3). Blocking out these comparisons reduces our run time from
about 30 minutes to just over 30 seconds.
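As an illustration only, assuming a list `images` holding the 83 pre-processed 10x10 arrays, the blocking step reduces to a simple threshold filter applied before any of the expensive measures are computed:

```python
import itertools

# Keep only pairs whose average color distance is at or below the 0.08 cutoff;
# everything else is blocked out before the copy-distance measures ever run.
candidate_pairs = [
    (i, j)
    for i, j in itertools.combinations(range(len(images)), 2)
    if average_color_distance(images[i], images[j]) <= 0.08
]
```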
MODELING
Modeling was done in two ways: logistic regression and a decision tree. Because of their construction, percentage of
matching pixels, copy distance, and weighted copy distance were treated as an interaction variable.
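The fitted models themselves are given in the appendices; a plausible reconstruction of the fitting step in Python, with hypothetical column names and an arbitrary tree depth, would be:

```python
import statsmodels.formula.api as smf
from sklearn.tree import DecisionTreeClassifier

def fit_models(comparisons):
    """comparisons: one row per unblocked image pair, with hypothetical columns
    avg_color_dist, pct_match, copy_dist, weighted_copy_dist and a 0/1 match label."""
    # Logistic regression: the three edit-distance-style measures enter as a full
    # interaction ('*' expands to main effects plus all interaction terms).
    logit = smf.logit(
        "match ~ avg_color_dist + pct_match * copy_dist * weighted_copy_dist",
        data=comparisons,
    ).fit()

    # Decision tree on the same four measures (the depth limit is arbitrary here).
    features = ["avg_color_dist", "pct_match", "copy_dist", "weighted_copy_dist"]
    tree = DecisionTreeClassifier(max_depth=4)
    tree.fit(comparisons[features], comparisons["match"])
    return logit, tree
```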
ANALYSIS
We found that copy distance and weighted copy distance were quite important in the logistic regression model,
but less important in the tree model. Unfortunately, the tree model outperformed the regression model, meaning
the edit distance measures were relatively ineffective. The full form of both models can be viewed in the appendix.
The logistic regression model had four significant predictors: average color distance (p ≈ 10⁻⁶) and three
interaction terms, namely percentage of matching pixels * copy distance (p = 0.0379), copy distance * weighted copy
distance (p = 0.0181), and the triple interaction (p = 0.0140). The remaining interaction term, percentage of matching
pixels * weighted copy distance, was a near miss at p = 0.1393. All other terms were insignificant.
The decision tree used three predictors: average color distance, percentage of matching pixels, and weighted copy
distance. While the contribution of weighted copy distance was nonzero, it was primarily used as a tiebreaker once
most of the heavy lifting had been done by the two simpler measures.
To compare these two models, we plotted side-by-side ROC curves. The decision tree model outperforms the
logistic regression. This is unfortunate, since the tree makes only minimal use of just one of our edit distance
measures. Therefore, edit distance did not substantially improve our model.
For some examples of matching and non-matching images, see appendix 3. Note that all photographs were
correctly deduplicated using both models. The models had issues with the geometric shape and blotches of color.
DISCUSSION
Although edit distance had an impact on our results, in this implementation it was not a major contributor. The
simpler measures were sufficient to deduplicate this particular data set on their own.
It’s possible that more complex implementations of edit distance could be very effective in image recognition. For
example, there is no way in this implementation to edit a swath of points at once. In real-world image recognition,
it’s possible that images could be cropped differently, mirrored, or put through an Instagram filter. Right now, the
only edit the software is capable of is copying pixels. If available edit operations involved altering multiple points
through filters, mirroring, or cropping, it’s possible that edit distance could become a more effective measure.
Cropping brings me to a specific point about this data set: It was very artificial. While the code provided should
theoretically be able to handle any set of equally-sized images, in reality photographs are taken using a variety of
cameras which save images in a multitude of sizes. To work on real-world images, the software would need some
way to normalize image sizes.
Right now the model is very specialized toward deduplicating noisy images. In fact, it even easily deduplicated
three nearly identical photographs of the same person, in the same situation, taken with the same camera, keeping
them cleanly separated from one another. The primary difference between these three images was slight shifts in
the background, which led to large values in our most important criterion, average color distance. This doesn't
seem like a desirable result: we probably want to be able to identify that these are pictures of the same person.
New and more complex edit distance measures may perform better on this sort of problem than average color
distance. However, if we were to eliminate average color distance, we wouldn't have anything simple to block on.
Even with the current blocking scheme, this implementation of these ideas would be prohibitively computationally
expensive if applied to real-world tasks. Let's estimate just how slow this software is. We find 3403 comparisons *
100 pixels per comparison / 30 seconds ≈ 11,000 pixels compared per second. A single HD image has 2,073,600
pixels. Spending roughly three minutes per comparison is totally unrealistic for something like Facebook
image tagging or airport security. Dynamic programming could contribute to solving this problem. Additionally,
some sort of down-scaling may be effective. However, it's conceivable that edit distance is simply too expensive
relative to other techniques to use for image recognition.
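Making that back-of-the-envelope arithmetic explicit:

```python
# Back-of-the-envelope throughput estimate from the figures above.
pixels_per_second = 3403 * 100 / 30       # ≈ 11,300 pixels compared per second
hd_pixels = 1920 * 1080                   # 2,073,600 pixels in a single HD image
print(hd_pixels / pixels_per_second)      # ≈ 183 seconds, i.e. roughly three minutes
```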
CONCLUSION
It’s possible that image recognition could be done effectively using the concepts of edit distance. The interaction
variables for edit distance were statistically significant in our logistic regression model on blocked data. However,
this model was noticeably outperformed by a decision tree. The tree model, unfortunately, made only minimal use
of the edit-distance-like measures. What’s more, the gains generated by the edit-distance-like measures were
extremely computationally expensive, and would require multiple significant coding improvements before they
could be effectively applied to real images. In this implementation, edit distance was only a minor improvement
over simpler and less computationally demanding techniques.
APPENDIX 1: LOGISTIC REGRESSION ON BLOCKED DATA
APPENDIX 2: DECISION TREE MODEL ON BLOCKED DATA
APPENDIX 3: SELECTED EXAMPLES OF MATCHING AND NON-MATCHING IMAGES
TRUE POSITIVES
(Each row is one image triplet, shown at image quality 12, 6, and 1.)

Joint Logit Pred   Joint Tree Pred
0.97               0.8
0.94               1
1                  1
0.76               0.8
0.87               0.8
FALSE POSITIVES
(Each row is a pair of different images, Image 1 and Image 2.)

Logit Pred   Tree Pred
0.7          0
0.6          0.4
TRUE NEGATIVES
(Each row is a pair of different images, Image 1 and Image 2; pairs removed by blocking are marked.)

Logit Pred   Tree Pred
0            0.4
0            0.5
Eliminated during blocking
0            0.5
Eliminated during blocking
FALSE NEGATIVE
Image 1
Image 2
In this final example, a green blotch was improperly separated during blocking. These are images “green 1” and
“green 12,” named for their levels of compression. However, green 6 was not blocked out, and was compared to
both green 1 and green 12. In fact, both the logistic regression and the tree assigned a probability of 1 that green 6
matches both green 1 and green 12. By implementing this style of after-the-fact transitivity, we could create a
perfect blocking set.