IMAGE RECOGNITION USING EDIT DISTANCE
Josh Jelin, Data Matching Methods and Their Uses, October 18th, 2013

INTRODUCTION

One of the most effective techniques for determining similarity between strings is edit distance. Since images are, in many ways, two-dimensional analogues of strings, can we develop effective image recognition software using the concepts of edit distance? Initial results, unfortunately, suggest that edit distance doesn't offer substantial accuracy improvements over simpler techniques.

METHODS

THE DATA

27 images were selected to form the basis of our data set. Of these, 17 were photographs, one was a simple drawing, and 9 were blotches of color of differing sizes. Some photographs were intentionally selected due to their similarity to one another. Each image was converted to a JPEG, then cropped and shrunk to 10x10 pixels in order to keep software runtimes small. After that, each image was saved three times: once using minimal compression (JPEG quality 12), once using a moderate amount of compression (quality 6), and once using significant compression (quality 1). This compression was intended to add varying levels of noise to the data. Finally, a solid white image and a solid black image were added, bringing our total data set to 83 images.

Figure 1: Example images (original image; shrunk to 10x10 pixels; maximum compression).

FOUR MEASURES OF IMAGE DIFFERENCES

We looked at four measures to identify potential differences between images: average color distance, percentage of matching pixels, copy distance, and weighted copy distance. Of these, the first two are intended to be "simple" measures of the difference between images, and the last two are intended to be "edit-distance-like" measures.

Average color distance is measured by finding the mean Euclidean distance between corresponding pixels. On a computer, each pixel is recorded as a color with three values: red, green, and blue (RGB). These RGB values can be treated as XYZ coordinates to get a decent idea of the differences, or distances, between colors. White (1,1,1) and black (0,0,0) have the largest distance. In this model, all distances are scaled down so that the distance from white to black is 1. In this way, each pixel is compared to its counterpart in the other image (see Figure 2). The distance between each pair of pixels is recorded in a matrix of distances. The average color distance is the mean value of this matrix.

Percentage of matching pixels works similarly. The distance matrix is calculated in exactly the same way as it was for average color distance. Then, an alpha is chosen. All pixel pairs with a color distance smaller than alpha are called a match, and all pixel pairs with a color distance larger than alpha are called a non-match. Calculating an optimized alpha value is computationally expensive, so it is beyond the scope of this paper. However, a good alpha was determined experimentally to be 0.125 by visual inspection of ROC curves. This means that pixels whose color distance is less than 1/8th of the distance between white and black are considered to be the same color, i.e., a match. The percentage of matching pixels is the percentage of pixel pairs in the distance matrix categorized as a match.

Figure 2: Comparing two images, pixel by pixel. In this example, a picture is being compared to a more compressed version of itself. The top highlighted point, (2,2), has hardly been discolored by the compression process, so the distance between the two pixels is almost 0. The other highlighted point, (5,7), has been more affected by the compression process and has a distance of about 0.2. Since our alpha is 0.125, the top pixel is a match and the bottom one is a non-match.
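Both simple measures can be sketched in a few lines of Python. The sketch below is illustrative rather than the code actually used for this paper: it assumes each image is an equally sized grid (here 10x10) of RGB triples already scaled to the range [0, 1], and all function names are hypothetical.

    import math

    # Illustrative sketch; assumes each image is a same-sized grid (e.g. 10x10)
    # of RGB triples scaled to the range [0, 1].

    def color_distance(p1, p2):
        # Euclidean RGB distance, scaled so white (1,1,1) to black (0,0,0) equals 1.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2))) / math.sqrt(3)

    def distance_matrix(img1, img2):
        # Distance between each pixel and its counterpart in the other image.
        return [[color_distance(img1[r][c], img2[r][c])
                 for c in range(len(img1[0]))]
                for r in range(len(img1))]

    def average_color_distance(img1, img2):
        # Mean value of the distance matrix.
        d = distance_matrix(img1, img2)
        return sum(sum(row) for row in d) / (len(d) * len(d[0]))

    def pct_matching_pixels(img1, img2, alpha=0.125):
        # Share of pixel pairs whose color distance falls below alpha.
        d = distance_matrix(img1, img2)
        flat = [v for row in d for v in row]
        return sum(1 for v in flat if v < alpha) / len(flat)

Here average_color_distance and pct_matching_pixels correspond to the two simple measures described above.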
Copy distance is the first of our two edit-distance-like measures. Here's how it works:

1. The list of non-matching pixels is taken from one of the images being compared. We'll call that image "image 1."
2. We take this list of pixels and try to find each of them in the other image, "image 2."
3. Each pixel that is found in image 2 scores 1 point. Each pixel not found scores 0 points.
4. The points are summed for the final copy distance.

This methodology has two obvious flaws. First, it isn't easily reversible: the score depends on which image is treated as image 1. No good solution to this issue was found as of the writing of this paper. Second, it seems like there should be some penalty for a copyable pixel being further away in the image versus, say, an adjacent pixel.

Weighted copy distance is similar to copy distance, but it takes into account the distance (in terms of location within the image) between a non-matching pixel and its potential copies. If a pixel matches its partner, then the weight for that pixel is 0. If a pixel is a mismatch but no copyable pixels exist within the image, the weight is set to 1. For non-matching pixels with an available copy, the Euclidean distance between the pixel of interest and each potential copy is calculated. The shortest distance is then selected and converted to a (0, 1) scale. Refer to Figure 3 for an example. Once the weight has been determined for each individual pixel, all of these weights are summed to create the weighted copy distance between the images. Intuitively, this means that a large weighted copy distance indicates high dissimilarity between images. (Both measures are sketched in code after Figure 3.)

Figure 3: Weighted copy distance example. In this example, we've identified three potential copies for the pixel in the lower right. The Euclidean distances to these points are 1, 2, and √5. The shortest distance is 1, so the weight for this pixel is set to 1/√200, where √200 is the length of the diagonal of a 10x10 image.
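The two edit-distance-like measures can be sketched similarly. This is an illustrative sketch rather than the original implementation: it reuses the same image representation and scaled color distance as the previous sketch, and it assumes that "finding" a non-matching pixel in image 2 means finding some pixel of image 2 within the same alpha threshold of its color, a detail the description above leaves open. Names are hypothetical.

    import math

    ALPHA = 0.125  # same match threshold used for the percentage of matching pixels

    def color_distance(p1, p2):
        # Same scaled RGB distance as in the earlier sketch (white to black = 1).
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2))) / math.sqrt(3)

    def copy_distances(img1, img2, alpha=ALPHA):
        # Returns (copy_distance, weighted_copy_distance) from img1 to img2.
        # Assumption: "found in image 2" means some pixel of image 2 lies within
        # the alpha color threshold of the non-matching pixel.
        rows, cols = len(img1), len(img1[0])
        diagonal = math.sqrt(rows ** 2 + cols ** 2)   # sqrt(200) for a 10x10 image
        copy_score = 0
        weighted = 0.0
        for r in range(rows):
            for c in range(cols):
                if color_distance(img1[r][c], img2[r][c]) < alpha:
                    continue                           # matching pixel: weight 0
                # Non-matching pixel: look for copyable pixels anywhere in image 2.
                copies = [(rr, cc) for rr in range(rows) for cc in range(cols)
                          if color_distance(img1[r][c], img2[rr][cc]) < alpha]
                if not copies:
                    weighted += 1.0                    # no copy available: weight 1
                    continue
                copy_score += 1                        # found in image 2: 1 point
                nearest = min(math.sqrt((r - rr) ** 2 + (c - cc) ** 2)
                              for rr, cc in copies)
                weighted += nearest / diagonal         # scale to (0, 1) by the diagonal
        return copy_score, weighted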
COMPARISONS

As stated previously, our data set was essentially three copies of each of 27 images, plus a solid white and a solid black image. The effectiveness of our models would be measured by their ability to deduplicate this list and identify which triplets of images belonged together. This means we had 83 choose 2, or 3403, comparisons.

BLOCKING

Edit distance measures are computationally expensive to calculate, so we need to block out unlikely matches. By creating a decision tree based on the average color distance and percentage of matching pixels criteria, we observe that pairs of images with an average color distance > 0.08 are true matches under 0.01% of the time. 3256 of our 3403 comparisons (95.7%) fall into this category. By utilizing transitivity, we could eliminate all false negatives created through blocking (see Appendix 3). By blocking out these comparisons, we can reduce our run time from about 30 minutes to just over 30 seconds.

MODELING

Modeling was done in two ways: logistic regression and decision tree. Because of their construction, percentage of matching pixels, copy distance, and weighted copy distance were treated as an interaction variable.

ANALYSIS

We found that copy distance and weighted copy distance were quite important in the logistic regression model, but less important in the tree model. Unfortunately, the tree model outperformed the regression model, meaning the edit distance measures were relatively ineffective.

The full form of both models can be viewed in the appendix. The logistic regression model had four significant predictors: average color distance (p ≈ 10^-6) and three interaction terms: percentage of matching pixels * copy distance (p = 0.0379), copy distance * weighted copy distance (p = 0.0181), and the three-way interaction of all three measures (p = 0.0140). The remaining interaction term, percentage of matching pixels * weighted copy distance, was a near miss at p = 0.1393. All other terms were insignificant.

The decision tree used three predictors: average color distance, percentage of matching pixels, and weighted copy distance. While the contribution of weighted copy distance was nonzero, it was primarily used as a tiebreaker once most of the heavy lifting had been completed by the two simpler measures.

To compare these two models, we plotted side-by-side ROC curves. The decision tree model outperforms the logistic regression. This is unfortunate, since the tree makes only minimal use of just one of our edit-distance-like measures. Therefore, edit distance did not substantially improve our model.

For some examples of matching and non-matching images, see Appendix 3. Note that all photographs were correctly deduplicated using both models. The models had issues with the simple drawing and the blotches of color.

DISCUSSION

Although edit distance had an impact on our results, in this implementation it was not a major contributor. The simpler measures were effective at deduplicating this particular data set on their own.

It's possible that more complex implementations of edit distance could be very effective in image recognition. For example, there is no way in this implementation to edit a swath of points at once. In real-world image recognition, images could be cropped differently, mirrored, or put through an Instagram filter. Right now, the only edit the software is capable of is copying pixels. If the available edit operations involved altering multiple points at once through filters, mirroring, or cropping, it's possible that edit distance could become a more effective measure.

Cropping brings me to a specific point about this data set: it was very artificial. While the code provided should theoretically be able to handle any set of equally sized images, in reality photographs are taken using a variety of cameras which save images in a multitude of sizes. To work on real-world images, the software would need some way to normalize image sizes.

Right now the model is very specialized for deduplicating noisy images. In fact, it even easily distinguished between three nearly identical photographs of the same person, in the same situation, taken with the same camera. The primary difference between these three images was slight shifts in the background, which led to large values in our most important criterion, average color distance. This doesn't seem like a desirable result: we probably want to be able to identify that these are pictures of the same person. New and more complex edit distance measures may perform better on this sort of problem than average color distance. However, if we were to eliminate average color distance, we wouldn't have anything simple to block on.

Even with the current blocking scheme, this implementation of these ideas would be prohibitively computationally expensive if applied to real-world tasks. Let's estimate just how slow this software is. We find 3403 comparisons * 100 pixels per comparison / 30 seconds ≈ 11,000 pixel comparisons per second. A single HD (1920x1080) image has 2,073,600 pixels. At that rate, a single pair of HD images would take roughly three minutes to compare, which is totally unrealistic for something like Facebook image tagging or airport security.
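As a quick back-of-the-envelope check of that estimate (the throughput figure comes from the run times reported above; exact timings will of course vary by machine):

    # Rough throughput from the blocked run: 3403 comparisons of 100 pixels each
    # in roughly 30 seconds.
    pixels_per_second = 3403 * 100 / 30      # about 11,000 pixel comparisons/second
    hd_pixels = 1920 * 1080                  # 2,073,600 pixels in a single HD image
    seconds_per_hd_pair = hd_pixels / pixels_per_second
    print(round(seconds_per_hd_pair))        # ~183 seconds, i.e. about three minutes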
Dynamic programming could contribute to solving this problem. Additionally, some sort of down-scaling may be effective. However, it's conceivable that edit distance is simply too expensive relative to other techniques to use for image recognition.

CONCLUSION

It's possible that image recognition could be done effectively using the concepts of edit distance. The interaction variables involving the edit-distance-like measures were statistically significant in our logistic regression model on the blocked data. However, this model was noticeably outperformed by a decision tree, and the tree model, unfortunately, made only minimal use of the edit-distance-like measures. What's more, the gains generated by the edit-distance-like measures were extremely computationally expensive, and would require multiple significant coding improvements before they could be effectively applied to real images. In this implementation, edit distance was only a minor improvement over simpler and less computationally demanding techniques.

APPENDIX 1: LOGISTIC REGRESSION ON BLOCKED DATA

APPENDIX 2: DECISION TREE MODEL ON BLOCKED DATA

APPENDIX 3: SELECTED EXAMPLES OF MATCHING AND NON-MATCHING IMAGES

TRUE POSITIVES
(Example image triplets at quality 12, 6, and 1, shown with their joint logistic regression and joint tree match predictions.)

FALSE POSITIVES
(Example image pairs with their logistic regression and tree match predictions.)

TRUE NEGATIVE
(Example image pairs with their logistic regression and tree match predictions; several candidate pairs were eliminated during blocking.)

FALSE NEGATIVE
In this final example, a green blotch was improperly separated during blocking. These are images "green 1" and "green 12," named for their compression quality levels. However, green 6 was not blocked out, and was compared to both green 1 and green 12. In fact, both the logistic regression and the tree assigned a probability of 1 that green 6 matches both green 1 and green 12. By applying this style of after-the-fact transitivity, we could create a perfect blocking set.
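A minimal sketch of this after-the-fact transitivity idea, assuming the accepted matches are available as a list of image-name pairs (the helper below and the names are illustrative, not part of the original code): any pair lost to blocking can be recovered by taking the transitive closure of the accepted matches, for example with a simple union-find structure.

    # Illustrative sketch: recover blocked pairs via transitive closure of accepted matches.
    def transitive_clusters(images, matched_pairs):
        # Group images into clusters so that any chain of matches ends up together.
        parent = {img: img for img in images}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        def union(a, b):
            parent[find(a)] = find(b)

        for a, b in matched_pairs:
            union(a, b)

        clusters = {}
        for img in images:
            clusters.setdefault(find(img), []).append(img)
        return list(clusters.values())

    # Example: the pair (green 1, green 12) was blocked, but both match green 6,
    # so transitivity places all three compression levels in one cluster.
    images = ["green 1", "green 6", "green 12"]
    matches = [("green 6", "green 1"), ("green 6", "green 12")]
    print(transitive_clusters(images, matches))  # [['green 1', 'green 6', 'green 12']]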