VISUAL OBJECT RECOGNITION
The Prowess of Humans and Progress of Machines
Abstract
Visual object recognition is a field that arose from the study of human vision. One of the earliest
developments in the field was Optical Character Recognition (OCR), originally motivated by the
desire to assist the blind. OCR has become a prominent technology and a stepping stone,
propelling research in specific object recognition—recognizing particular instances of a type of
object—and general object recognition—recognizing that different instances of an object belong
to the same type—which has led to exciting developments and applications in today’s society.
Chun Ji (CJ) Wang
chunjiwa@usc.edu
Introduction
The common adage that “a picture is worth a thousand words” reflects the fact that
humans are perceptive creatures who rely heavily on sight. In studying human vision,
researchers have made considerable progress over the past few decades in mimicking the visual
prowess of humans in machines. A robot can be trained to determine a person’s gender from a
photo. A computer can efficiently identify and locate a criminal and his getaway vehicle from a
single image. A single image holds a key that can unlock a wealth of information.
That is the premise behind Google Goggles, a mobile object recognition application for
smartphones. It can accomplish many impressive things, including recognizing famous landmarks,
translating photos of foreign text, and recognizing other objects such as books, paintings, and
CD/DVD covers. From the user’s standpoint, the process is invisible and quick: the app sends the
image to Google, whose computers attempt object recognition on it and return the results to the
user.
Object recognition systems try to replicate humans’ innate ability to accurately and rapidly
identify objects in our visual environment. “The apparent ease with which we recognize objects
belies the magnitude of this feat: we effortlessly recognize objects from among tens of thousands
of possibilities and we do so within a fraction of a second, in spite of tremendous variation in the
appearance of each one.” [1] However, humans also face limitations, especially with specific
object recognition—identifying particular instances of a type of object. We may not be able to
name the specific make and model of a car, but we easily perform generic object recognition—
recognizing different instances of an object as belonging to the same category [2]—and instantly
label it as a car. Machines, on the other hand, can only perform object recognition if the object is
already part of their repertoire of recognizable objects. Unlike humans, no computer is 100% sure of
its recognition abilities.
Optical Character Recognition
However, this does not mean machines cannot achieve accurate results. The most accurate
application of object recognition to date is optical character recognition, or OCR, which is the
mechanical or electronic conversion of handwritten, typewritten, or printed text into machine-encoded
text. Its accuracy can be attributed to a rich development history, as OCR inventions date
back to 1929. [3] These early devices relied on template matching, which compares characters to
those given in a template.
By shining a light upon the input character, the reflected rays pass through a template
and hit a photoelectric cell, which produces a current used to determine whether the input character
matches that of the template. Since an input character is black text, a mismatch lets light reach the
cell, producing a current above a certain threshold; little-to-no current indicates a match. [3][4][5]
Assisting the blind was the motivation behind pioneering early OCR devices. A machine
that detects letters of the alphabet and speaks them out loud was developed in 1949 with
sponsorship from RCA, the Veterans Administration, and the wartime Office of Scientific
Research and Development. However, further development was halted as it was too costly. [6] In
1974, Ray Kurzweil developed an OCR program that could recognize any style of print,
positioning OCR as an application to overcome the handicap of blindness. Two years later, Stevie
Wonder, a renowned blind musician, purchased the first production version of the Kurzweil
Reading Machine, which could read books, magazines, and other printed documents aloud. [7]
OCR has come a long way since the days of aiding the blind. Today, it assists offices in
automatic data entry from paper documents as well as digitizing printed records. Template
matching, also known as matrix matching, is still in use. This technique “compares what the OCR
device sees as a character against a library of character matrices or templates. When an image
matches one of these prescribed templates within a given level of accuracy, the OCR application
assigns that image the corresponding American Standard Code for Information Interchange
(ASCII) code.” [8] When characters are less predictable or when the text is in an image taken on
a smartphone, feature extraction becomes the preferred method. This technique works by searching
for specific features and interpreting the “open areas, closed shapes, diagonal lines, line
intersections, etc.” [8] as characters.
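The difference between matrix matching and feature extraction can be illustrated with a minimal sketch. Everything here is invented for illustration: the 4x4 glyphs and the deliberately simple feature (ink count per quadrant) stand in for the much richer features, such as loops and line intersections, that real OCR engines interpret.

```python
# Toy feature extraction for OCR: instead of comparing whole bitmaps
# against templates, reduce each 4x4 binary glyph to a small feature
# vector (ink count per 2x2 quadrant) and classify by nearest vector.
# Glyphs and features are invented stand-ins for real OCR features.

def quadrant_features(glyph):
    """Reduce a 4x4 binary glyph to ink counts in its four 2x2 quadrants."""
    return [
        sum(glyph[r][c] for r in range(r0, r0 + 2) for c in range(c0, c0 + 2))
        for r0 in (0, 2)
        for c0 in (0, 2)
    ]

# Reference glyphs (invented for illustration).
KNOWN = {
    "L": [[1, 0, 0, 0],
          [1, 0, 0, 0],
          [1, 0, 0, 0],
          [1, 1, 1, 1]],
    "T": [[1, 1, 1, 1],
          [0, 1, 0, 0],
          [0, 1, 0, 0],
          [0, 1, 0, 0]],
}

def classify(glyph):
    """Label a glyph by the nearest reference feature vector."""
    feats = quadrant_features(glyph)
    return min(
        KNOWN,
        key=lambda k: sum(
            (a - b) ** 2 for a, b in zip(feats, quadrant_features(KNOWN[k]))
        ),
    )
```

Because the comparison happens in feature space rather than pixel space, a glyph with a stray extra pixel still lands nearest its true label, which is why feature extraction tolerates less predictable characters better than strict template matching.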
Specific Object Recognition
Feature extraction is also crucial to specific object recognition. The procedure follows
three general steps: “(1) Extract local features from both the training and test images
independently. (2) Match the feature sets to find putative correspondences. (3) Verify if the
matched features occur in a consistent geometric configuration.” [2] This is the way Google
Goggles operates. If we take a picture of the Golden Gate Bridge in San Francisco with the Goggles
app, Google would first (ignoring the transmission of the compressed image to its servers)
determine a distinctive set of keypoints for the image. Then, for each of these points, a surrounding
region is defined in a manner that is invariant to image scaling and image rotation. A descriptor,
or a way to describe the appearance, is then computed for each region. [2]
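As a toy illustration of a region descriptor, the sketch below describes the area around a keypoint by a normalized intensity histogram. The image, keypoint, and bin count are invented, and real systems use far more sophisticated descriptors (such as SIFT); a histogram is chosen here only because it is rotation-invariant by construction, since it ignores where pixels sit inside the region.

```python
# Toy local descriptor: a normalized intensity histogram over the square
# region around a keypoint. Ignoring pixel positions makes the descriptor
# rotation-invariant; normalizing gives some illumination robustness.
# Real descriptors (e.g. SIFT) are far richer; this is illustration only.

def region_descriptor(image, keypoint, radius=1, bins=4):
    """Histogram of intensities (0..255) in a square region, normalized."""
    x, y = keypoint
    hist = [0] * bins
    for r in range(y - radius, y + radius + 1):
        for c in range(x - radius, x + radius + 1):
            hist[image[r][c] * bins // 256] += 1
    total = sum(hist)
    return [h / total for h in hist]
```

Rotating the region permutes its pixels but leaves the multiset of intensities unchanged, so the descriptor is identical before and after rotation, which is the kind of invariance the matching step relies on.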
Google then searches in its databases for image descriptors that are similar to the local
features of our Golden Gate Bridge image. Because Google can recognize a vast number of objects,
the databases that contain the descriptors are immense. Thus, it is naïve to compare the current
input descriptor one-by-one with all existing descriptors. To make this process practical and usable,
Google implements a type of database structure and applies an algorithm that facilitates efficient
similarity search. [2] Such algorithms include tree-based and hashing algorithms, which offer the most
control over how candidate matches are made. Another approach is a visual vocabulary, which groups
together similar local descriptors where each group is represented by a unique token. This approach
lacks the control of tree and hashing algorithms but allows for faster verification between two
images. [2]
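The visual-vocabulary idea can be sketched as follows. This is a hedged toy: the two-dimensional descriptors, the three-word vocabulary, and the database are all invented for illustration, whereas real vocabularies contain thousands of high-dimensional words learned by clustering.

```python
# Toy visual-vocabulary index: quantize each local descriptor to its
# nearest "visual word" and index images by the words they contain, so
# candidates are found by inverted-index lookup instead of comparing
# every descriptor pair. All data below is invented for illustration.

VOCAB = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0)}  # word id -> center

def quantize(desc):
    """Map a descriptor to the id of its nearest visual word."""
    return min(VOCAB, key=lambda w: sum((a - b) ** 2 for a, b in zip(desc, VOCAB[w])))

def build_index(database):
    """database: image name -> descriptor list; returns word id -> image names."""
    index = {}
    for name, descs in database.items():
        for d in descs:
            index.setdefault(quantize(d), set()).add(name)
    return index

def candidates(query_descs, index):
    """Rank database images by the number of visual words shared with the query."""
    votes = {}
    for d in query_descs:
        for name in index.get(quantize(d), ()):
            votes[name] = votes.get(name, 0) + 1
    return sorted(votes, key=votes.get, reverse=True)
```

The trade-off described above is visible here: the token lookup is fast, but two different descriptors that quantize to the same word become indistinguishable, which is the loss of control relative to tree and hashing methods.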
Before Google returns its match results, it performs the third and final step: verifying that
the matches occur in a consistent geometric configuration. This prevents false matches and
increases accuracy when an image has fewer local descriptors. A common geometric
transformation is estimated for the locations and scales of corresponding features between the two
images. If such a transformation exists, it is highly likely that the objects in both images are the
same. [2] Once this step is complete and the percentage of error is below a set threshold, Google
will return results of the Golden Gate Bridge back to our smartphone.
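The geometric-verification step can be illustrated with a deliberately simple transform, a single translation; a real system would estimate a richer transform (similarity, affine, or homography), typically with a robust estimator such as RANSAC. The function name, tolerance, and inlier count below are all invented.

```python
# Toy geometric verification: given putative point correspondences
# between two images, hypothesize one translation and count how many
# matches agree with it. Real systems fit richer transforms robustly
# (e.g. a homography with RANSAC); this sketch is illustration only.

def verify(matches, tol=1.0, min_inliers=3):
    """matches: list of ((x1, y1), (x2, y2)) corresponding points.

    Returns True if enough matches share one consistent translation.
    """
    if not matches:
        return False
    # Use the first match to hypothesize a translation.
    (x1, y1), (x2, y2) = matches[0]
    dx, dy = x2 - x1, y2 - y1
    inliers = sum(
        1 for (a, b), (c, d) in matches
        if abs((c - a) - dx) <= tol and abs((d - b) - dy) <= tol
    )
    return inliers >= min_inliers
```

Correspondences that all shift by roughly the same amount pass; a set of matches pointing in scattered directions fails, which is how this step filters the false matches that descriptor similarity alone lets through.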
Other applications of specific object recognition include: image matching for creating
panoramas, object recognition for facial and license plate recognition systems, and large-scale
image retrieval for gathering images with features similar to the input image. Like humans, today’s
computers and machines excel at these kinds of specific object recognition. The automation in
creating panoramas, finding a particular face, and identifying a speeder’s license plate eliminates
the need for time-consuming manual labor and increases the efficiency and throughput of each
job. In addition, a wealth of information available at our fingertips is certainly empowering. But
do not ignore the cases where pieces of a panorama may be misaligned, a particular face identified
inaccurately, a speeding ticket sent to the wrong driver, or Google returning a result for a Java
textbook when its cover also features the New York Public Library lion. [9] Even though these
faults are very rare and the latest OCR readers are able to achieve accuracy rates of 99.9975% [8],
there is still no 100% guarantee without human review.
Generic Object Recognition
“A robust, real-world machine solution still evades us,” [1] yet, as engineers, we continue to
tackle the even more difficult case of generic, category-based object recognition. Just as with
specific object recognition, generic object recognition follows three
basic steps: “(1) Choose a representation and accompanying model (which may be hand-crafted,
learned, or some combination thereof). (2) Given a novel image, search for evidence supporting
the object models, and assign scores or confidences to all such candidates. (3) Take care to suppress
any redundant or conflicting detections.” [2]
Imagine for a moment that the Golden Gate Bridge is not famous enough to warrant itself
a name; it is just another suspension bridge, or any other bridge for that matter. Now if we run that
image through a generic category-level object recognition system, it first represents the image
description as one of two types of models. The first type is a window-based model, where
appearance is described for a particular rectangular region of interest. The other type is a parts-based
model, which combines “separate descriptors for the appearance of a set of local parts
together with a geometric layout.” [2] Think of each part as a small window. Now let us consider
six local parts for our bridge: a pair of parts at the tops of the towers where the cables are
attached, another pair where the towers meet the roadway, and the final pair where the towers
are fixed into concrete above the waterline. This set of parts forms two rectangles, the top one with
a taller height than the bottom, which can be used to represent a bridge. With a window-based
model, the bridge would be represented by a single rectangle that encloses the entire bridge.
Recognizing our object’s category using the window-based model is an image
classification problem. In other words, a classifier determines whether a bridge is present or absent in
the given window. [2] Detection with the window-based model is algorithmically simpler since it
considers the holistic appearance of the window. Of course, in order to be accurate, the window of
our bridge must have a certain level of invariance compared to the windows of bridges in the
images of the system’s database. With parts-based models, detection relies on more complex
search procedures for matching both the parts as well as their geometric relationship with each
other. [2] Once the system returns a match, either because our bridge has a similar window or
because its local parts and their spatial layout are similar to those of other bridges, the final step is
to verify that the bridge is actually a bridge using the same verification technique as in specific
object recognition.
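Steps (2) and (3) of this pipeline, scoring candidate windows and suppressing redundant detections, can be sketched with a sliding window and a stand-in classifier. The "fraction of dark pixels" score, the window size, and the threshold are invented purely for illustration; a real detector would use a trained classifier.

```python
# Toy window-based detection: slide a fixed-size window over a binary
# image, score each window with a stand-in classifier, keep windows above
# a threshold, and greedily suppress overlapping detections. The scoring
# rule here is invented; real detectors use trained classifiers.

def windows(width, height, size, stride):
    """Yield (x, y, w, h) boxes covering the image."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, size, size)

def score(image, box):
    """Stand-in classifier: fraction of dark pixels inside the box."""
    x, y, w, h = box
    return sum(image[r][c] for r in range(y, y + h) for c in range(x, x + w)) / (w * h)

def detect(image, size=2, stride=1, threshold=0.75):
    h, w = len(image), len(image[0])
    hits = [(score(image, b), b) for b in windows(w, h, size, stride)]
    hits = [(s, b) for s, b in hits if s >= threshold]
    # Greedy suppression: keep the best-scoring box, drop overlaps.
    hits.sort(reverse=True)
    kept = []
    for s, (x, y, wd, ht) in hits:
        if all(abs(x - kx) >= wd or abs(y - ky) >= ht for _, (kx, ky, _, _) in kept):
            kept.append((s, (x, y, wd, ht)))
    return [b for _, b in kept]
```

On a small image containing one dark 2x2 block, only the window aligned with the block clears the threshold, and suppression returns it as the single detection; this is the holistic window appearance contrasted above with parts-based matching.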
Instances of generic object recognition have been successfully implemented and are in use
today. Applications include face detection (not to be confused with facial recognition) and person
detection. The latter is used to count people, measure occupancy, and study crowds. [10] Face
detection is a feature common in many of today’s consumer digital cameras. These cameras will
overlay a bounding box around a subject’s face in the shot. Some of the higher-end models can
even detect a smile on a person’s face, automatically triggering the shutter. The ease with which
digital cameras perform this task is attributed to the high pattern similarity among different face
instances and among different standing persons, which makes recognizing the presence of a face or
person relatively simple and fast. [2]
Limitations
But not all objects are easy to recognize. Object recognition based on two-dimensional
images is inherently limited. The object in an image will have a certain position and pose, it may
be partially occluded, or the lighting and background may vary in comparison with the training
images. In contrast to digital recognition, human vision operates in high-dimensional space. With
each glimpse, an image is projected into the eye and conveyed to the brain in a spiking activity
pattern of ~1 million retinal ganglion cells. In other words, each image is one point out of a ~1
million dimensional retinal ganglion cell representation. [1] Thus it is easy for us to recognize
objects even if they are shown to us in different positions, poses, and settings. In an attempt to
match human visual prowess, researchers have begun training machines to recognize
objects, associating each object with its corresponding category, using better training images.
However, training images are not immune to the aforementioned limitations, making it
difficult to benchmark the true capabilities of artificial object recognition. By using sets of more
“natural” images, researchers hope that they can capture the essence of problems encountered in
the real world. But even though a set may contain a large number of images, the variations in object
pose and other factors are poorly defined and not varied systematically, leading to inconsistencies.
Furthermore, the majority of images are “composed,” meaning that the photographer decided how
the shot should be framed, resulting in deliberately positioned objects and eliminating randomness. As
a result, such shots may not properly reflect the variation found in reality. [11]
Conclusion
These issues have not stopped scientists and engineers from trying to artificially replicate
object recognition. In fact, significant progress has been made since the days of the early OCR
devices. OCR is a mature and commercialized technology. Specific object recognition performs
well for the objects it can recognize. And in recent years, methods for generic object
recognition have shown much better performance, with around 60% accuracy [11], which
suggests that even though these approaches remain well below human performance, they are at
least heading in the right direction. As we continue to marvel at our own visual
recognition prowess and at the progress made to replicate it artificially, we can only
wonder when that progress may one day catch up with, and even surpass, our own.
References
[1] J. J. DiCarlo and D. D. Cox, “Untangling invariant object recognition,” Trends in Cognitive Sciences, vol. 11, no. 8, pp. 333-341, July 2007.
[2] K. Grauman and B. Leibe, “Visual Object Recognition,” in Synthesis Lectures on Artificial Int. and Mach. Learning. Morgan & Claypool, 2011.
[3] G. Tauschek, “Reading Mach.,” U.S. Patent 2 026 329, Dec. 31, 1935.
[4] P. W. Handel, “Statistical Mach.,” U.S. Patent 1 915 993, June 27, 1933.
[5] D. H. Shepard, “App. for Reading,” U.S. Patent 2 663 758, Dec. 22, 1953.
[6] M. Mann, “Reading Mach. Spells Out Loud,” Popular Sci., vol. 154, no. 2, pp. 125-127, Feb. 1949.
[7] R. Kurzweil, The Age of Spiritual Machines. New York: Viking, 1999.
[8] E. T. Eaton, “Limiting storage or transmission of visual information using optical character recognition,” U.S. Patent 7 092 568, Aug. 15, 2006.
[9] S. Segan, “Hands On with Google Goggles: New York City,” PC Mag., Dec. 2009.
[10] Z. Zhang et al., “A Robust Human Detection and Tracking System Using a Human-Model-Based Camera Calibration,” in the Eighth Int. Workshop on Visual Surveillance, 2008.
[11] N. Pinto et al., “Why is Real-World Visual Object Recognition Hard?,” PLoS Computational Biology, vol. 4, no. 1, e27, Jan. 2008.