VISUAL OBJECT RECOGNITION
The Prowess of Humans and Progress of Machines

Chun Ji (CJ) Wang
chunjiwa@usc.edu

Abstract

Visual object recognition is a field that arose from the study of human vision. One of the earliest developments in the field was Optical Character Recognition (OCR), motivated by the goal of assisting the blind. OCR has become a prominent technology and a stepping stone, propelling research in specific object recognition (recognizing particular instances of a type of object) and generic object recognition (recognizing that different instances of an object belong to the same type), which has led to exciting developments and applications in today's society.

Introduction

The common adage that "a picture is worth a thousand words" reflects the fact that humans are perceptive creatures who rely heavily on sight. In studying human vision, researchers have made much progress in the past few decades toward mimicking the visual prowess of humans in machines. A robot can be trained to determine a person's gender from a photo. A computer can efficiently identify and locate a criminal and his getaway vehicle from a single image. A single image holds a key that can unlock a wealth of information.

That is the premise behind Google Goggles, a mobile object recognition application for smartphones. It can accomplish many impressive tasks, including recognizing famous landmarks, translating a photo of foreign text, and recognizing other objects such as books, paintings, and CD/DVD covers. From the user's standpoint, the process is invisible and quick: the app sends the image to Google, whose servers attempt object recognition on it and return the results to the user.

Object recognition systems try to replicate humans' innate ability to accurately and rapidly identify objects in the visual environment. "The apparent ease with which we recognize objects belies the magnitude of this feat: we effortlessly recognize objects from among tens of thousands of possibilities and we do so within a fraction of a second, in spite of tremendous variation in the appearance of each one." [1] However, humans also face limitations, especially with specific object recognition (identifying particular instances of a type of object). We may not be able to name the specific make and model of a car, but we easily perform generic object recognition (recognizing different instances of an object as belonging to the same category [2]) and instantly label it as a car. Machines, on the other hand, can only perform object recognition if the object is already part of their repertoire of recognizable objects. Unlike humans, no computer can be 100% certain of its recognition results.

Optical Character Recognition

However, this does not mean machines cannot achieve accurate results. The most accurate application of object recognition to date is optical character recognition, or OCR: the mechanical or electronic conversion of handwritten, typewritten, or printed text into machine-encoded text. Its accuracy can be attributed to a rich development history; early OCR inventions date back to 1929. [3] These devices relied on template matching, which compares characters to those given in a template. Light is shone upon the input character, and the reflected rays pass through a template and strike a photoelectric cell, which produces a current that is used to determine whether the input character matches the template. Since an input character is printed in black, it fails to match the template when the cell receives stray light and produces a current above a certain threshold; little to no current indicates a match. [3][4][5]
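The thresholding logic of those photoelectric readers translates naturally into a few lines of code. Below is a minimal sketch, assuming characters have already been binarized into small 0/1 arrays; the toy 3x3 "font," the labels, and the mismatch threshold are all invented for illustration, not taken from any historical device.

import numpy as np

def match_character(glyph, templates, threshold=0.05):
    """Return the label of the template whose pixel-wise mismatch with
    `glyph` is lowest and falls below `threshold`. A small mismatch plays
    the role of the "little to no current" condition in early readers."""
    best_label, best_error = None, threshold
    for label, template in templates.items():
        # Fraction of pixels where glyph and template disagree: the
        # analogue of light leaking past a poorly matched template.
        error = np.mean(glyph != template)
        if error < best_error:
            best_label, best_error = label, error
    return best_label

# Toy 3x3 binary "font" (1 = ink, 0 = background), purely illustrative.
templates = {
    "I": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "L": np.array([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
}
glyph = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
print(match_character(glyph, templates))  # prints "I"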
Assisting the blind was the motivation behind pioneering early OCR devices. A machine that detects letters of the alphabet and speaks them out loud was developed in 1949 with sponsorship from RCA, the Veterans Administration, and the wartime Office of Scientific Research and Development. However, further development was halted because it was too costly. [6] In 1974, Ray Kurzweil developed an OCR program that could recognize any style of print, positioning OCR as an application to overcome the handicap of blindness. Two years later, Stevie Wonder, a renowned blind musician, purchased the first production version of the Kurzweil Reading Machine, which could read books, magazines, and other printed documents out loud. [7]

OCR has come a long way since the days of aiding the blind. Today, it assists offices with automatic data entry from paper documents as well as with digitizing printed records. Template matching, also known as matrix matching, is still in use. This technique "compares what the OCR device sees as a character against a library of character matrices or templates. When an image matches one of these prescribed templates within a given level of accuracy, the OCR application assigns that image the corresponding American Standard Code for Information Interchange (ASCII) code." [8] When characters are less predictable, or when the text is in an image taken on a smartphone, feature extraction becomes the preferred method. This technique works by searching for specific features and interpreting the "open areas, closed shapes, diagonal lines, line intersections, etc." [8] as characters.

Specific Object Recognition

Feature extraction is crucial to specific visual object recognition. The procedure follows three general steps: "(1) Extract local features from both the training and test images independently. (2) Match the feature sets to find putative correspondences. (3) Verify if the matched features occur in a consistent geometric configuration." [2]

This is how Google Goggles operates. If we take a picture of the Golden Gate Bridge in San Francisco with the Goggles app, Google first (ignoring the transmission of the compressed image to its servers) determines a distinctive set of keypoints for the image. Then, for each of these points, a surrounding region is defined in a manner that is invariant to image scaling and rotation. A descriptor, a way of describing the appearance, is then computed for each region. [2] Google then searches its databases for image descriptors that are similar to the local features of our Golden Gate Bridge image. Because Google can recognize a vast number of objects, the databases that contain the descriptors are immense. Thus, it is naïve to compare the current input descriptor one by one against all existing descriptors. To make this process practical and usable, Google implements a type of database structure and applies an algorithm that facilitates efficient similarity search. [2] Such algorithms include tree and hashing algorithms, which offer the most control over how candidate matches are made. Another approach is a visual vocabulary, which groups together similar local descriptors, with each group represented by a unique token. This approach lacks the control of tree and hashing algorithms but allows for faster verification between two images. [2]

Before Google returns its match results, it performs the third and final step of verifying whether the matches occur in a consistent geometric configuration. This prevents false matches and increases accuracy when an image has fewer local descriptors. A common geometric transformation is estimated from the locations and scales of corresponding features between the two images. If such a transformation exists, it is highly likely that the objects in both images are the same. [2] Once this step is complete and the error is below a set threshold, Google returns results for the Golden Gate Bridge to our smartphone.
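The extract-match-verify pipeline can be sketched with off-the-shelf tools. The following is a minimal illustration using OpenCV's SIFT features, brute-force matching with Lowe's ratio test, and RANSAC homography estimation for the geometric check; it is a stand-in for the general idea, not Google's actual system, and the file names are placeholders.

import cv2
import numpy as np

# (1) Extract scale- and rotation-invariant local features independently.
query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)          # our photo
reference = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)  # database image
sift = cv2.SIFT_create()
kp_q, des_q = sift.detectAndCompute(query, None)
kp_r, des_r = sift.detectAndCompute(reference, None)

# (2) Find putative correspondences; the ratio test discards descriptors
# whose best match is barely better than their second best (ambiguous).
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des_q, des_r, k=2)
        if m.distance < 0.75 * n.distance]

# (3) Verify geometric consistency: estimate one common transformation
# (here a homography) with RANSAC; many inliers suggest the same object.
if len(good) >= 4:
    src = np.float32([kp_q[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is not None:
        print(int(inliers.sum()), "geometrically consistent matches")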
Other applications of specific object recognition include image matching for creating panoramas, object recognition for facial and license-plate recognition systems, and large-scale image retrieval for gathering images with features similar to those of the input image. Like humans, today's computers and machines excel at these kinds of specific object recognition. Automating the creation of panoramas, the search for a particular face, and the identification of a speeder's license plate eliminates time-consuming manual labor and increases the efficiency and throughput of each job. In addition, a wealth of information available at our fingertips is certainly empowering. But we should not ignore the cases where pieces of a panorama are misaligned, a face is identified inaccurately, a speeding ticket is sent to the wrong driver, or Google returns a Java textbook because its cover also features the New York Public Library lion. [9] Even though these faults are very rare, and the latest OCR readers are able to achieve accuracy rates of 99.9975% [8], there is still no 100% guarantee without human review.

Generic Object Recognition

"A robust, real-world machine solution still evades us," [1] yet, as engineers, we continue to perform research and tackle the even more difficult case of generic, category-based object recognition. Just as with specific object recognition, generic object recognition follows three basic steps: "(1) Choose a representation and accompanying model (which may be hand-crafted, learned, or some combination thereof). (2) Given a novel image, search for evidence supporting the object models, and assign scores or confidences to all such candidates. (3) Take care to suppress any redundant or conflicting detections." [2]

Imagine for a moment that the Golden Gate Bridge were not famous enough to warrant a name; it is just another suspension bridge, or any other bridge for that matter. If we run its image through a generic, category-level object recognition system, the system first represents the image description with one of two types of models. The first type is a window-based model, where appearance is described for a particular rectangular region of interest. The other type is a parts-based model, which combines "separate descriptors for the appearance of a set of local parts together with a geometric layout." [2] Think of each part as a small window. Now let us consider six local parts for our bridge: a pair of parts at the tops of the towers where the cables connect, another pair where the towers meet the roadway, and a final pair where the towers are fixed into the concrete above the waterline. This set of parts forms two rectangles, the top one taller than the bottom, which together can be used to represent a bridge. With a window-based model, the bridge would instead be represented by a single rectangle that encloses the entire bridge.

Recognizing our object's category using the window-based model is an image classification problem. In other words, a classifier determines whether a bridge is present in or absent from the given window. [2] Detection with the window-based model is algorithmically simpler, since it considers the holistic appearance of the window. Of course, to be accurate, the window of our bridge must have a certain level of invariance relative to the windows of bridges in the images of the system's database.
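In code, window-based detection amounts to sliding a fixed-size window across the image and scoring each position with a classifier. The sketch below assumes a hypothetical classifier function (say, a trained bridge/non-bridge model) and made-up window and stride sizes; it shows the search structure, not any particular system.

import numpy as np

def sliding_window_detect(image, classifier, window=(128, 64), stride=16,
                          threshold=0.5):
    """Score every window position with `classifier` (hypothetical: maps an
    image patch to a confidence in [0, 1]) and keep the confident ones."""
    win_h, win_w = window
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            patch = image[y:y + win_h, x:x + win_w]
            score = classifier(patch)
            if score > threshold:
                detections.append((x, y, win_w, win_h, score))
    # A complete detector would also scan an image pyramid to handle scale,
    # and then suppress redundant overlapping detections, as step (3) above
    # requires.
    return detections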
With parts-based models, detection relies on more complex search procedures that match both the parts and their geometric relationships to each other. [2] Once the system returns a match, given that our bridge has a similar window or that its local parts and their spatial layout are similar to those of other bridges, the final step is to verify that the bridge is actually a bridge, using the same verification technique as in specific object recognition.

Instances of generic object recognition have been successfully implemented and are in use today. Applications include face detection (not to be confused with facial recognition) and person detection. The latter is used to count people, measure occupancy, and study crowds. [10] Face detection is a feature common to many of today's consumer digital cameras, which overlay a bounding box around a subject's face in the shot. Some higher-end models can even detect a smile on a person's face, automatically triggering the shutter. The ease with which digital cameras perform this task is attributed to the high similarity in the patterns of different face instances and of different standing persons, which makes recognizing the presence of a face or person relatively simple and fast. [2]
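As a concrete, hedged illustration of such a detector, the snippet below runs the Haar-cascade face detector that ships with OpenCV; consumer cameras use their own proprietary methods, and the photo file names here are placeholders.

import cv2

# Load the frontal-face Haar cascade bundled with the opencv-python package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("photo.jpg")  # placeholder input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Scan windows over multiple scales; each hit is a face bounding box,
# much like the overlay a consumer camera draws in the viewfinder.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces_detected.jpg", image)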
Limitations

But not all objects are easy to recognize. Object recognition based on two-dimensional images is inherently limited. The object in an image will have a certain position and pose, it may be partially occluded, or the lighting and background may vary in comparison with the training images. In contrast to digital recognition, human vision operates in a high-dimensional space. With each glimpse, an image is projected into the eye and conveyed to the brain as a spiking activity pattern across roughly one million retinal ganglion cells. In other words, each image is one point in a roughly one-million-dimensional retinal ganglion cell representation. [1] Thus it is easy for us to recognize objects even when they are shown in different positions, poses, and settings.

In an attempt to match human visual prowess, researchers have begun training machines to recognize objects, associating an object with its corresponding category, using better training images. However, training images are not immune to the aforementioned limitations, making it difficult to benchmark the true capabilities of artificial object recognition. By using sets of more "natural" images, researchers hope to capture the essence of the problems encountered in the real world. But even though a set may contain a large number of images, the variations in object pose and other factors are poorly defined and not varied systematically, leading to inconsistencies. Furthermore, the majority of images are "composed," meaning that the photographer decided how the shot should be framed, resulting in deliberately positioned objects and eliminating randomness. As a result, the shots may not properly reflect the variation found in reality. [11]

Conclusion

These issues have not stopped scientists and engineers from trying to artificially replicate object recognition. In fact, significant progress has been made since the days of the early OCR devices. OCR is a mature and commercialized technology. Specific object recognition performs well for the objects it can recognize. And in recent years, methods for generic object recognition have shown much better performance, with around 60% accuracy [11], which suggests that even though these approaches are still well below human performance, they are at least heading in the right direction. As we continue to marvel at our own visual recognition prowess and the progress made to artificially replicate it, we can only ponder whether that progress may one day catch up with, and even surpass, our own.

References

[1] J. J. DiCarlo and D. D. Cox, "Untangling invariant object recognition," Trends in Cognitive Sciences, vol. 11, no. 8, pp. 333-341, July 2007.

[2] K. Grauman and B. Leibe, Visual Object Recognition, Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2011.

[3] G. Tauschek, "Reading machine," U.S. Patent 2 026 329, Dec. 31, 1935.

[4] P. W. Handel, "Statistical machine," U.S. Patent 1 915 993, June 27, 1933.

[5] D. H. Shepard, "Apparatus for reading," U.S. Patent 2 663 758, Dec. 22, 1953.

[6] M. Mann, "Reading machine spells out loud," Popular Science, vol. 154, no. 2, pp. 125-127, Feb. 1949.

[7] R. Kurzweil, The Age of Spiritual Machines. New York: Viking, 1999.

[8] E. T. Eaton, "Limiting storage or transmission of visual information using optical character recognition," U.S. Patent 7 092 568, Aug. 15, 2006.

[9] S. Segan, "Hands on with Google Goggles: New York City," PC Magazine, Dec. 2009.

[10] Z. Zhang et al., "A robust human detection and tracking system using a human-model-based camera calibration," in Proc. Eighth Int. Workshop on Visual Surveillance, 2008.

[11] N. Pinto et al., "Why is real-world visual object recognition hard?," PLoS Computational Biology, vol. 4, no. 1, e27, Jan. 2008.