TECHNOLOGY by Massimo Piccardi and Tony Jan Recent Advances in Computer Vision omputer vision is the sumer camcorders are based branch of artificial intelon standards such as the ligence that focuses on proDigital Video (DV), which viding computers with the provides videos with 720 × functions typical of human 480 pixels/frame at a rate of vision. To date, computer 30 frames/s. Even Webcams vision has produced imporcan now provide images of tant applications in fields satisfactory quality at prices such as industrial automastarting as low as $25. tion, robotics, biomedicine, The availability of affordand satellite observation of able hardware and software Earth. In the field of industrihas opened the way for new, al automation alone, its applipervasive applications of cations include guidance for computer vision. These robots to correctly pick up applications have one factor and place manufactured in common. They tend to be parts, nondestructive quality human-centered; that is, and integrity inspection, and either humans are the taron-line measurements. gets of the vision system or Until a few years ago, they wander about wearing chronic problems affected small cameras, or sometimes computer-vision systems and both. Vision systems have prevented their widespread Figure 1. The Face Detection Project can automatically distinguish become the central sensor in adoption. Since its start, comapplications such as puter vision has appeared as images containing faces from other images and put a box around • human-computer interfaces a computationally intensive each detected face—frontal, profile, or three-quarter images. (HCIs), the links between and almost intractable field computers and their users; because its algorithms require a minimum recent years, however, increased perfor- • augmented perception, tools that increase of hundreds of MIPS (millions of instruc- mance at the system level—faster micronormal perception capabilities of humans; tions per second) to be executed in accept- processors, faster and larger memories, and • automatic media interpretation, which able real time. Even the input–output of faster and wider buses—has made computprovides an understanding of the conhigh-resolution images at video rate was er vision affordable on a wide scale. Fast tent of modern digital media, such as traditionally a bottleneck for common com- microprocessors and digital-signal procesvideos and movies, without the need for puting platforms such as personal comput- sors are now available as off-the-shelf soluhuman intervention or annotation; and ers and workstations. To solve these prob- tions, and some of them can execute calcu- • video surveillance and biometrics. lems, the research community has produced lations at rates of thousands of MIPS. The an impressive number of dedicated com- Texas Instruments C6414 processor, for Human-computer interfaces puter-vision systems. One such famous sys- example, runs at 600 MHz and can achieve The basic idea behind the use of comtem was the Massively Parallel Processor a peak performance of 4,800 MIPS. High- puter vision in HCIs is that in several appli(MPP), designed at the Goddard Space speed serial buses such as the IEEE 1394 cations, computers can be instructed more Flight Center in 1983 and operated there and USB 2.0 are capable of transferring naturally by human gestures than by the until 1991. The MPP used an array of hundreds of megabits per second, a rate use of a keyboard or mouse. In one inter16,384 single-bit processors and was capa- that greatly exceeds the requirements of esting application, computer scientist ble at peak performance of 250 million any common high-resolution video camera. James L. Crowley of the National Polytechfloating-point operations/s—an impressive These buses are already integrated into the nical Institute of Grenoble in France and most recent personal computer chipsets or his colleagues used human eye movements feat at the time. Dedicated computers such as the MMP are available as inexpensive daughter- to scroll a computer screen up and down. A have always received a cold reception from boards. Moreover, video cameras have gone camera located on top of the screen tracked industr y because they were expensive, almost completely to digital, and they come the eye movements. The French researchers cumbersome, and difficult to program. In in several price ranges and types. Con- reported that a trained operator could comHenry Schneiderman and Takeo Kanade, Carnegie Mellon University C FEBRUARY/MARCH 2003 © American Institute of Physics 18 The Industrial Physicist Institute for Information Technology National Research Council Canada Technology many hard-to-find remote controls around today’s homes, provide environmental surveillance, and turn the TV off when you fall asleep in your favourite armchair. Figure 2. A camera tracks the point of each player’s nose closest to the camera and links it to the The Voice demos/faceindex—one is shown in Figure 1 —and anyone can submit an image to http://www.vasc.ri.cmu.edu/cgi-bin/demos/ findface.cgi, which will process the image overnight and depict all detected faces with a box around them. However, computer vision can do much more for multimedia. For example, it is an invaluable support to recent multimedia standards aimed at compressing digital videos—reducing their size in bytes—while still retaining acceptable visual quality. One such standard is MPEG-4 from the Moving Picture Expert Group, which allows the compression of different objects in a scene with specific compression levels in such a way as to adjust the trade-off between space reduction and visual quality on a per-object basis. The basic idea is that important objects such as actors should retain the highest visual quality, while objects in the background can be encoded with lower quality to save bytes. Nonetheless, MPEG-4 is silent on how to separate a video into the objects of which it is composed. Here again, computer vision can help with a variety of techniques that perform the task automatically. Another application is The vOICe, developed at Philips return the computer ball across the “net.” Research Laboratories (Eindhoven, The Netherlands) by Peter B. L. plete a given task 32% faster by using his Meijer and available online for testing at eyes rather than a keyboard or mouse to http://www.seeingwithsound.com. The direct screen scrolling. In general, using vOICe provides a simple yet effective means cameras to sense human gestures is much of augmented perception for people with pareasier than making users wear cumbersome tially impaired vision. In the virtual demonperipherals such as digital gloves. stration, the camera accompanies you in your Another interesting example of an HCI wanderings. The camera periodically scans application can be downloaded from the scene in front of you and turns images http://www.cv.iit.nrc.ca/research/Nouse/ for into sounds, using different pitches and personal testing, provided a Webcam is lengths to encode objects’ position and size. plugged into your personal computer. This application—called Nouse, for nose as a Media interpretation mouse—tracks the movements of your The use of computer vision for automatic nose, and was developed by Dmitry Gorod- media interpretation assists users in searching nichy. You can play NosePong, a nose-driven for specific scenes and shots otherwise not version of the Pong video game (Figure 2), annotated in the video-scene indexes. For or test your ability to paint with your nose example, images containing faces can be or to write with your nose. Although this automatically distinguished from other Video surveillance application is slanted toward fun, it is a con- images, as the results of the Face Detection Perhaps the most developed modern applivincing demonstration of the potential uses Project led by Henr y Schneiderman and cation of computer vision is video surveillance. of cameras as natural interfaces. In industry, Takeo Kanade at Carnegie Mellon University Long gone are the days when video surveilfor example, an operator might quickly stop (CMU) prove. The CMU face detector is lance meant low-resolution, black-and-white, a conveyor belt with a specific gesture considered the most accurate for frontal face analog closed-circuit television. Nowadays, detected by a camera without needing to detection and is also reliable for facial pro- computer vision enables the integration of physically push a button, pull a lever, or files and three-quarter images. Many exam- views from many cameras into a single, concarry a remote control. ples are available at http://vasc.ri.cmu.edu/ sistent “superimage.” Such an image autoCameras could matically detects scenes also become powerwith people and/or vehiful peripherals for cles or other targets of the so-called intelliinterest, classifies them gent home. A camin categories such as era located in your people, cars, bicycles, or living room would buses, extracts their traperform several jectories, recognizes tasks, starting with limb and arm positions, sensing a human and provides some form b presence and then a of behavior analysis. turning the lights on Figure 3. This parking-lot surveillance system subtracts the static background The analysis relies on and the heat up. a list of previously specIndeed, cameras image, distinguishes a person from moving vehicles, locates the head, and cal- ified behaviors or on could replace the culates the speed of the head in each frame. statistical observations University of Technology, Sydney, Australia red “bat” at the top (or bottom) of the table to 20 The Industrial Physicist Speed of head 20 a Normal behavior 15 10 Speed of head University of Technology, Sydney, Australia For those who want to build their own surveillance systems, an enormous amount 5 of equipment is available. Web sites of manufacturers such as Sony, Axis, Pelco, 0 0 5 10 15 20 25 30 35 40 45 50 and many others offer a wide range of cameras. You can find network cameras starting Video frames (at 5 frames per second) at less than $500 that can be simply plugged Abnormal behavior b into any network, such as a TCP-IP, which 20 can carry a full Web server and allow camera frames to be downloaded and processed. 15 Adjustable pan–tilt–zoom cameras can be used to point and focus on specific targets 10 over wide survey areas. And if cabling poses a problem because of camera location, 5 wireless versions are available off-the-shelf. Computer vision, already a useful aid in 0 0 5 10 15 20 25 30 35 40 45 50 several industrial processes, will find increasing uses as companies develop new applicaVideo frames (at 5 frames per second) tions in areas such as HCI, augmented perFigure 4. Examples of the speed of the head (in pixels per frame) of a person in ception, and automatic media interpretation. the parking lot exhibiting normal behavior (a) and abnormal behavior (b). Such Its potential to improve plant and public safety is attracting increasing attention in video surveillance might alert a security guard to a possible car thief. today’s security-conscious world. such as frequent-versus-infrequent behav- that represents only the static objects in the iors. The basic goal is not to completely scene—from the current frame (Figures 3a Further reading Crowley, J. L; Coutaz, J.; Bérard, F. Perreplace security personnel but to assist them and 3b). The next step is to distinguish peoin supervising wider areas and focusing their ple from moving vehicles on the basis of a ceptual user interfaces: things that see. attention on events of interest. Although the form factor, such as the height:width ratio, Commun. ACM 2000, 43 (3), 54–64. Jan, T.; Piccardi, M.; Hintz, T. Automated critical issue of privacy must be addressed and to locate their heads as the top region before society widely adopts these video sur- in their silhouette. In this way, the head’s Human Behaviour Classification using veillance systems, the recent need for speed at each frame is automatically deter- Modified Probabilistic Neural Network. In increased security has made them more like- mined. Then, a series of speed samples are Proc. Int. Conf. Computational Intelligence for ly to win general acceptance. In addition, repeatedly measured for each person in the Modelling, Control and Automation; CIMCA several technical countermeasures can be scene. Each series covers an interval of 2003, Vienna, Austria, Feb. 12–14, 2003. National Instruments Corp. (Austin, taken to prevent privacy abuses, such as pro- about 10 s, which is enough to detect suspiTX), markets a range of computer-vision tecting access to video footage by way of cious behavior patterns (Figure 4). passwords and encryption. Finally, a neural network classifier, trained products. Its LabView-based Vision line At the University of Technology in Syd- to recognize the suspicious behaviors, pro- focuses on industrial and scientific uses. ney, Australia, we have developed and test- vides the behavior classification. In the http://sine.ni.com/apps/we/nioc.vp?cid=1 ed a system that can detect suspicious experiments we performed, the system 286&lang=US. Pavlidis, I.; Morellas, V.; Tsiamyrtzis, P.; pedestrian behavior in parking lots. Our achieved good accuracy, with a reasonably approach is based on the assumption that a limited number of false dismissals and false Harp, S. Urban surveillance systems: from suspicious behavior corresponds to an indi- alarms—4% and 2%, respectively, among the laboratory to the commercial world. vidual’s erratic walking trajector y. The more than 100 test samples. Although man- Proc. IEEE 2001, 89 (10), 1478–1496. Ω rationale behind this assumption is that a ufacturers and operators of surveillance sysB I O G R A P H Y potential offender will wander about and tems have often been reluctant to accept Massimo Piccardi (massimo@it.uts. stop between different cars to inspect their innovation, recent results from research labedu.au) is an associate professor of comcontents, whereas normal users will main- oratories of major companies prove that puter science and Tony Jan (jant@ tain a more direct path of travel. these systems are now reliable, economical, it.uts.edu.au) is a lecturer in the departThe first step consists of detecting all the and ready for commercialization. One examment of computer systems at the Universimoving objects in the scene by subtracting ple is DETER from Honeywell Labs, a prototy of Technology in Sydney, Australia. an estimated “background image”—one type urban-surveillance system. 21 The Industrial Physicist