Recent Advances in Computer Vision
by Massimo Piccardi and Tony Jan
Computer vision is the branch of artificial intelligence that focuses on providing computers with the functions typical of human vision. To date, computer vision has produced important applications in fields such as industrial automation, robotics, biomedicine, and satellite observation of Earth. In the field of industrial automation alone, its applications include guidance for robots to correctly pick up and place manufactured parts, nondestructive quality and integrity inspection, and online measurements.

Until a few years ago, chronic problems affected computer-vision systems and prevented their widespread adoption. From its beginnings, computer vision has been regarded as a computationally intensive and almost intractable field because its algorithms require at least hundreds of MIPS (millions of instructions per second) to run in acceptable real time. Even the input–output of high-resolution images at video rate was traditionally a bottleneck for common computing platforms such as personal computers and workstations. To solve these problems, the research community has produced an impressive number of dedicated computer-vision systems. One famous system was the Massively Parallel Processor (MPP), designed at the Goddard Space Flight Center in 1983 and operated there until 1991. The MPP used an array of 16,384 single-bit processors and was capable at peak performance of 250 million floating-point operations/s, an impressive feat at the time.

Dedicated computers such as the MPP have always received a cold reception from industry because they were expensive, cumbersome, and difficult to program. In recent years, however, increased performance at the system level (faster microprocessors, faster and larger memories, and faster and wider buses) has made computer vision affordable on a wide scale. Fast microprocessors and digital-signal processors are now available as off-the-shelf solutions, and some of them can execute calculations at rates of thousands of MIPS. The Texas Instruments C6414 processor, for example, runs at 600 MHz and can achieve a peak performance of 4,800 MIPS. High-speed serial buses such as IEEE 1394 and USB 2.0 are capable of transferring hundreds of megabits per second, a rate that greatly exceeds the requirements of any common high-resolution video camera. These buses are already integrated into the most recent personal computer chipsets or are available as inexpensive daughterboards. Moreover, video cameras have gone almost completely digital, and they come in several price ranges and types. Consumer camcorders are based on standards such as Digital Video (DV), which provides videos with 720 × 480 pixels/frame at a rate of 30 frames/s. Even Webcams can now provide images of satisfactory quality at prices starting as low as $25.

The availability of affordable hardware and software has opened the way for new, pervasive applications of computer vision. These applications have one factor in common: they tend to be human-centered. That is, either humans are the targets of the vision system or they wander about wearing small cameras, or sometimes both. Vision systems have become the central sensor in applications such as
• human-computer interfaces (HCIs), the links between computers and their users;
• augmented perception, tools that increase normal perception capabilities of humans;
• automatic media interpretation, which provides an understanding of the content of modern digital media, such as videos and movies, without the need for human intervention or annotation; and
• video surveillance and biometrics.

Human-computer interfaces

The basic idea behind the use of computer vision in HCIs is that in several applications, computers can be instructed more naturally by human gestures than by the use of a keyboard or mouse. In one interesting application, computer scientist James L. Crowley of the National Polytechnical Institute of Grenoble in France and his colleagues used human eye movements to scroll a computer screen up and down. A camera located on top of the screen tracked the eye movements. The French researchers reported that a trained operator could complete a given task 32% faster by using his eyes rather than a keyboard or mouse to direct screen scrolling. In general, using cameras to sense human gestures is much easier than making users wear cumbersome peripherals such as digital gloves.

Another interesting example of an HCI application can be downloaded from http://www.cv.iit.nrc.ca/research/Nouse/ for personal testing, provided a Webcam is plugged into your personal computer. This application, called Nouse (for nose as a mouse), tracks the movements of your nose and was developed by Dmitry Gorodnichy. You can play NosePong, a nose-driven version of the Pong video game (Figure 2), or test your ability to paint or write with your nose. Although this application is slanted toward fun, it is a convincing demonstration of the potential uses of cameras as natural interfaces. In industry, for example, an operator might quickly stop a conveyor belt with a specific gesture detected by a camera without needing to physically push a button, pull a lever, or carry a remote control.

Figure 2. A camera tracks the point of each player’s nose closest to the camera and links it to the red “bat” at the top (or bottom) of the table to return the computer ball across the “net.”

Cameras could also become powerful peripherals for the so-called intelligent home. A camera located in your living room would perform tasks, starting with sensing a human presence and then turning the lights on and the heat up. Indeed, cameras could replace the many hard-to-find remote controls around today’s homes, provide environmental surveillance, and turn the TV off when you fall asleep in your favorite armchair.

Another application is The vOICe, developed at Philips Research Laboratories (Eindhoven, The Netherlands) by Peter B. L. Meijer and available online for testing at http://www.seeingwithsound.com. The vOICe provides a simple yet effective means of augmented perception for people with partially impaired vision. In the virtual demonstration, the camera accompanies you in your wanderings. The camera periodically scans the scene in front of you and turns images into sounds, using different pitches and lengths to encode objects’ position and size.

Media interpretation

The use of computer vision for automatic media interpretation assists users in searching for specific scenes and shots otherwise not annotated in the video-scene indexes. For example, images containing faces can be automatically distinguished from other images, as the results of the Face Detection Project led by Henry Schneiderman and Takeo Kanade at Carnegie Mellon University (CMU) prove. The CMU face detector is considered the most accurate for frontal face detection and is also reliable for facial profiles and three-quarter images. Many examples are available at http://vasc.ri.cmu.edu/demos/faceindex (one is shown in Figure 1), and anyone can submit an image to findface.cgi, which will process the image overnight and depict all detected faces with a box around them.

Figure 1. The Face Detection Project can automatically distinguish images containing faces from other images and put a box around each detected face, whether frontal, profile, or three-quarter. (Henry Schneiderman and Takeo Kanade, Carnegie Mellon University)

However, computer vision can do much more for multimedia. For example, it is an invaluable support to recent multimedia standards aimed at compressing digital videos (reducing their size in bytes) while still retaining acceptable visual quality. One such standard is MPEG-4 from the Moving Picture Experts Group, which allows the different objects in a scene to be compressed at specific compression levels, so that the trade-off between space reduction and visual quality can be adjusted on a per-object basis. The basic idea is that important objects such as actors should retain the highest visual quality, while objects in the background can be encoded with lower quality to save bytes. Nonetheless, MPEG-4 is silent on how to separate a video into the objects of which it is composed. Here again, computer vision can help with a variety of techniques that perform the task automatically.

Video surveillance

Perhaps the most developed modern application of computer vision is video surveillance. Long gone are the days when video surveillance meant low-resolution, black-and-white, analog closed-circuit television. Nowadays, computer vision enables the integration of views from many cameras into a single, consistent “superimage.” A system built on such an image automatically detects scenes with people, vehicles, or other targets of interest, classifies them in categories such as people, cars, bicycles, or buses, extracts their trajectories, recognizes limb and arm positions, and provides some form of behavior analysis. The analysis relies on a list of previously specified behaviors or on statistical observations such as frequent-versus-infrequent behaviors. The basic goal is not to completely replace security personnel but to assist them in supervising wider areas and focusing their attention on events of interest. Although the critical issue of privacy must be addressed before society widely adopts these video surveillance systems, the recent need for increased security has made them more likely to win general acceptance. In addition, several technical countermeasures can be taken to prevent privacy abuses, such as protecting access to video footage by way of passwords and encryption.

At the University of Technology in Sydney, Australia, we have developed and tested a system that can detect suspicious pedestrian behavior in parking lots. Our approach is based on the assumption that a suspicious behavior corresponds to an individual’s erratic walking trajectory. The rationale behind this assumption is that a potential offender will wander about and stop between different cars to inspect their contents, whereas normal users will maintain a more direct path of travel.

The first step consists of detecting all the moving objects in the scene by subtracting an estimated “background image” (one that represents only the static objects in the scene) from the current frame (Figures 3a and 3b). The next step is to distinguish people from moving vehicles on the basis of a form factor, such as the height:width ratio, and to locate their heads as the top region in their silhouette. In this way, the head’s speed at each frame is automatically determined. Then, a series of speed samples is repeatedly measured for each person in the scene. Each series covers an interval of about 10 s, which is enough to detect suspicious behavior patterns (Figure 4).

Figure 3. This parking-lot surveillance system subtracts the static background image, distinguishes a person from moving vehicles, locates the head, and calculates the speed of the head in each frame.

Figure 4. Examples of the speed of the head (in pixels per frame) of a person in the parking lot exhibiting normal behavior (a) and abnormal behavior (b). Such video surveillance might alert a security guard to a possible car thief. (Plot axes: speed of head versus video frames, at 5 frames per second.)

Finally, a neural network classifier, trained to recognize the suspicious behaviors, provides the behavior classification. In the experiments we performed, the system achieved good accuracy, with a reasonably limited number of false dismissals and false alarms (4% and 2%, respectively) among more than 100 test samples. Although manufacturers and operators of surveillance systems have often been reluctant to accept innovation, recent results from research laboratories of major companies prove that these systems are now reliable, economical, and ready for commercialization. One example is DETER from Honeywell Labs, a prototype urban-surveillance system.

For those who want to build their own surveillance systems, an enormous amount of equipment is available. Web sites of manufacturers such as Sony, Axis, Pelco, and many others offer a wide range of cameras. You can find network cameras starting at less than $500 that can be simply plugged into any network, such as a TCP/IP network, and that can carry a full Web server and allow camera frames to be downloaded and processed. Adjustable pan–tilt–zoom cameras can be used to point and focus on specific targets over wide survey areas. And if cabling poses a problem because of camera location, wireless versions are available off-the-shelf.

Computer vision, already a useful aid in several industrial processes, will find increasing uses as companies develop new applications in areas such as HCI, augmented perception, and automatic media interpretation. Its potential to improve plant and public safety is attracting increasing attention in today’s security-conscious world.

Further reading

Crowley, J. L.; Coutaz, J.; Bérard, F. Perceptual user interfaces: things that see. Commun. ACM 2000, 43 (3), 54–64.
Jan, T.; Piccardi, M.; Hintz, T. Automated Human Behaviour Classification using Modified Probabilistic Neural Network. In Proc. Int. Conf. Computational Intelligence for Modelling, Control and Automation; CIMCA 2003, Vienna, Austria, Feb. 12–14, 2003.
National Instruments Corp. (Austin, TX) markets a range of computer-vision products. Its LabView-based Vision line focuses on industrial and scientific uses. http://sine.ni.com/apps/we/nioc.vp?cid=1286&lang=US.
Pavlidis, I.; Morellas, V.; Tsiamyrtzis, P.; Harp, S. Urban surveillance systems: from the laboratory to the commercial world. Proc. IEEE 2001, 89 (10), 1478–1496.

Biography

Massimo Piccardi (massimo@it.uts.edu.au) is an associate professor of computer science and Tony Jan (jant@it.uts.edu.au) is a lecturer in the department of computer systems at the University of Technology in Sydney, Australia.
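
For readers who want to experiment, the parking-lot steps described in this article (background subtraction, a height:width form factor, head location, and per-frame head speed) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It assumes grey-level frames stored as lists of rows and a single moving object per frame. The function names, the difference threshold of 30, the minimum ratio of 1.5, and the 20% "head" fraction are all hypothetical choices made for the example. The neural network classifier that labels a speed series as normal or suspicious is not shown.

```python
import math

def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose grey level differs from the background by more than threshold."""
    return [[abs(p - b) > threshold for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def bounding_box(mask):
    """Return (top, left, bottom, right) of the foreground pixels, or None if there are none."""
    points = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not points:
        return None
    rows = [r for r, _ in points]
    cols = [c for _, c in points]
    return min(rows), min(cols), max(rows), max(cols)

def looks_like_person(box, min_ratio=1.5):
    """Form-factor test (assumed cutoff): a walking person is taller than wide."""
    top, left, bottom, right = box
    height = bottom - top + 1
    width = right - left + 1
    return height / width >= min_ratio

def head_position(mask, box, head_fraction=0.2):
    """Centroid (row, col) of the foreground pixels in the top part of the silhouette."""
    top, left, bottom, right = box
    head_rows = max(1, round((bottom - top + 1) * head_fraction))
    points = [(r, c)
              for r in range(top, top + head_rows)
              for c in range(left, right + 1) if mask[r][c]]
    return (sum(r for r, _ in points) / len(points),
            sum(c for _, c in points) / len(points))

def head_speeds(positions):
    """Distance moved by the head between consecutive frames, in pixels per frame."""
    return [math.hypot(b[0] - a[0], b[1] - a[1])
            for a, b in zip(positions, positions[1:])]

# Synthetic demo: an empty 10 x 8 background and a bright 6 x 2 "person"
# whose left edge is at columns 1, 2, and 4 in three successive frames.
background = [[0] * 8 for _ in range(10)]

def frame_with_person(left):
    frame = [row[:] for row in background]
    for r in range(2, 8):
        for c in range(left, left + 2):
            frame[r][c] = 200
    return frame

positions = []
for left in (1, 2, 4):
    mask = foreground_mask(frame_with_person(left), background)
    box = bounding_box(mask)
    if box is not None and looks_like_person(box):
        positions.append(head_position(mask, box))

speeds = head_speeds(positions)
print(speeds)
```

At the 5 frames per second shown in Figure 4, a series covering about 10 s would hold roughly 50 such speed samples, and that series is what a classifier would receive as input. A real system would also need to label and track several objects at once and to update the background estimate over time, both of which this sketch leaves out.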
