Design and Application of a Head Detection and
Tracking System
by
Jared Smith-Mickelson
Submitted to the
Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
Massachusetts Institute of Technology
June 2000
Copyright 2000 Jared Smith-Mickelson. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis
and to grant others the right to do so.
Author: Department of Electrical Engineering and Computer Science, May 22, 2000

Certified by: Trevor J. Darrell, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses
Design and Application of a Head Detection and Tracking System
by
Jared Smith-Mickelson
Submitted to the Department of Electrical Engineering and Computer Science
on May 22, 2000, in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
This vision system is designed to detect and track the head of a subject moving about
within HAL, an intelligent environment. It monitors activity in the room through a
stereo camera pair and detects heads using shape, motion, and size cues. Once a head
is found, the system tracks its three-dimensional position in real time. To test and
demonstrate the system, an automated teleconferencing application was developed.
The head coordinates from the tracking system are transformed into pan and tilt
directives to drive two steerable teleconferencing cameras. As the subject moves
about the room, these cameras keep the head within their field of view.
Thesis Supervisor: Trevor J. Darrell
Title: Assistant Professor
Acknowledgments
Foremost, I would like to thank the three advisors I have had over the course of
this thesis's development. Professor Charles Sodini and Associate Director Howard
Shrobe were my instructors for a class entitled Technology Demonstration Systems.
They gave me the opportunity to explore various areas of research, led me through the
process of refining and preparing a proposal, and helped me begin work at an early
stage. Assistant Professor Trevor Darrell was research advisor to me during the later
stages of my work and influential on a technical level. Having in-depth knowledge of
the field, he was able to discuss with me the finer details of my work, offering relevant
suggestions and pointing me to key papers.
I would also like to give thanks to Michael Coen. As founder and head of the HAL
project, he played a critical role in helping integrate my work into the room. He was
always available to answer questions, share new ideas, and give feedback.
Lastly, I wish to thank two other members of the HAL project, Krzysztof Gajos
and Stephen Peters, both of whom willingly provided their support and were a pleasure to work with.
Contents
1 Introduction
2 Related Work
3 System Implementation
   3.1 Overview
   3.2 Room Layout
   3.3 Head Detection
      3.3.1 Motion Detection
      3.3.2 Determining Depth
      3.3.3 Finding a Coherent Line of Moving Pixels
      3.3.4 Ellipse Fitting
   3.4 Head Tracking
      3.4.1 Determining Accurate Depth
   3.5 Transforming Coordinates
   3.6 Equipment
4 Results
   4.1 Head Detection and Tracking Results
      4.1.1 Continuous Detection
      4.1.2 Occlusion
      4.1.3 Out-of-plane Rotation
      4.1.4 Head Decoys
      4.1.5 Accuracy of Depth Calculation
   4.2 Teleconferencing Results
5 Conclusion
   5.1 Future Work
List of Figures
1-1 Stereo Camera Pair's View of the Intelligent Room
3-1 System Flow Chart
3-2 HAL Layout
3-3 Motion Detection Using Frame Differencing and Thresholding
3-4 Calculation of Depth to High Motion in Each Column
3-5 Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a Correct Detection
3-6 Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a False Positive
3-7 Template of Head From Left and Right Image
3-8 Stereo Camera Pair
3-9 Steerable Teleconferencing Camera
4-1 Consecutive Frames of a Sequence Illustrating the Benefits of Continuous Detection
4-2 Consecutive Frames of an Initialization Sequence
4-3 Every Fourth Frame of an Occlusion Sequence
4-4 Consecutive Frames of an Occlusion Sequence Using Ellipse Tracking
4-5 Consecutive Frames of an Occlusion Sequence Using Template Tracking
4-6 Every Tenth Frame of a Rotation Sequence
4-7 Every Tenth Frame of a Decoy Sequence
4-8 Every Tenth Frame of a Depth Test Sequence
4-9 Plot of Depth to Head Over Time
Chapter 1
Introduction
The continued rise in processor power has recently dropped the computation time
of many computer vision tasks to a point where they can be used in real-time systems. One area of computer vision research that has benefited from this advance and
received considerable attention in the last few years is person tracking. Person tracking is a broad field encompassing the detection, tracking, and recognition of bodies,
heads, faces, expressions, gestures, actions, and gaze directions. Applications include
surveillance, human computer interaction, teleconferencing, computer animation, virtual holography, and intelligent environments. This thesis describes the implementation of a head detection and tracking system. To test and demonstrate the system,
the output of the tracker is used to guide the movement of steerable cameras in an
automated teleconferencing system.
The head detection and tracking system was built as an extension to the AI Lab's
intelligent environment known as HAL. The development of HAL is a continuing effort
to create a space in which humans can interact naturally, through voice and gesture,
with computer agents that control the room's appliances and software applications.
The tracking system monitors activity in the room through a monochrome stereo
camera pair mounted high up on a wall opposite a couch. The stereo pair's view of
HAL is shown in Figure 1-1.
Figure 1-1: Stereo Camera Pair's View of the Intelligent Room. The top image is
from the left camera. The bottom is from the right.
The system takes a multi-modal approach to detection, using shape, motion, and
size cues. At the core of the detector is an elliptical shape filter. Applying the filter at
multiple scales to the entire image is costly. To achieve real time detection, the system
uses a novel combination of depth and motion information to narrow the search. The
use of these cues also helps reduce the detector's rate of false positives.
Once a head is detected, the system tracks it using the elliptical shape filter,
constrained by depth and a simple velocity model. The same head is tracked until
the continuously running detector presents a better candidate.
As a head is tracked, its depth is calculated using normalized correlation and
refined by computing the parametric optical flow across the matching templates from
the left and right images. This refinement is essential for accurate depth.
The three-dimensional coordinates of the head relative to the stereo pair are transformed into pan and tilt directives for HAL's two teleconferencing cameras. The result
is an automated teleconferencing system whereby HAL's two steerable cameras keep
the subject's head in their field of view.
Chapter 2
Related Work
The survey of existing systems presented in this section is by no means exhaustive.
A comprehensive overview of the vast number of systems described in the literature
is beyond the scope of this document. The systems mentioned here were chosen to
illustrate the broad variety of approaches taken to the problem of head detection and
tracking.
An earlier vision system for HAL is described by Coen and Wilson in [3]. By
applying background differencing, skin color detection, and face detection algorithms
to images from every camera in the room, the system builds hyper-dimensional feature
vectors.
These vectors are classified using previously trained statistical models of
events. Events are qualitative in nature: someone is sitting on the couch, someone is
standing in front of the couch, someone is lying on the couch.
Background differencing is carried out on images from HAL's steerable teleconferencing cameras by synthesizing a background from a panoramic image constructed
offline. To detect skin color the algorithm described in [5] is used. This technique
involves classifying each pixel as "skin" or "not skin" using empirically estimated
Gaussian models in the log color-opponent space. The face detection algorithm used
is the same as that used by Cog, the humanoid robot [12]. A face is detected when
ratios between the average intensities of various spatial regions are satisfied.
All three of these modules are prone to error. Because the teleconferencing cameras do not rotate about their centers of projection, background differencing is unreliable, especially for regions physically close to the cameras. Backgrounds also tend to
change with lighting and human use. The face detector works only for frontal views.
And the skin color detector will find all skin colored objects in the room, regardless
of whether they belong to a person. Fortunately, these nonidealities are permissible
as the system has only to decide between a limited number of qualitative events. The
system cannot, however, produce accurate coordinates of a subject's head.
Rowley et al., at CMU, have a neural network that detects frontal upright views
of faces [10]. The network was trained using hand labeled images of over one thousand faces taken from the Web. To refine the network, negative examples were also
presented during training. The negative examples were chosen in a bootstrapping
fashion by applying early versions of the network to natural images containing no
faces.
In a later paper [11], an addition to the system is described which allows it to
detect faces with arbitrary in-plane rotation. The addition is a neural net router which
determines and corrects the orientation of each candidate location before it is passed
to the original system.
This method, however, cannot be applied to out-of-plane
rotation. CMU's system is limited to frontal views of faces. It is also computationally
expensive and is generally applied only to static scenes. However, ten years from now,
it may be feasible to run numerous neural nets in parallel, each trained for recognizing
different orientations of the head, achieving a robust real-time continuous detection
system for human heads.
McKenna et al. developed a neural network based on CMU's and applied it to
both the detection and tracking of heads in dynamic scenes [8]. For detection, they use
motion to constrain the search space of the expensive neural net computation. After
grouping moving pixels into blobs, they estimate which part of each blob corresponds
to a head and limit the search to these regions. The details of how they estimate the
position of the head are left out, so the similarity of their approach to that described
in section 3.3 is undetermined. For tracking, a Kalman filter is used to predict the
position of the head. The neural network is applied to a small window around this
predicted position. Its output is fed back into the Kalman filter as a measurement.
Another system which addresses both detection and tracking is described by
Beymer et al. at SRI International [1]. Its continuous detector segments foreground
objects by applying background differencing to depth map sequences. It then correlates person-shaped templates with detected foreground objects to present candidates
to the tracker. The tracker uses a Kalman filter to predict position. Measurements
are made by performing correlations with an intensity template and are fed back into
the Kalman filter. To avoid drift, the position of the intensity template is re-centered
using the person-shaped template. The shape of the person templates reflects the
assumption, implied by Beymer et al., that people hold their arms at their sides and
remain upright. The system is not designed to handle exceptions to this assumption.
Pfinder, developed at the Media Lab by Wren et al., tracks people by following the
color and shape of blobs [14]. After extracting a user silhouette through background
subtraction, the system segments it into blobs by clustering feature vectors made up
of color and spatial information. The position of each blob is individually tracked
and predicted using a Kalman filter. Pixels in a new frame of video are classified
as background or as belonging to a particular blob using a maximum a posteriori
probability approach. Connectivity constraints are added and smoothing operators
are applied to improve the clarity of the blob segmentation. The system is designed to
handle shadows and occlusion. Skin color is used to help label the blobs corresponding
to the head and the hands.
Darrell et al. describe a system developed at Interval Research that integrates
depth, color, and face detection modules to detect and track heads [4]. Using real-time depth sequences from a stereo system implemented on an FPGA, images are
segmented into user silhouettes. For each silhouette, head candidates are collected
from the three modules and integrated to produce a final head position. The
depth module places candidate heads under the maxima of each silhouette. The color
module searches for skin colored regions within each silhouette in a manner derived
from [5]. Candidates are placed in skin colored regions that are high enough and of
the right size to be heads. The face detection module uses CMU's neural network [10]
and is initially run over the entire image. All resulting detections are presented as
candidates. This process is slow. In order to present candidates successively, the
positions of skin color and depth candidates are locally tracked and presented by the
face detector module if they have overlapped with a face detection hit in the recent
past. As in [2], the failure modes of the modules used in this system are claimed to
be nearly independent. When this is the case, increasing the number of modules used
greatly increases the reliability of the system.
A wholly different approach was taken by Morimoto et al. at IBM Almaden Research Center [9]. Noting that human eyes reflect near infrared light directly back
towards the source, they built a camera which captures two images: one illuminated
with on-axis infrared light and another illuminated with off-axis infrared light. Regions that are bright in the on-axis image and dark in the off-axis image generally
correspond to eyes. The positions of heads are extrapolated from the detected positions of the eyes. One drawback to the system is that since the two images are taken
asynchronously, only static eyes can be detected.
Stan Birchfield, at Stanford, has a head tracker that combines outputs from an
elliptical shape module and a color histogram module to correct the predicted position
of the subject's head. The elliptical shape module computes the average of dot product
magnitude between an ellipse normal and the image gradient at candidate positions in
a window around the predicted location. The second module computes the histogram
intersection between the newly calculated histogram at each candidate position and a
previously learned static model of the subject's hair and skin color. The claim is that
the two modules have roughly orthogonal failure modes and therefore complement
each other to produce a robust tracking system.
Birchfield's system does not address the issue of initialization except to say that
the image region used to define the subject's model histogram is set either manually
or by the output of the elliptical shape module alone. In a cluttered environment,
however, the ellipse module cannot be used to reliably find the head. In addition,
the computation time needed to run the ellipse fitting across the entire image and
at multiple scales is on the order of seconds. It is therefore unlikely that Birchfield's
system can be used to reliably detect the heads of new subjects.
The systems summarized in this section were chosen to illustrate the wide variety
of approaches taken to the problem of head detection and tracking. The use of color,
shape, motion, depth, size, background differencing, depth background differencing,
neural networks, ratio templates, and infrared reflectance have all been demonstrated.
Yet more systems have been built which use invasive techniques such as fiducials and
headgear. These have been intentionally left out. The focus of HAL is to enable
humans to interact with computer systems as they do with each other. Systems
which force the user to conform to unnatural, unfamiliar, or obtrusive methods of
interaction are not condoned. In the development of HAL, the goal is instead to
design computer systems capable of interacting on a human level.
This thesis documents an extension to Birchfield's elliptical tracker [2]. The new
system is designed to handle initialization through continuous real-time detection in
the spirit of [1]. Motion and size cues have been incorporated to compensate for the
unavailability of color. And stereo is used to compute accurate depth to the subject's
head.
Chapter 3
System Implementation
3.1 Overview
Tracking systems have been designed to handle occlusion, out-of-plane rotation,
changes in lighting, and deformation. But they are never foolproof. A usable system
must expect and be capable of recovering from tracking failure. One way to meet this
requirement is through the use of continuous real-time detection as described in [1].
If detection occurs in every frame, tracking failure, if recognized, can be immediately
corrected. The head detector described in this thesis has been designed for real-time
use and is run in parallel with the tracker. When the detector finds a better-fit candidate than is currently being tracked, it triggers the tracker to switch attention to
the new target. If the tracker is without a target, it will lock on to the first candidate
found by the detector.
The head tracker is quite robust. Its low failure rate allows a high frequency of
misses on the part of the detector. Misses cause no detrimental effects while the
tracker is correctly following the head. When the tracker fails, it usually does so in
conjunction with head motion. The detector, however, is most reliable under such
circumstances and thereby works to complement the tracker's deficiencies. The
failure modes of the detector and tracker are nearly mutually exclusive. This claim
is similar to that of Birchfield [2], mentioned in section 2.
To follow the head, the system employs an elliptical contour tracker. Head position
is predicted using a simple velocity model. An ellipse is then best fit to the image
gradient in a window around the predicted position. The fit is measured by averaging
the magnitude of the dot product between the ellipse normal and the image gradient.
This tracking technique is resistant to out-of-plane rotation, partial occlusion, and
variation in hair and skin colors.
Depth to the head is calculated from the disparity in the positions of its projections onto
the image planes of the stereo pair. To find this disparity, the image of the head
as captured by the left camera is used as a template and matched to a position
in the right image.
The correspondence is found through normalized correlation.
This process yields a discrete result. To achieve a sub-pixel measure of disparity,
parametric optical flow is computed across the match.
The head detector uses the same elliptical shape filter as the tracker.
When
applied at multiple scales to the entire image, this shape fitting is a computationally
expensive operation. To achieve real time detection rates, its search space must be
drastically reduced. This reduction is accomplished through a novel use of motion
and size cues. Motion is found by frame differencing. If a pixel's intensity changes
significantly from one frame to the next, it is considered moving. Size is found from
disparity in correspondence matches as described above.
Reducing the search space of the elliptical shape filter requires an assumption
about where the head is most likely to be. The assumption made in this system is
that the top of a head in motion produces a coherent line of pixels as wide as the
head itself, and that no other such line appears above the head. The detector needs
then to find the highest coherent line of moving pixels wide enough to be a human
head and apply an appropriately sized elliptical shape filter to the region under this
line. This assumption also helps reduce the detector's rate of false positives. An
unconstrained elliptical shape filter may detect an object in the background or one
22
whose size cannot possibly be that of a human head. Situations where this assumption
fails include when a large object is moving above the head and when the head is not
moving fast enough to register motion.
To find a coherent line of moving pixels wide enough to represent the top of a head
and to choose an appropriately sized ellipse for which to search, the detector must first
have knowledge of depth to the motion in the scene. Ideally, it needs to know the
distance to the highest moving pixel in each column of the image. Unfortunately,
the depth to an arbitrary pixel cannot always be found with confidence. Ambiguity
arises when attempts are made to find correspondences for image patches lacking
strong features.
To address this, the detector favors image patches around lower
moving pixels if they offer more promising correspondence properties, namely strong
edges in multiple directions.
Once the highest coherent line of moving pixels is found, the detector returns a
new head candidate, one which best fits an appropriately sized ellipse in the region
directly under the line. This candidate becomes the new tracking target if the tracker
is without one or if the tracker's current target is less fit with regard to elliptical
shape.
Finally, the three-dimensional coordinates of the head relative to the stereo pair
are transformed into pan and tilt directives for the two teleconferencing cameras in
HAL. The cameras steer to keep the subject's head within their field of view.
A flow chart of the system is shown in figure 3-1.

Figure 3-1: System Flow Chart. [Flow chart: the stereo images feed a detector (motion detection, normalized correlation, highest coherent line of motion, ellipse fitting) and a tracker (position prediction, ellipse fitting, normalized correlation, parametric optical flow); the better-fit candidate becomes the head position, whose coordinates are transformed and sent to the two steerable cameras.]
3.2
Room Layout
HAL is an intelligent office space. It monitors user activity via lapel microphones,
eight video cameras, and the states of equipment in the room. It can present material
to users through a sound system, two projector displays, and a television. A picture
of HAL is shown in figure 3-2. There is a couch against the wall. In front of this
couch is a coffee table upon which sits, at a height of 50cm, one of the two steerable
teleconferencing cameras. This camera can capture close-up frontal shots of people
sitting on the couch, but cannot tilt high enough to see the face of someone standing.
A second teleconferencing camera is located on top of a television beyond the end of
the couch at a height of 120cm. This camera obtains three-quarters views of people
sitting on the couch. The stereo camera pair is mounted high up on the wall opposite
the couch. It looks down towards the couch at an angle of 35° from horizontal.
3.3 Head Detection
The head detector looks for an elliptical shape in a search space constrained by
motion and depth cues. One drawback to this approach is that stationary heads are
invisible to the detector. In a teleconferencing scenario, however, it is reasonable to
assume that the head will move regularly. And if it does not, there is no need for the
cameras to steer to a new position. It should also be noted that head motion generally
accompanies tracking failure. If there is no motion, a head cannot be detected. But,
without motion, it is unlikely the tracker will fail. Except for situations involving
severe occlusion, the tracker robustly tracks stationary objects.

Figure 3-2: HAL Layout
3.3.1 Motion Detection
Occasionally the most naive approach is found to yield adequate results. This was the case
with motion detection. Simple frame differencing is used to find pixels corresponding
to moving objects. If a pixel's intensity changes significantly from one frame to the
next, it is considered moving.
$$E_t(x, y, t) = \frac{\delta}{\delta t} E(x, y, t) = E(x, y, t) - E(x, y, t - 1) \quad (3.1)$$

$$(|E_t(x, y, t)| > th_m) \leftrightarrow \mathrm{MOTION}(x, y, t) \quad (3.2)$$
The threshold value, $th_m$, is set well above the magnitude of noise in the image to
minimize false positives. This approach requires little computation and has minimal
latency. Other techniques, such as finding temporal zero crossings as in [8], require
25
smoothing in time which introduces significant latencies. The results of (3.2) can be
seen in figure 3-3.
Figure 3-3: Motion Detection Using Frame Differencing and Thresholding. The top
image is a raw frame of video. The bottom image shows the result of the motion
detection algorithm (3.2).
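In code, the test in (3.1) and (3.2) reduces to a thresholded absolute frame difference. A minimal sketch in Python with NumPy (illustrative only; the threshold value below is an assumed placeholder, since the text states only that $th_m$ sits well above the image noise):

```python
import numpy as np

def detect_motion(frame, prev_frame, th_m=20):
    """Frame differencing with a threshold, equations (3.1)-(3.2).
    th_m=20 is an assumed placeholder value."""
    # E_t(x, y, t) = E(x, y, t) - E(x, y, t - 1), computed in a signed type
    e_t = frame.astype(np.int16) - prev_frame.astype(np.int16)
    # A pixel is "moving" if its intensity changed by more than th_m.
    return np.abs(e_t) > th_m
```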
To constrain the search space for the shape filter, an assumption is made: The
top of a head in motion will create a coherent line of moving pixels as wide as the
head itself, and no other such line will appear above the head. There are, of course,
situations where this is not the case. For example, someone may wave their arms above
their head as in figure 3-6. Or, the head's velocity may not be great enough to create
a coherent line of motion. But, as mentioned above, the complementary nature of
tracking and continuous detection allows for a substantial degree of detection misses.
Given this assumption, the shape filter need only search the region under the
highest coherent line of moving pixels wide enough to represent the top of a head. To
further constrain the search space, the size of the ellipse can be set according to the
depth to the line. To find such a line, the depth to the highest moving pixel in each
column of the image must be calculated. This depth is used to adjust the width of a
search window through which to look for the highest coherent line of moving pixels.
Thus, moving objects too small to be heads are ignored. The line of motion they
create is too narrow. The white line in figure 3-4 shows the highest moving pixel in
each column of the image.
3.3.2 Determining Depth
Depth can be extracted from correspondences across the images of a stereo camera
pair. An object's depth, z, relates inversely to the disparity, d, in the position of its
projections onto the two image planes.

$$z = \frac{fb}{d} \quad (3.3)$$

Here $f$ is the focal length, and $b$ is the baseline, the distance between the two
cameras. The stereo pair used for this system has a horizontal baseline of 7cm.
Disparity is found by normalized correlation as in [7]. Given an image patch from
the left camera, this technique involves finding a corresponding image patch from the
right camera which maximizes the normalized correlation of the two.
$$(\xi, \eta) = \operatorname*{argmax}_{\xi, \eta} \frac{\sum_{y=j}^{j+h} \sum_{x=i}^{i+w} I_l(x, y)\, I_r(x + \xi, y + \eta)}{\sqrt{\sum_{y=j}^{j+h} \sum_{x=i}^{i+w} I_l(x, y)^2 \; \sum_{y=j}^{j+h} \sum_{x=i}^{i+w} I_r(x + \xi, y + \eta)^2}} \quad (3.4)$$
Here, (i, j) is the position of the bottom-left corner of the patch taken from the
left image, $w$ and $h$ are the width and height of the patch, and $(\xi, \eta)$ is the disparity. For a calibrated stereo pair with horizontal baseline, the vertical disparity, $\eta$, is
zero. For a well aligned, but uncalibrated stereo pair, a two dimensional search for
correspondence is workable. The match will appear somewhere along the nearly horizontal epipolar line, and the vertical component of disparity can simply be ignored.
The only calibration absolutely necessary is correcting for any horizontal offset due
to convergence.
When an image is thought of as a vector of intensity values, (3.4) is simply an
inner product, the cosine of the angle between two vectors. The dimensionality of the
space is equal to the number of pixels in the patch. In terms of stochastic detection
theory, this technique is similar to that of using a matched filter to recognize a known
signal in a noisy channel.
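Concretely, for a well-aligned pair the search in (3.4) slides the left patch along a row of the right image and scores each offset. A rough Python/NumPy sketch, not the thesis implementation; the search range and sign convention depend on the rig and are assumptions here:

```python
import numpy as np

def disparity_by_ncc(left, right, i, j, w, h, max_disp=40):
    """Discrete disparity of a left-image patch by normalized
    correlation, equation (3.4), with vertical disparity taken as
    zero. max_disp is an assumed search range."""
    patch = left[j:j + h, i:i + w].astype(np.float64)
    best_xi, best_score = 0, -1.0
    for xi in range(max_disp + 1):
        if i + xi + w > right.shape[1]:
            break  # candidate patch would fall off the right image
        cand = right[j:j + h, i + xi:i + xi + w].astype(np.float64)
        # Cosine of the angle between the two patches viewed as vectors.
        score = (patch * cand).sum() / np.sqrt(
            (patch ** 2).sum() * (cand ** 2).sum())
        if score > best_score:
            best_xi, best_score = xi, score
    return best_xi  # refined to sub-pixel accuracy in section 3.4.1
```

With the 7cm baseline, a match at disparity $d$ then gives depth directly from (3.3): $z = fb/d$.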
Of interest to the detector, as mentioned above, is the depth to the highest moving
pixel in each column of the image. The detector uses this depth information to select
a coherent line of moving pixels wide enough to be the top of a head. The elliptical
shape filter is then applied to the region of the image under this line.
Unfortunately, depth cannot be accurately calculated at every point desired. It is
difficult to find, with confidence, the correspondence of an image patch lacking strong
features. Correspondences found for patches containing high contrast edges in multiple directions are generally more accurate. The detection system, when calculating
the depth to the highest moving pixel in each column, will choose a patch centered
about a lower pixel if that patch contains stronger features. The black squares in the
top image of figure 3-4 represent the image patches chosen by the detector. The black
squares in the bottom image show the corresponding patches found by normalized
correlation. The black line in the top image represents the depth to each patch. The
higher the line, the closer the patch is to the stereo pair.
To measure the strength of the features within an image patch, a principal component analysis is applied to the set of image gradients of the patch. In the case of
an image patch containing strong edges in multiple directions, both components will
have high energy. The energies of the principal components are the eigenvalues of the
image gradient's covariance matrix
$$\frac{1}{(w+1)(h+1)} \sum_{y=j}^{j+h} \sum_{x=i}^{i+w} \begin{bmatrix} E_x(x,y)^2 & E_x(x,y)\,E_y(x,y) \\ E_x(x,y)\,E_y(x,y) & E_y(x,y)^2 \end{bmatrix} \quad (3.5)$$

where

$$E_x(x, y) = \frac{\delta}{\delta x} E(x, y) = \frac{1}{2}[E(x+1, y) - E(x, y) + E(x+1, y+1) - E(x, y+1)] \quad (3.6)$$

and

$$E_y(x, y) = \frac{\delta}{\delta y} E(x, y) = \frac{1}{2}[E(x, y+1) - E(x, y) + E(x+1, y+1) - E(x+1, y)] \quad (3.7)$$

Figure 3-4: Calculation of Depth to High Motion in Each Column. The aspect ratios
of the images as they are produced by the stereo pair have been preserved in this
figure to illustrate the true size of the correspondence templates. The stereo camera
produces two line interlaced frames, each with a resolution of 320 x 120. All other
images in this document have been resized for clarity.
The measure of feature strength used by the detector is the energy of the smaller
component. This criterion was independently derived by Shi and Tomasi [13].
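In code, the criterion amounts to building the 2 x 2 covariance matrix of (3.5) from the discrete gradients (3.6) and (3.7) and returning its smaller eigenvalue. A NumPy sketch (illustrative, not the thesis code):

```python
import numpy as np

def feature_strength(patch):
    """Smaller eigenvalue of the gradient covariance matrix (3.5);
    large values indicate strong edges in multiple directions."""
    p = patch.astype(np.float64)
    # 2x2-averaged first differences, as in (3.6) and (3.7).
    ex = 0.5 * (p[:-1, 1:] - p[:-1, :-1] + p[1:, 1:] - p[1:, :-1])
    ey = 0.5 * (p[1:, :-1] - p[:-1, :-1] + p[1:, 1:] - p[:-1, 1:])
    cov = np.array([[np.mean(ex * ex), np.mean(ex * ey)],
                    [np.mean(ex * ey), np.mean(ey * ey)]])
    return np.linalg.eigvalsh(cov).min()
```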
There are tradeoffs involved in choosing the size of the image patch for which to
find a correspondence. Small patches are computationally efficient and can be more
accurate when there is considerable variation in depth. However, if the patches are
made too small, ambiguity arises in the match. A patch size of nine by nine pixels
was found to offer a good balance.
3.3.3 Finding a Coherent Line of Moving Pixels
Once depth has been determined to motion in each column of the image, the detector
must find the highest coherent line of moving pixels wide enough to be a human
head.
To do this, the detector uses the depth information to pick a 14cm wide
window through which to look for a coherent line of moving pixels. The baseline of
the stereo pair used in this system is 7cm, and since scale is proportional to disparity,
the window need simply be twice as wide as the disparity. This proportionality can
be derived from the projection equation which relates the width of an object, w, to
the width of its projection, w'.
$$w' = \frac{f}{z} w \quad (3.8)$$

Applying equation (3.3) yields

$$w' = \frac{w}{b} d \quad (3.9)$$
A line of moving pixels is considered coherent if its vertical variance is below a
threshold. Let h(x) be the highest moving pixel in each column, d(x) be the disparity
of that high motion, and C be the set of columns whose windows satisfy the coherency
constraint.
$$\mu_i = \frac{1}{2d(i)+1} \sum_{x=i-d(i)}^{i+d(i)} h(x) \quad (3.10)$$

$$\sigma_i^2 = \frac{1}{2d(i)+1} \sum_{x=i-d(i)}^{i+d(i)} \left(h(x) - \mu_i\right)^2 \quad (3.11)$$

$$(\sigma_i^2 < th_v\, d(i)^2) \leftrightarrow (i \in C) \quad (3.12)$$
The detector finds the column in $C$ whose window has the largest average $\mu_i$.

$$\operatorname*{argmax}_{i \in C}\; \mu_i \quad (3.13)$$
The elliptical shape filter is applied under this highest coherent line of moving
pixels. The thick white line in figure 3-5 shows the highest 14cm wide coherent line
of moving pixels. In thin white, the best fit ellipse under this line is shown.
Figure 3-5: Highest Line of Coherent Motion Wide Enough to be a Head Where the
Resulting Candidate is a Correct Detection
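Equations (3.10) through (3.13) combine into a single scan over columns: check the variance of the top-of-motion heights inside each column's disparity-scaled window, and among the coherent columns keep the one with the largest mean. A sketch under assumed conventions (in particular the value of the variance threshold $th_v$, which the thesis does not give):

```python
import numpy as np

def best_coherent_column(h, d, th_v):
    """h[x]: highest moving pixel per column; d[x]: its disparity.
    Returns the column satisfying (3.12) with the largest window
    mean (3.13), or None if no window is coherent."""
    best_i, best_mu = None, -np.inf
    for i in range(len(h)):
        r = int(d[i])
        if r <= 0 or i - r < 0 or i + r >= len(h):
            continue  # the 14cm window must fit inside the image
        window = h[i - r:i + r + 1]
        mu = window.mean()                 # (3.10)
        var = ((window - mu) ** 2).mean()  # (3.11)
        if var < th_v * d[i] ** 2 and mu > best_mu:  # (3.12)-(3.13)
            best_i, best_mu = i, mu
    return best_i
```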
3.3.4 Ellipse Fitting
The detector locates heads by searching for an elliptical shape 21cm wide and 25cm
tall. The measure of elliptical fit used, $\phi$, is the average of the dot product magnitude
between the ellipse normal and the image gradient. To reduce the attraction towards
isolated strong edges, the dot product magnitude is clipped above a threshold, $th_c$.

$$\phi(x, y, \sigma) = \frac{1}{N_\sigma} \sum_{i=1}^{N_\sigma} \min\!\left(th_c,\; \left| n_\sigma(i) \cdot \begin{bmatrix} E_x(x + s_{x\sigma}(i),\, y + s_{y\sigma}(i)) \\ E_y(x + s_{x\sigma}(i),\, y + s_{y\sigma}(i)) \end{bmatrix} \right|\right) \quad (3.14)$$

For an ellipse of width $\sigma$, $N_\sigma$ is the number of pixels along the perimeter, $n_\sigma(i)$ is
the normal at the $i$th perimeter pixel, and $(s_{x\sigma}(i), s_{y\sigma}(i))$ is the position of the $i$th
perimeter pixel, relative to the center of the ellipse. The height of the ellipse is set at
$1.2\sigma$. Except for the clipping, this is the same measure used by Birchfield in [2].
The detector computes $\phi$ once for every column making up the highest coherent
line of moving pixels. The size of the ellipse is set using the disparity measurements.
The width of a head is assumed to be 21cm, or 3d(i), and the height, 25cm, or 3.6d(i).
Since the line of motion is taken to be the top of a head, the ellipse is placed below and
tangent to the line at vertical position of h(i) - 1.8d(i). The final output candidate
of the detector is the ellipse which had the best fit.
If the tracker is currently without a target, the candidate found by the detector is
tracked. Otherwise the elliptical fit of the candidate is compared to the elliptical fit
of the target being tracked. If it is greater, the tracker switches attention to the new
candidate. Figure 3-6 shows, in white, the candidate found by the detector. However,
the elliptical fit of this false positive is less than that of the actual head being tracked,
the black ellipse. The false positive is ignored.
Figure 3-6: Highest Line of Coherent Motion Wide Enough to be a Head Where the
Resulting Candidate is a False Positive. The tracking target is shown in black.
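The fit measure (3.14) itself can be sketched as follows. For brevity the perimeter is sampled at a fixed number of points rather than at every perimeter pixel, and the gradient images are assumed precomputed; this is an illustration, not the thesis code:

```python
import numpy as np

def ellipse_fit(grad_x, grad_y, cx, cy, sigma, th_c, n_points=64):
    """Average clipped |normal . gradient| around an ellipse of width
    sigma and height 1.2*sigma centered at (cx, cy), as in (3.14)."""
    t = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    a, b = sigma / 2.0, 1.2 * sigma / 2.0   # semi-axes
    sx, sy = a * np.cos(t), b * np.sin(t)   # perimeter offsets
    # Outward normal of an axis-aligned ellipse, then normalized.
    nx, ny = np.cos(t) / a, np.sin(t) / b
    norm = np.hypot(nx, ny)
    nx, ny = nx / norm, ny / norm
    xs = np.clip((cx + sx).astype(int), 0, grad_x.shape[1] - 1)
    ys = np.clip((cy + sy).astype(int), 0, grad_x.shape[0] - 1)
    dots = np.abs(nx * grad_x[ys, xs] + ny * grad_y[ys, xs])
    return np.minimum(th_c, dots).mean()    # clipping as in (3.14)
```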
3.4 Head Tracking
The tracker uses a simple constant velocity model to predict the new position of the
head. It then searches a region around the predicted position for a best fit ellipse.
32
The size of the region is constrained linearly with disparity under the assumption that
human head acceleration is limited. A range of ellipse sizes, 14-21cm, is used to allow
for changes in depth from one frame to the next and slight variation in curvature due
to head rotation. Size is determined from the depth calculation made in the previous
frame.
The fact that the detector and tracker use the same measure of elliptical fit provides for an alternate and perhaps simpler model of the detector-tracker synergy. For
every new frame, a space is constructed in which an elliptical shape is searched for.
The space is the union of a window around the predicted location of the head and a
region under the highest coherent line of moving pixels. The best fit ellipse in this
space is taken to be the new position of the head.
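Under this model, one frame of detector-tracker interaction is a single argmax over a merged candidate set. A schematic sketch; the function and variable names are hypothetical, not from the thesis:

```python
def track_step(phi, window_positions, detector_candidate=None):
    """phi scores elliptical fit at an image position. The best-fit
    ellipse over the union of the tracker's prediction window and the
    detector's candidate becomes the new head position."""
    candidates = list(window_positions)
    if detector_candidate is not None:
        candidates.append(detector_candidate)
    return max(candidates, key=phi)
```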
3.4.1 Determining Accurate Depth
It is important that the depth of the tracker's final coordinate output be accurate.
Small errors in depth may translate to significant pan and tilt errors in HAL's two
teleconferencing cameras. As presented in section 3.3.2, normalized correlation alone
is insufficient, for it gives discrete results. At a range of 300cm, a one pixel error in
disparity translates to a depth error of 30cm. After a normalized correlation match
is found, a sub-pixel disparity measurement is achieved by calculating a parametric
optical flow across the template from the left image and its matching template from
the right image. The parametric optical flow is purely translational. Figure 3-7 shows
the template of the head as taken from the left image and its corresponding template
from the right image. Pure translational flow constrains all flow vectors to be equal
and provides an elegant least squares solution to the brightness constraint equation
$$E_x u + E_y v + E_{lr} = 0 \quad (3.15)$$
Figure 3-7: Template of Head From Left and Right Image
Here, $E_{lr}$ is the change in intensity across the templates from the left and right images.
$u$ and $v$ represent the sub-pixel flow from the left to the right templates. Given a
disparity $(\xi, \eta)$, $E_x$, $E_y$, and $E_{lr}$ are calculated as follows.
$$E_x(x, y) = \frac{1}{4}\big[\,I_l(x+1, y) - I_l(x, y) + I_l(x+1, y+1) - I_l(x, y+1) + I_r(x+\xi+1, y+\eta) - I_r(x+\xi, y+\eta) + I_r(x+\xi+1, y+\eta+1) - I_r(x+\xi, y+\eta+1)\,\big]$$

$$E_y(x, y) = \frac{1}{4}\big[\,I_l(x, y+1) - I_l(x, y) + I_l(x+1, y+1) - I_l(x+1, y) + I_r(x+\xi, y+\eta+1) - I_r(x+\xi, y+\eta) + I_r(x+\xi+1, y+\eta+1) - I_r(x+\xi+1, y+\eta)\,\big]$$

$$E_{lr}(x, y) = \frac{1}{4}\big[\,I_r(x+\xi, y+\eta) - I_l(x, y) + I_r(x+\xi, y+\eta+1) - I_l(x, y+1) + I_r(x+\xi+1, y+\eta) - I_l(x+1, y) + I_r(x+\xi+1, y+\eta+1) - I_l(x+1, y+1)\,\big] \quad (3.16)$$
Each 2 x 2 pixel group in the template image gives a brightness constraint. The result
is the over-determined system

$$\begin{bmatrix} \mathbf{E}_x & \mathbf{E}_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\mathbf{E}_{lr} \quad (3.17)$$

$\mathbf{E}_x$, $\mathbf{E}_y$, and $\mathbf{E}_{lr}$ are column vectors containing gradient calculations from each 2 x 2
pixel group used by (3.16). For example,

$$\mathbf{E}_x = \begin{bmatrix} E_x(x, y) \\ E_x(x, y+1) \\ \vdots \\ E_x(x, y+h-1) \\ E_x(x+1, y) \\ E_x(x+1, y+1) \\ \vdots \\ E_x(x+w-1, y+h-2) \\ E_x(x+w-1, y+h-1) \end{bmatrix} \quad (3.18)$$
To solve for $u$ and $v$, a pseudo-inverse is used.

$$\begin{bmatrix} u \\ v \end{bmatrix} = -\left( \begin{bmatrix} \mathbf{E}_x^T \\ \mathbf{E}_y^T \end{bmatrix} \begin{bmatrix} \mathbf{E}_x & \mathbf{E}_y \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{E}_x^T \\ \mathbf{E}_y^T \end{bmatrix} \mathbf{E}_{lr} \quad (3.19)$$

The final disparity is $(\xi + u, \eta + v)$.
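In code, the least-squares solution (3.19) is a one-liner; the sketch below uses NumPy's `lstsq` in place of the explicit pseudo-inverse, which gives the same answer for this full-rank system (illustrative, not the thesis implementation):

```python
import numpy as np

def subpixel_flow(e_x, e_y, e_lr):
    """Translational optical flow across the matched left/right
    templates, equations (3.17)-(3.19). e_x, e_y, e_lr are flattened
    vectors of the gradient terms defined in (3.16)."""
    a = np.column_stack([e_x, e_y])  # the matrix [Ex Ey]
    # Least-squares solution of [Ex Ey] [u v]^T = -Elr.
    (u, v), *_ = np.linalg.lstsq(a, -np.asarray(e_lr), rcond=None)
    return u, v  # refined disparity is (xi + u, eta + v)
```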
3.5 Transforming Coordinates
As the head is tracked in the images from the stereo camera, and accurate measurements of disparity are made, pan and tilt directives must be calculated to drive the
movement of the teleconferencing cameras. This is done in three steps. First, the
real-world position of the head, in Cartesian coordinates relative to the stereo camera
pair, must be found. This position is then transformed into Cartesian coordinates
relative to each of the two teleconferencing cameras, using knowledge of the cameras'
relative orientations. Finally, these transformed Cartesian coordinates are mapped
to polar pan and tilt values to drive the movement of the teleconferencing cameras.
To transform the location and disparity, $(x, y, d)$, of the head being tracked in the
image to Cartesian coordinates relative to the stereo pair, $(x_{sp}, y_{sp}, z_{sp})$, the projection
equations are used [6].

$$x_{sp} = \frac{b}{d} x \quad (3.20)$$

$$y_{sp} = \frac{b}{d} y \quad (3.21)$$

$$z_{sp} = \frac{b}{d} f \quad (3.22)$$
The Cartesian coordinates are then multiplied by a rotation/translation matrix for
each of the two teleconferencing cameras in HAL, resulting in coordinates $(x_{tc}, y_{tc}, z_{tc})$
relative to each teleconferencing camera.

$$\begin{bmatrix} x_{tc} \\ y_{tc} \\ z_{tc} \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta\sin\phi & \sin\theta\cos\phi & \cos\theta\,\Delta x - \sin\theta\,\Delta z \\ 0 & \cos\phi & \sin\phi & \Delta y \\ -\sin\theta & -\cos\theta\sin\phi & \cos\theta\cos\phi & \sin\theta\,\Delta x + \cos\theta\,\Delta z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{sp} \\ y_{sp} \\ z_{sp} \\ 1 \end{bmatrix} \quad (3.23)$$

In the above equation, $\phi$ is the downward tilt angle of the stereo pair. $\Delta x$, $\Delta y$,
and $\Delta z$ represent the distance from the teleconferencing camera to the stereo pair.
This translation is described in the tilt-corrected coordinate frame of the stereo pair.
$\theta$ represents how far the teleconferencing camera is rotated in the xz-plane, again
relative to the tilt-corrected stereo pair.
Using inverse tangent relationships, the coordinates of the head relative to the
teleconferencing cameras are further transformed into pan and tilt directives. These
pan and tilt values are sent over a serial line to drive the movement of the cameras.

$$\text{pan} = \tan^{-1}\!\left(\frac{x_{tc}}{z_{tc}}\right) + \frac{\pi}{2}\,[1 - \operatorname{sign}(z_{tc})]\,\operatorname{sign}(x_{tc}) \quad (3.24)$$

$$\text{tilt} = \tan^{-1}\!\left(\frac{y_{tc}}{z_{tc}}\right) \quad (3.25)$$
The teleconferencing cameras used in this system have a 180° pan and 90° tilt range.
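Taken together, the three steps are short enough to sketch end to end. The matrix below mirrors the reconstruction of (3.23) above and should be read as an assumption, as should the use of `arctan2`, which folds the sign correction of (3.24) into a single call:

```python
import numpy as np

def head_to_pan_tilt(x, y, d, f, b, phi, theta, delta):
    """Stereo image position (x, y) and disparity d to pan/tilt for one
    steerable camera, following (3.20)-(3.25). phi: downward tilt of
    the stereo pair; theta: the camera's xz-plane rotation; delta:
    (dx, dy, dz) offset to the stereo pair. A sketch, not thesis code."""
    p_sp = np.array([x * b / d, y * b / d, f * b / d, 1.0])  # (3.20)-(3.22)
    c, s = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(phi), np.sin(phi)
    dx, dy, dz = delta
    m = np.array([  # rotation/translation matrix as reconstructed in (3.23)
        [c, -s * sp, s * cp, c * dx - s * dz],
        [0.0, cp, sp, dy],
        [-s, -c * sp, c * cp, s * dx + c * dz],
        [0.0, 0.0, 0.0, 1.0],
    ])
    xtc, ytc, ztc, _ = m @ p_sp
    pan = np.arctan2(xtc, ztc)    # (3.24), quadrant-corrected
    tilt = np.arctan(ytc / ztc)   # (3.25)
    return pan, tilt
```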
3.6 Equipment
The stereo camera pair used is the STH-V2 made by Videre Design, figure 3-8. It can
output left and right video signals at a resolution of 320 x 240 or one interlaced signal
combining 320 x 120 subsampled images. The latter mode is used for this thesis. The
lenses used give the camera a 39° field of view.
Figure 3-8: Stereo Camera Pair
The stereo images are captured using a Matrox Meteor frame grabber and processed on a 600MHz Pentium III. The steerable teleconferencing cameras used are
Sony EVI-D30s.
Figure 3-9: Steerable Teleconferencing Camera
Chapter 4
Results
4.1 Head Detection and Tracking Results
It is difficult to quantitatively assess the behavior of the head detection and tracking
system. Statistics such as deviation from ground truth, rate of false positives, and
mean time to detection all depend greatly on the situation. How fast, how often, and
in what direction does the subject's head move? Is the head ever severely occluded?
What types of background clutter are present? How often does motion occur above
the head? Statistics for a system such as this are only meaningful when collected over
hundreds of real-world trials. And even then, subjects must be carefully instructed to
act naturally, to neither intentionally try to fool the system nor be overly cautious.
In light of these difficulties, this section instead provides a qualitative assessment
of the system. Situations are presented which illustrate both the system's strengths
and weaknesses. And comparisons are drawn to other methods considered during the
development of the system.
4.1.1 Continuous Detection
Figure 4-1 illustrates the success and importance of continuous detection. Here, as
the head moves downward in the image, the elliptical tracker is held fast by the high
contrast edge between the couch and the back wall and loses the curved bottom edge
of the jaw. By the sixth frame, it is following the front of the hair line rather than
the perimeter of the face. At this point, the tracker's hold on the head is in jeopardy.
Further downward movement will likely cause the tracker to lose the head entirely.
In the seventh frame, however, the candidate presented by the continuously running
detector is better fit than the current target and the system is restabilized.
Another benefit of continuous detection is that it eliminates the need for initialization. Many head tracking systems ignore the issue of initialization. Others have
manual procedures. The focus during the design of this system was always kept on
usability. Figure 4-2 shows each frame of an initialization sequence. In the first frame
of this sequence, the tracker is without a target. By the second frame, enough of
the torso has come into the frame to register motion. The detector notices this and
presents the torso, as its best fit candidate, to the tracker. During the next four
frames, the tracker, having no better target, follows the torso. When enough of the
head comes into the frame, the continuously running detector recognizes that the
head is a better fit target than the torso, and switches the tracker's attention to the
head.
4.1.2 Occlusion
The nature of the elliptical tracker makes the system resistant to partial occlusion.
Figure 4-3 shows every fourth frame of a sequence in which the tracker is unaffected
by arms occluding the stereo camera's view of the head. This resistance is a result
of taking the average edge strength around the perimeter of the ellipse.
Objects
partially occluding the head generally leave enough of the head's edges visible to
keep this average high. In fact, human heads are rarely the exact shape for which
the ellipse filter is searching. Close examination of the position of the ellipse as the
system is tracking reveals that it is usually following just one or two high contrast
curves and not the entire perimeter of the head.
Figure 4-1: Consecutive Frames of a Sequence Illustrating the Benefits of Continuous
Detection. The frame order in this sequence and all others in this document is left
to right, top to bottom.
Template based trackers can easily fail under situations of occlusion. The occluding object often pulls the template off the target object. To illustrate this, figures
4-4 and 4-5 show consecutive frames of an occlusion sequence. In the first sequence,
an elliptical tracker is used. The occluding object is properly ignored. In the second
sequence, a template tracker is used and is pulled off the subject's head by the occluding hand. Template trackers determine the new position of an image patch by
computing a normalized correlation (3.4).
4.1.3 Out-of-plane Rotation
The elliptical shape tracker was also chosen for its ability to handle out-of-plane head
rotation. Techniques such as template tracking that follow patterns of brightness fail
upon object rotation when the pattern becomes self occluded. In most scenarios, the
assumption that the head will not rotate significantly is invalid. The elliptical tracker
instead relies upon the shape of the perimeter of the head as projected onto the image
plane. This shape is nearly elliptical, regardless of head rotation. Figure 4-6 shows a
rotation sequence through which the tracker remains correctly locked on the head.
4.1.4 Head Decoys
Figure 4-7 shows the results of a situation contrived to demonstrate irrecoverable
system failure. This can happen when the tracker locks on to an object whose elliptical
shape is far better fit than a human head. As long as the elliptical object is visible,
the tracker will never switch its attention away from it. In this figure, the black ellipse
shows the tracking target. The white ellipse shows the detector candidate. A high
contrast drawing of a head sized ellipse was made as a decoy to steal the attention of
the tracker. In the first six frames of the sequence, the tracker correctly follows the
head. In the seventh frame, the drawing is moved slightly. The detector notices the
decoy and determines that it is a better-fit ellipse than the head being tracked. At
this point, the tracker switches its attention to the drawing. In the last three frames,
although the detector is finding the true human head, the elliptical fit of the decoy is
far stronger and holds the attention of the tracker.
Although this situation is contrived, it illustrates a major weakness. The system
detects and tracks ellipses, not heads.
Objects in the background that fit
ellipses better than human heads do are detriments to the system. If these decoys
are moved or if a tracked head passes directly in front of them, the system can fail.
4.1.5 Accuracy of Depth Calculation
Section 3.4.1 describes a technique for accurately calculating depth to the head. After
a discrete disparity value is found via normalized correlation, a sub-pixel shift is
calculated across the matching templates from the left and right images. This shift
is found using parametric optical flow and is added to the preliminary discrete value
to obtain a more accurate disparity. To test the validity of this approach, a sequence
was taken of cyclical head movement towards and away from the stereo pair, figure
4-8. The depth determined by the tracker is plotted in figure 4-9. For comparison, the
plot also shows with a dotted line the discrete output of the normalized correlation
calculation alone.
Accurate disparity calculations are critical in a teleconferencing scenario. In HAL,
the teleconferencing cameras' two views of the couch are nearly orthogonal to that of
the stereo pair. A slight error in depth can translate to a significant error in the pan
or tilt angle of a teleconferencing camera. At frame 110 of the depth test sequence,
figure 4-9 shows a discrepancy of 15cm between the depth calculated with and without
parametric optical flow. If a teleconferencing camera were aiming for a tight shot of
the head, an error of this magnitude could result in undesirable cropping of the
head. Measuring sub-pixel disparities increases the accuracy of depth calculations by
roughly one order of magnitude and is a critical feature of this system.
4.2 Teleconferencing Results
The major drawback to the cameras used in this system is that their maximum drive
speed is only 80° per second. When the subject is close and moving laterally, the
cameras can take on the order of seconds to move to a new position. Hence, they
cannot react quickly to a head moving out of their field of view, despite the fact that
real-time tracking data is available. Aside from this caveat, the system works well.
In most situations, the cameras provide well-centered close-up shots of the head.
Figure 4-2: Consecutive Frames of an Initialization Sequence
Figure 4-3: Every Fourth Frame of an Occlusion Sequence.
Figure 4-4: Consecutive Frames of an Occlusion Sequence Using Ellipse Tracking.
Figure 4-5: Consecutive Frames of an Occlusion Sequence Using Template Tracking.
Figure 4-6: Every Tenth Frame of a Rotation Sequence
Figure 4-7: Every Tenth Frame of a Decoy Sequence
Figure 4-8: Every Tenth Frame of a Depth Test Sequence. The first image in this
sequence is frame 50. The last is frame 130. The image templates used in the
correspondence calculation are shown in the bottom left corner of each frame. A plot
of depth to the head in this sequence can be found in figure 4-9
[Plot: depth to the head in centimeters (vertical axis, 220 to 280) versus Frame Number (horizontal axis, 40 to 140).]
Figure 4-9: Plot of Depth to Head Over Time. The dotted line plots depth as derived
from the discrete output of the normalized correlation calculation alone. The solid line
plots depth after the sub-pixel optical flow results have been added to the disparity
calculation. The video sequence corresponding to this plot can be seen in figure 4-8
Chapter 5
Conclusion
A system has been developed which detects and tracks the head of a subject moving about within HAL, an intelligent environment. It monitors activity in the room
through a stereo camera pair. The detector works by looking for an elliptical shape
in a search space constrained by motion and size cues. The tracker follows the elliptical shape until the detector presents one which is better fit. Depth to the head is
calculated using normalized correlation and refined in accuracy by determining the
parametric optical flow across the matched image templates from the left and right
cameras.
To test and demonstrate the system, an automated teleconferencing application
was developed. The three-dimensional coordinates of the subject's head are transformed into polar coordinates to drive the pan and tilt of two steerable cameras.
The test was a success. Although the steerable cameras move slowly and cannot
keep up with a quickly moving head, they eventually center on the head when it
comes to rest. The system is robust against partial occlusion, rotation, changes in
lighting, and variation in hair and skin color.
5.1 Future Work
One clear path for future work is to extend the system to support the detection and
tracking of multiple heads. There is nothing in the design of the current system to
prevent such an extension. Rather than tracking the best fit ellipse in the scene, the
system could simply track all candidates whose elliptical fits exceeded a threshold.
The detector, rather than switching the attention of the tracker, could instead spawn
new trackers. Additional logic could be added to retire trackers whose targets do
not move for long periods of time. As a side effect, extending the system to support
multiple heads may alleviate the problem illustrated in figure 4-7 whereby a decoy
can permanently steal the attention of the tracker.
As was mentioned in section 2, some of the more robust head detection and
tracking systems have been the result of multi-modal approaches. And the authors
of such systems generally claim that adding more modes increases performance. Two
modes which could be added to future versions of this system are skin color detection
and pattern recognition. Both of these were considered during the development of the
existing system but never implemented. Color was not available from the monochrome
stereo pair. And existing pattern recognition systems, such as CMU's [10], failed due
to the downward viewing angle of the stereo pair. There are, however, other cameras
in the room which could supply skin color and face detection information to the
system, narrowing the search for heads in the image from the stereo pair to epipolar
regions. Additionally, a pattern recognition module could be specially trained to
detect heads from the downward viewing angle of the stereo pair.
One problem with the existing system is the time it takes for the teleconferencing
cameras to drive to a new position. It might be advantageous for future versions
of the system to account for this delay. The current system, when commanding the
cameras to drive, relays the position of the head as of the time the command is issued.
The system might instead predict where the head will be by the time the cameras
finish the movement and relay that position.
Another solution to the slow camera problem is to digitally crop the head from
wider angle shots. In each frame captured by the teleconferencing cameras, head
position could be used to select a cropping region. The resolution of the resulting
sequence would, of course, be reduced. But, this reduction is already common practice
in teleconferencing scenarios due to limited bandwidth.
Bibliography
[1] D. Beymer and K. Konolige. Real-time tracking of multiple people using continuous detection. IEEE Conference on Computer Vision and Pattern Recognition,
1999.
[2] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. IEEE Conference on Computer Vision and Pattern Recognition, pages
232-237, Santa Barbara, CA, June 1998.
[3] M. Coen and K. Wilson. Learning spatial event models from multiple-camera
perspectives.
Annual Conference of the IEEE Industrial Electronics Society,
1999.
[4] T. Darrell, G. Gordon, M. Harville, and J. Woodfill. Integrated person tracking
using stereo, color, and pattern detection. IEEE Conference on Computer Vision
and Pattern Recognition, pages 601-609, Santa Barbara, CA, June 1998.
[5] M. Fleck, D. Forsyth, and C. Bregler. Finding naked people. European Conference
on Computer Vision, volume 2, pages 592-602, 1996.
[6] B. K. P. Horn. Robot Vision. The MIT Press, Cambridge, Massachusetts, 1986.
[7] K. Konolige. Small vision systems: Hardware and implementation. Eighth International Symposium on Robotics Research, Hayama, Japan, October 1997.
[8] S. McKenna and S. Gong. Tracking faces. International Conference on Automatic
Face and Gesture Recognition, Killington, Vermont, October 1996.
[9] C. Morimoto, D. Koons, A. Amir, and M. Flickner. Real-time detection of
eyes and faces. Workshop on Perceptual User Interfaces, San Francisco, CA,
November 1998.
[10] H. Rowley, S. Baluja, and T. Kanade.
Neural network-based face detection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38,
January 1998.
[11] H. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based
face detection. IEEE Conference on Computer Vision and Pattern Recognition,
Santa Barbara, CA, June 1998.
[12] B. Scassellati. Eye finding via face detection for a foveated, active vision system.
National Conference on Artificial Intelligence, Madison, WI, 1999.
[13] J. Shi and C. Tomasi. Good features to track. IEEE Conference on Computer
Vision and Pattern Recognition, pages 593-600, 1994.
[14] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time
tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-785, July 1997.