Design and Application of a Head Detection and Tracking System

by Jared Smith-Mickelson

Submitted to the Department of Electrical Engineering and Computer Science on May 22, 2000, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 2000.

Copyright 2000 Jared Smith-Mickelson. All rights reserved. The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Certified by: Trevor J. Darrell, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

Abstract

This vision system is designed to detect and track the head of a subject moving about within HAL, an intelligent environment. It monitors activity in the room through a stereo camera pair and detects heads using shape, motion, and size cues. Once a head is found, the system tracks its three-dimensional position in real time. To test and demonstrate the system, an automated teleconferencing application was developed. The head coordinates from the tracking system are transformed into pan and tilt directives to drive two steerable teleconferencing cameras. As the subject moves about the room, these cameras keep the head within their field of view.

Thesis Supervisor: Trevor J. Darrell
Title: Assistant Professor

Acknowledgments

Foremost, I would like to thank the three advisors I have had over the course of this thesis's development. Professor Charles Sodini and Associate Director Howard Shrobe were my instructors for a class entitled Technology Demonstration Systems. They gave me the opportunity to explore various areas of research, led me through the process of refining and preparing a proposal, and helped me begin work at an early stage. Assistant Professor Trevor Darrell was my research advisor during the later stages of my work and was influential on a technical level. Having in-depth knowledge of the field, he was able to discuss the finer details of my work with me, offering relevant suggestions and pointing me to key papers.

I would also like to thank Michael Coen. As founder and head of the HAL project, he played a critical role in helping integrate my work into the room. He was always available to answer questions, share new ideas, and give feedback.

Lastly, I wish to thank two other members of the HAL project, Krzysztof Gajos and Stephen Peters, both of whom willingly provided their support and were a pleasure to work with.

Contents

1 Introduction
2 Related Work
3 System Implementation
  3.1 Overview
  3.2 Room Layout
  3.3 Head Detection
    3.3.1 Motion Detection
    3.3.2 Determining Depth
    3.3.3 Finding a Coherent Line of Moving Pixels
    3.3.4 Ellipse Fitting
  3.4 Head Tracking
    3.4.1 Determining Accurate Depth
  3.5 Transforming Coordinates
  3.6 Equipment
4 Results
  4.1 Head Detection and Tracking Results
    4.1.1 Continuous Detection
    4.1.2 Occlusion
    4.1.3 Out-of-plane Rotation
    4.1.4 Head Decoys
    4.1.5 Accuracy of Depth Calculation
  4.2 Teleconferencing Results
5 Conclusion
  5.1 Future Work

List of Figures

1-1 Stereo Camera Pair's View of the Intelligent Room
3-1 System Flow Chart
3-2 HAL Layout
3-3 Motion Detection Using Frame Differencing and Thresholding
3-4 Calculation of Depth to High Motion in Each Column
3-5 Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a Correct Detection
3-6 Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a False Positive
3-7 Template of Head From Left and Right Image
3-8 Stereo Camera Pair
3-9 Steerable Teleconferencing Camera
4-1 Consecutive Frames of a Sequence Illustrating the Benefits of Continuous Detection
4-2 Consecutive Frames of an Initialization Sequence
4-3 Every Fourth Frame of an Occlusion Sequence
4-4 Consecutive Frames of an Occlusion Sequence Using Ellipse Tracking
4-5 Consecutive Frames of an Occlusion Sequence Using Template Tracking
4-6 Every Tenth Frame of a Rotation Sequence
4-7 Every Tenth Frame of a Decoy Sequence
4-8 Every Tenth Frame of a Depth Test Sequence
4-9 Plot of Depth to Head Over Time

Chapter 1
Introduction

The continued rise in processor power has recently dropped the computation time of many computer vision tasks to a point where they can be used in real-time systems. One area of computer vision research that has benefited from this advance and received considerable attention in the last few years is person tracking. Person tracking is a broad field encompassing the detection, tracking, and recognition of bodies, heads, faces, expressions, gestures, actions, and gaze directions. Applications include surveillance, human-computer interaction, teleconferencing, computer animation, virtual holography, and intelligent environments.

This thesis describes the implementation of a head detection and tracking system. To test and demonstrate the system, the output of the tracker is used to guide the movement of steerable cameras in an automated teleconferencing system. The head detection and tracking system was built as an extension to the AI Lab's intelligent environment known as HAL.
The development of HAL is a continuing effort to create a space in which humans can interact naturally, through voice and gesture, with computer agents that control the room's appliances and software applications. The tracking system monitors activity in the room through a monochrome stereo camera pair mounted high on a wall opposite a couch. The stereo pair's view of HAL is shown in Figure 1-1.

Figure 1-1: Stereo Camera Pair's View of the Intelligent Room. The top image is from the left camera. The bottom is from the right.

The system takes a multi-modal approach to detection, using shape, motion, and size cues. At the core of the detector is an elliptical shape filter. Applying the filter at multiple scales to the entire image is costly. To achieve real-time detection, the system uses a novel combination of depth and motion information to narrow the search. The use of these cues also helps reduce the detector's rate of false positives.

Once a head is detected, the system tracks it using the elliptical shape filter, constrained by depth and a simple velocity model. The same head is tracked until the continuously running detector presents a better candidate. As a head is tracked, its depth is calculated using normalized correlation and refined by computing the parametric optical flow across the matching templates from the left and right images. This refinement is essential for accurate depth.

The three-dimensional coordinates of the head relative to the stereo pair are transformed into pan and tilt directives for HAL's two teleconferencing cameras. The result is an automated teleconferencing system whereby HAL's two steerable cameras keep the subject's head in their field of view.

Chapter 2
Related Work

The survey of existing systems presented in this section is by no means exhaustive. A comprehensive overview of the vast number of systems described in the literature is beyond the scope of this document. The systems mentioned here were chosen to illustrate the broad variety of approaches taken to the problem of head detection and tracking.

An earlier vision system for HAL is described by Coen and Wilson in [3]. By applying background differencing, skin color detection, and face detection algorithms to images from every camera in the room, the system builds hyper-dimensional feature vectors. These vectors are classified using previously trained statistical models of events. Events are qualitative in nature: someone is sitting on the couch, someone is standing in front of the couch, someone is lying on the couch. Background differencing is carried out on images from HAL's steerable teleconferencing cameras by synthesizing a background from a panoramic image constructed offline. To detect skin color, the algorithm described in [5] is used. This technique involves classifying each pixel as "skin" or "not skin" using empirically estimated Gaussian models in the log color-opponent space. The face detection algorithm used is the same as that used by Cog, the humanoid robot [12]. A face is detected when ratios between the average intensities of various spatial regions are satisfied.

All three of these modules are prone to error. Because the teleconferencing cameras do not rotate about their centers of projection, background differencing is unreliable, especially for regions physically close to the cameras. Backgrounds also tend to change with lighting and human use. The face detector works only for frontal views. And the skin color detector will find all skin-colored objects in the room, regardless of whether they belong to a person. Fortunately, these nonidealities are permissible, as the system has only to decide between a limited number of qualitative events. The system cannot, however, produce accurate coordinates of a subject's head.

Rowley et al., at CMU, have a neural network that detects frontal upright views of faces [10]. The network was trained using hand-labeled images of over one thousand faces taken from the Web. To refine the network, negative examples were also presented during training. The negative examples were chosen in a bootstrapping fashion by applying early versions of the network to natural images containing no faces. In a later paper [11], an addition to the system is described which allows it to detect faces with arbitrary in-plane rotation. The addition is a neural net router which determines and corrects the orientation of each candidate location before it is passed to the original system. This method, however, cannot be applied to out-of-plane rotation. CMU's system is limited to frontal views of faces. It is also computationally expensive and is generally applied only to static scenes. However, ten years from now, it may be feasible to run numerous neural nets in parallel, each trained to recognize a different orientation of the head, achieving a robust real-time continuous detection system for human heads.

McKenna et al. developed a neural network based on CMU's and applied it to both the detection and tracking of heads in dynamic scenes [8]. For detection, they use motion to constrain the search space of the expensive neural net computation. After grouping moving pixels into blobs, they estimate which part of each blob corresponds to a head and limit the search to these regions. The details of how they estimate the position of the head are left out, so the similarity of their approach to that described in section 3.3 is undetermined. For tracking, a Kalman filter is used to predict the position of the head. The neural network is applied to a small window around this predicted position. Its output is fed back into the Kalman filter as a measurement.

Another system which addresses both detection and tracking is described by Beymer et al. at SRI International [1]. Its continuous detector segments foreground objects by applying background differencing to depth map sequences. It then correlates person-shaped templates with detected foreground objects to present candidates to the tracker. The tracker uses a Kalman filter to predict position. Measurements are made by performing correlations with an intensity template and are fed back into the Kalman filter. To avoid drift, the position of the intensity template is re-centered using the person-shaped template. The shape of the person templates reflects the assumption, implied by Beymer et al., that people hold their arms at their sides and remain upright. The system is not designed to handle exceptions to this assumption.

Pfinder, developed at the Media Lab by Wren et al., tracks people by following the color and shape of blobs [14]. After extracting a user silhouette through background subtraction, the system segments it into blobs by clustering feature vectors made up of color and spatial information. The position of each blob is individually tracked and predicted using a Kalman filter. Pixels in a new frame of video are classified as background or as belonging to a particular blob using a maximum a posteriori probability approach. Connectivity constraints are added and smoothing operators are applied to improve the clarity of the blob segmentation. The system is designed to handle shadows and occlusion. Skin color is used to help label the blobs corresponding to the head and the hands.

Darrell et al. describe a system developed at Interval Research that integrates depth, color, and face detection modules to detect and track heads [4]. Using real-time depth sequences from a stereo system implemented on an FPGA, images are segmented into user silhouettes. For each silhouette, head candidates are collected from the three modules and integrated to produce a final head position. The depth module places candidate heads under the maxima of each silhouette. The color module searches for skin-colored regions within each silhouette in a manner derived from [5]. Candidates are placed in skin-colored regions that are high enough and of the right size to be heads. The face detection module uses CMU's neural network [10] and is initially run over the entire image. All resulting detections are presented as candidates. This process is slow. In order to present candidates continuously, the positions of skin color and depth candidates are locally tracked and presented by the face detector module if they have overlapped with a face detection hit in the recent past. As in [2], the failure modes of the modules used in this system are claimed to be nearly independent. When this is the case, increasing the number of modules used greatly increases the reliability of the system.

A wholly different approach was taken by Morimoto et al. at IBM Almaden Research Center [9]. Noting that human eyes reflect near-infrared light directly back towards the source, they built a camera which captures two images: one illuminated with on-axis infrared light and another illuminated with off-axis infrared light. Regions that are bright in the on-axis image and dark in the off-axis image generally correspond to eyes. The positions of heads are extrapolated from the detected positions of the eyes. One drawback to the system is that since the two images are taken asynchronously, only static eyes can be detected.

Stan Birchfield, at Stanford, has a head tracker that combines outputs from an elliptical shape module and a color histogram module to correct the predicted position of the subject's head [2]. The elliptical shape module computes the average dot-product magnitude between the ellipse normal and the image gradient at candidate positions in a window around the predicted location. The second module computes the histogram intersection between the newly calculated histogram at each candidate position and a previously learned static model of the subject's hair and skin color. The claim is that the two modules have roughly orthogonal failure modes and therefore complement each other to produce a robust tracking system. Birchfield's system does not address the issue of initialization except to say that the image region used to define the subject's model histogram is set either manually or by the output of the elliptical shape module alone. In a cluttered environment, however, the ellipse module cannot be used to reliably find the head. In addition, the computation time needed to run the ellipse fitting across the entire image and at multiple scales is on the order of seconds. It is therefore unlikely that Birchfield's system can be used to reliably detect the heads of new subjects.

The systems summarized in this section were chosen to illustrate the wide variety of approaches taken to the problem of head detection and tracking. The use of color, shape, motion, depth, size, background differencing, depth background differencing, neural networks, ratio templates, and infrared reflectance has been demonstrated. Yet more systems have been built which use invasive techniques such as fiducials and headgear. These have been intentionally left out. The focus of HAL is to enable humans to interact with computer systems as they do with each other. Systems which force the user to conform to unnatural, unfamiliar, or obtrusive methods of interaction are not condoned. In the development of HAL, the goal is instead to design computer systems capable of interacting on a human level.

This thesis documents an extension to Birchfield's elliptical tracker [2]. The new system is designed to handle initialization through continuous real-time detection in the spirit of [1]. Motion and size cues have been incorporated to compensate for the unavailability of color. And stereo is used to compute accurate depth to the subject's head.

Chapter 3
System Implementation

3.1 Overview

Tracking systems have been designed to handle occlusion, out-of-plane rotation, changes in lighting, and deformation. But they are never foolproof. A usable system must expect and be capable of recovering from tracking failure. One way to meet this requirement is through the use of continuous real-time detection as described in [1]. If detection occurs in every frame, tracking failure, if recognized, can be immediately corrected. The head detector described in this thesis has been designed for real-time use and is run in parallel with the tracker. When the detector finds a better-fit candidate than the one currently being tracked, it triggers the tracker to switch attention to the new target. If the tracker is without a target, it will lock on to the first candidate found by the detector.

The head tracker is quite robust. Its low failure rate allows a high frequency of misses on the part of the detector. Misses cause no detrimental effects while the tracker is correctly following the head. When the tracker fails, it usually does so in conjunction with head motion. The detector, however, is most reliable under such circumstances and thereby works to complement the tracker's deficiencies. The failure modes of the detector and tracker are nearly mutually exclusive. This claim is similar to that of Birchfield's [2] mentioned in section 2.

To follow the head, the system employs an elliptical contour tracker. Head position is predicted using a simple velocity model. An ellipse is then best fit to the image gradient in a window around the predicted position. The fit is measured by averaging the magnitude of the dot product between the ellipse normal and the image gradient. This tracking technique is resistant to out-of-plane rotation, partial occlusion, and variation in hair and skin colors. A short sketch of this fit measure follows; the full details appear in section 3.3.4.
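The sketch below scores a candidate ellipse the way the text describes: it averages the magnitude of the dot product between the outward unit normal at points on the ellipse perimeter and the image gradient there. This is a minimal illustration, not the thesis code; sampling the perimeter at evenly spaced parameter values (rather than per perimeter pixel), the precomputed gradient arrays, and the clipping threshold `th_c` anticipated from section 3.3.4 are all assumptions filled in for completeness.

```python
import numpy as np

def ellipse_fit(grad_x, grad_y, cx, cy, width, th_c=np.inf, n_samples=64):
    """Average |dot(ellipse normal, image gradient)| around an ellipse of
    the given width centered at (cx, cy).  Height is 1.2 * width, as in
    section 3.3.4.  Dot products are clipped at th_c to reduce the pull
    of isolated strong edges.  Assumes the ellipse lies inside the image."""
    a, b = width / 2.0, 1.2 * width / 2.0            # semi-axes
    t = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    xs = (cx + a * np.cos(t)).astype(int)
    ys = (cy + b * np.sin(t)).astype(int)
    # Outward normal of an axis-aligned ellipse at parameter t.
    nx, ny = np.cos(t) / a, np.sin(t) / b
    norm = np.hypot(nx, ny)
    nx, ny = nx / norm, ny / norm
    dots = np.abs(nx * grad_x[ys, xs] + ny * grad_y[ys, xs])
    return np.minimum(dots, th_c).mean()
```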
Depth to the head is calculated from the disparity in the positions of its projections onto the image planes of the stereo pair. To find this disparity, the image of the head as captured by the left camera is used as a template and matched to a position in the right image. The correspondence is found through normalized correlation. This process yields a discrete result. To achieve a sub-pixel measure of disparity, parametric optical flow is computed across the match.

The head detector uses the same elliptical shape filter as the tracker. When applied at multiple scales to the entire image, this shape fitting is a computationally expensive operation. To achieve real-time detection rates, its search space must be drastically reduced. This reduction is accomplished through a novel use of motion and size cues. Motion is found by frame differencing. If a pixel's intensity changes significantly from one frame to the next, it is considered moving. Size is found from the disparity of correspondence matches as described above.

Reducing the search space of the elliptical shape filter requires an assumption about where the head is most likely to be. The assumption made in this system is that the top of a head in motion produces a coherent line of pixels as wide as the head itself, and that no other such line appears above the head. The detector needs then to find the highest coherent line of moving pixels wide enough to be a human head and apply an appropriately sized elliptical shape filter to the region under this line. This assumption also helps reduce the detector's rate of false positives. An unconstrained elliptical shape filter may detect an object in the background or one whose size cannot possibly be that of a human head. Situations where this assumption fails include when a large object is moving above the head and when the head is not moving fast enough to register motion.

To find a coherent line of moving pixels wide enough to represent the top of a head, and to choose an appropriately sized ellipse for which to search, the detector must first have knowledge of depth to the motion in the scene. Ideally, it needs to know the distance to the highest moving pixel in each column of the image. Unfortunately, the depth to an arbitrary pixel cannot always be found with confidence. Ambiguity arises when attempts are made to find correspondences for image patches lacking strong features. To address this, the detector favors image patches around lower moving pixels if they offer more promising correspondence properties, namely strong edges in multiple directions.

Once the highest coherent line of moving pixels is found, the detector returns a new head candidate, the one which best fits an appropriately sized ellipse in the region directly under the line. This candidate becomes the new tracking target if the tracker is without one or if the tracker's current target is less fit with regard to elliptical shape.

Finally, the three-dimensional coordinates of the head relative to the stereo pair are transformed into pan and tilt directives for the two teleconferencing cameras in HAL. The cameras steer to keep the subject's head within their field of view. A flow chart of the system is shown in Figure 3-1; a per-frame sketch of the same loop follows the figure.

Figure 3-1: System Flow Chart. Stereo images feed the detector (motion detection, depth from normalized correlation, highest coherent line of motion, ellipse fitting) and the tracker (position prediction, ellipse fitting, normalized correlation, parametric optical flow); the better-fit candidate becomes the head position, which is transformed into coordinates for each steerable camera.
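The detect/track/arbitrate loop of Figure 3-1 can be summarized in a few lines. This is a schematic sketch only: the `Candidate` type and the `detect_head` and `track_head` parameters are hypothetical stand-ins for the modules detailed in sections 3.3 and 3.4, and the depth refinement and camera steering of sections 3.4.1 and 3.5 would follow the arbitration step.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    x: int        # ellipse center, image coordinates
    y: int
    width: float  # ellipse width in pixels
    fit: float    # elliptical fit score, equation (3.14)

def process_frame(left, right, target, detect_head, track_head):
    """One iteration of the loop: run detector and tracker in parallel,
    then keep whichever ellipse fits better.  detect_head(left, right)
    and track_head(left, target) each return a Candidate or None."""
    candidate = detect_head(left, right)
    if target is None:
        return candidate                    # lock on to the first detection
    target = track_head(left, target)
    # The detector wins whenever it presents a better-fit ellipse.
    if candidate is not None and (target is None or candidate.fit > target.fit):
        return candidate
    return target
```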
3.2 Room Layout

HAL is an intelligent office space. It monitors user activity via lapel microphones, eight video cameras, and the states of equipment in the room. It can present material to users through a sound system, two projector displays, and a television. A picture of HAL is shown in Figure 3-2. There is a couch against the wall. In front of this couch is a coffee table upon which sits, at a height of 50cm, one of the two steerable teleconferencing cameras. This camera can capture close-up frontal shots of people sitting on the couch, but cannot tilt high enough to see the face of someone standing. A second teleconferencing camera is located on top of a television beyond the end of the couch at a height of 120cm. This camera obtains three-quarters views of people sitting on the couch. The stereo camera pair is mounted high on the wall opposite the couch. It looks down towards the couch at an angle of 35° from horizontal.

Figure 3-2: HAL Layout

3.3 Head Detection

The head detector looks for an elliptical shape in a search space constrained by motion and depth cues. One drawback to this approach is that stationary heads are invisible to the detector. In a teleconferencing scenario, however, it is reasonable to assume that the head will move regularly. And if it does not, there is no need for the cameras to steer to a new position. It should also be noted that head motion generally accompanies tracking failure. If there is no motion, a head cannot be detected. But, without motion, it is unlikely the tracker will fail. Except for situations involving severe occlusion, the tracker robustly tracks stationary objects.

3.3.1 Motion Detection

Occasionally the most naive approach is found to yield adequate results. This was the case with motion detection. Simple frame differencing is used to find pixels corresponding to moving objects. If a pixel's intensity changes significantly from one frame to the next, it is considered moving.

$$E_t(x, y, t) = \frac{\delta}{\delta t}E(x, y, t) = E(x, y, t) - E(x, y, t-1) \tag{3.1}$$

$$\left(|E_t(x, y, t)| > th_m\right) \leftrightarrow \mathrm{MOTION}(x, y, t) \tag{3.2}$$

The threshold value, $th_m$, is set well above the magnitude of noise in the image to minimize false positives. This approach requires little computation and has minimal latency. Other techniques, such as finding temporal zero crossings as in [8], require smoothing in time, which introduces significant latency. The results of (3.2) can be seen in Figure 3-3.

Figure 3-3: Motion Detection Using Frame Differencing and Thresholding. The top image is a raw frame of video. The bottom image shows the result of the motion detection algorithm (3.2).
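As a concrete illustration, equations (3.1) and (3.2) amount to a one-line test per pixel. The sketch below assumes 8-bit grayscale frames as NumPy arrays; the threshold value shown is a placeholder, since the thesis does not state the one actually used.

```python
import numpy as np

def motion_mask(frame, prev_frame, th_m=20):
    """Equations (3.1) and (3.2): a pixel is 'moving' when its intensity
    changes by more than th_m between consecutive frames.  th_m must sit
    well above the image noise level; 20 is an illustrative value."""
    e_t = frame.astype(np.int16) - prev_frame.astype(np.int16)   # (3.1)
    return np.abs(e_t) > th_m                                    # (3.2)
```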
To constrain the search space for the shape filter, an assumption is made: the top of a head in motion will create a coherent line of moving pixels as wide as the head itself, and no other such line will appear above the head. There are, of course, situations where this is not the case. For example, someone may wave their arms above their head, as in Figure 3-6. Or, the head's velocity may not be great enough to create a coherent line of motion. But, as mentioned above, the complementary nature of tracking and continuous detection allows for a substantial degree of detection misses.

Given this assumption, the shape filter need only search the region under the highest coherent line of moving pixels wide enough to represent the top of a head. To further constrain the search space, the size of the ellipse can be set according to the depth to the line. To find such a line, the depth to the highest moving pixel in each column of the image must be calculated. This depth is used to adjust the width of a search window through which to look for the highest coherent line of moving pixels. Thus, moving objects too small to be heads are ignored; the line of motion they create is too narrow. The white line in Figure 3-4 shows the highest moving pixel in each column of the image.

3.3.2 Determining Depth

Depth can be extracted from correspondences across the images of a stereo camera pair. An object's depth, z, relates inversely to the disparity, d, in the positions of its projections onto the two image planes.

$$z = \frac{fb}{d} \tag{3.3}$$

Here f is the focal length, and b is the baseline, the distance between the two cameras. The stereo pair used for this system has a horizontal baseline of 7cm.

Disparity is found by normalized correlation as in [7]. Given an image patch from the left camera, this technique involves finding a corresponding image patch from the right camera which maximizes the normalized correlation of the two.

$$\underset{\xi,\,\eta}{\operatorname{argmax}} \; \frac{\sum_{y=j}^{j+h}\sum_{x=i}^{i+w} I_l(x,y)\, I_r(x+\xi,\, y+\eta)}{\sqrt{\sum_{y=j}^{j+h}\sum_{x=i}^{i+w} I_l(x,y)^2 \;\; \sum_{y=j}^{j+h}\sum_{x=i}^{i+w} I_r(x+\xi,\, y+\eta)^2}} \tag{3.4}$$

Here, (i, j) is the position of the bottom-left corner of the patch taken from the left image, w and h are the width and height of the patch, and (ξ, η) is the disparity. For a calibrated stereo pair with a horizontal baseline, the vertical disparity, η, is zero. For a well-aligned but uncalibrated stereo pair, a two-dimensional search for correspondence is workable. The match will appear somewhere along the nearly horizontal epipolar line, and the vertical component of disparity can simply be ignored. The only calibration absolutely necessary is correcting for any horizontal offset due to convergence.

When an image is thought of as a vector of intensity values, (3.4) is simply an inner product, the cosine of the angle between two vectors. The dimensionality of the space is equal to the number of pixels in the patch. In terms of stochastic detection theory, this technique is similar to using a matched filter to recognize a known signal in a noisy channel.

Of interest to the detector, as mentioned above, is the depth to the highest moving pixel in each column of the image. The detector uses this depth information to select a coherent line of moving pixels wide enough to be the top of a head. The elliptical shape filter is then applied to the region of the image under this line. Unfortunately, depth cannot be accurately calculated at every point desired. It is difficult to find, with confidence, the correspondence of an image patch lacking strong features. Correspondences found for patches containing high-contrast edges in multiple directions are generally more accurate. The detection system, when calculating the depth to the highest moving pixel in each column, will choose a patch centered about a lower pixel if that patch contains stronger features. The black squares in the top image of Figure 3-4 represent the image patches chosen by the detector. The black squares in the bottom image show the corresponding patches found by normalized correlation. The black line in the top image represents the depth to each patch. The higher the line, the closer the patch is to the stereo pair.

Figure 3-4: Calculation of Depth to High Motion in Each Column. The aspect ratios of the images as they are produced by the stereo pair have been preserved in this figure to illustrate the true size of the correspondence templates. The stereo camera produces two line-interlaced frames, each with a resolution of 320 x 120. All other images in this document have been resized for clarity.
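A direct, unoptimized rendering of the search in (3.4) is shown below. It scans a one-dimensional range of horizontal offsets, consistent with the near-horizontal epipolar assumption above; the search range and the sign convention of the offset (which depends on the camera arrangement) are assumptions of this sketch.

```python
import numpy as np

def find_disparity(left, right, i, j, w=9, h=9, d_range=range(0, 41)):
    """Brute-force normalized correlation search, equation (3.4), along a
    horizontal epipolar line.  (i, j) is the corner of a w-by-h patch in
    the left image; returns the integer offset maximizing the score.  The
    9x9 default patch size is the one section 3.3.2 found to balance cost
    against match ambiguity."""
    patch = left[j:j + h, i:i + w].astype(np.float64)
    p_norm = np.sqrt((patch ** 2).sum()) + 1e-12
    best_xi, best_score = 0, -np.inf
    for xi in d_range:
        if i + xi < 0 or i + xi + w > right.shape[1]:
            continue
        cand = right[j:j + h, i + xi:i + xi + w].astype(np.float64)
        score = (patch * cand).sum() / (p_norm * (np.sqrt((cand ** 2).sum()) + 1e-12))
        if score > best_score:
            best_xi, best_score = xi, score
    return best_xi
```

Depth then follows from (3.3) as z = fb/d. The maximization over integer offsets is what makes the result discrete, motivating the sub-pixel refinement of section 3.4.1.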
To measure the strength of the features within an image patch, a principal component analysis is applied to the set of image gradients of the patch. In the case of an image patch containing strong edges in multiple directions, both components will have high energy. The energies of the principal components are the eigenvalues of the image gradient's covariance matrix

$$\frac{1}{(w+1)(h+1)} \sum_{y=j}^{j+h}\sum_{x=i}^{i+w} \begin{bmatrix} E_x(x,y)^2 & E_x(x,y)\,E_y(x,y) \\ E_x(x,y)\,E_y(x,y) & E_y(x,y)^2 \end{bmatrix} \tag{3.5}$$

where

$$E_x(x,y) = \frac{\delta}{\delta x}E(x,y) = \frac{1}{2}\left[E(x+1,y) - E(x,y) + E(x+1,y+1) - E(x,y+1)\right] \tag{3.6}$$

and

$$E_y(x,y) = \frac{\delta}{\delta y}E(x,y) = \frac{1}{2}\left[E(x,y+1) - E(x,y) + E(x+1,y+1) - E(x+1,y)\right] \tag{3.7}$$

The measure of feature strength used by the detector is the energy of the smaller component. This criterion was independently derived by Shi and Tomasi [13].

There are tradeoffs involved in choosing the size of the image patch for which to find a correspondence. Small patches are computationally efficient and can be more accurate when there is considerable variation in depth. However, if the patches are made too small, ambiguity arises in the match. A patch size of nine by nine pixels was found to offer a good balance.
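The feature-strength test of (3.5)-(3.7) reduces to the smaller eigenvalue of a 2 x 2 symmetric matrix, which has a closed form. A minimal sketch, assuming a grayscale image as a NumPy array:

```python
import numpy as np

def feature_strength(img, i, j, w=9, h=9):
    """Smaller eigenvalue of the gradient covariance matrix (3.5) over a
    patch -- the Shi-Tomasi 'good features' criterion.  High values mean
    strong edges in multiple directions, hence reliable correspondence."""
    patch = img[j:j + h + 2, i:i + w + 2].astype(np.float64)
    # 2x2-average gradients, equations (3.6) and (3.7).
    ex = 0.5 * (patch[:-1, 1:] - patch[:-1, :-1] + patch[1:, 1:] - patch[1:, :-1])
    ey = 0.5 * (patch[1:, :-1] - patch[:-1, :-1] + patch[1:, 1:] - patch[:-1, 1:])
    a, b, c = (ex * ex).mean(), (ex * ey).mean(), (ey * ey).mean()
    # Smaller eigenvalue of [[a, b], [b, c]].
    return 0.5 * (a + c) - np.sqrt(0.25 * (a - c) ** 2 + b * b)
```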
3.3.3 Finding a Coherent Line of Moving Pixels

Once depth has been determined to the motion in each column of the image, the detector must find the highest coherent line of moving pixels wide enough to be a human head. To do this, the detector uses the depth information to pick a 14cm-wide window through which to look for a coherent line of moving pixels. The baseline of the stereo pair used in this system is 7cm, and since scale is proportional to disparity, the window need simply be twice as wide as the disparity. This proportionality can be derived from the projection equation, which relates the width of an object, w, to the width of its projection, w'.

$$w' = \frac{f}{z}w \tag{3.8}$$

Applying equation (3.3) yields

$$w' = \frac{w}{b}d \tag{3.9}$$

A line of moving pixels is considered coherent if its vertical variance is below a threshold. Let h(x) be the highest moving pixel in each column, d(x) be the disparity of that high motion, and C be the set of columns whose windows satisfy the coherency constraint.

$$\mu_i = \frac{1}{2d(i)+1}\sum_{x=i-d(i)}^{i+d(i)} h(x) \tag{3.10}$$

$$\sigma_i^2 = \frac{1}{2d(i)+1}\sum_{x=i-d(i)}^{i+d(i)} \left(h(x) - \mu_i\right)^2 \tag{3.11}$$

$$\left(\sigma_i^2 < th_v\, d(i)^2\right) \leftrightarrow (i \in C) \tag{3.12}$$

The detector finds the column in C whose window has the largest average $\mu_i$.

$$\underset{i \in C}{\operatorname{argmax}} \; \mu_i \tag{3.13}$$

The elliptical shape filter is applied under this highest coherent line of moving pixels. The thick white line in Figure 3-5 shows the highest 14cm-wide coherent line of moving pixels. In thin white, the best-fit ellipse under this line is shown.

Figure 3-5: Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a Correct Detection

3.3.4 Ellipse Fitting

The detector locates heads by searching for an elliptical shape 21cm wide and 25cm tall. The measure of elliptical fit used, φ, is the average of the dot product magnitude between the ellipse normal and the image gradient. To reduce the attraction towards isolated strong edges, the dot product magnitude is clipped above a threshold, $th_c$.

$$\phi(x, y, \sigma) = \frac{1}{N_\sigma}\sum_{i=1}^{N_\sigma} \min\!\left(th_c,\; \left| n_\sigma(i) \cdot \begin{bmatrix} E_x(x + s_{x_\sigma}(i),\; y + s_{y_\sigma}(i)) \\ E_y(x + s_{x_\sigma}(i),\; y + s_{y_\sigma}(i)) \end{bmatrix} \right| \right) \tag{3.14}$$

For an ellipse of width σ, $N_\sigma$ is the number of pixels along the perimeter, $n_\sigma(i)$ is the normal at the ith perimeter pixel, and $(s_{x_\sigma}(i), s_{y_\sigma}(i))$ is the position of the ith perimeter pixel, relative to the center of the ellipse. The height of the ellipse is set at 1.2σ. Except for the clipping, this is the same measure used by Birchfield in [2].

The detector computes φ once for every column making up the highest coherent line of moving pixels. The size of the ellipse is set using the disparity measurements. The width of a head is assumed to be 21cm, or 3d(i), and the height 25cm, or 3.6d(i). Since the line of motion is taken to be the top of a head, the ellipse is placed below and tangent to the line, at a vertical position of h(i) − 1.8d(i).

The final output candidate of the detector is the ellipse which had the best fit. If the tracker is currently without a target, the candidate found by the detector is tracked. Otherwise, the elliptical fit of the candidate is compared to the elliptical fit of the target being tracked. If it is greater, the tracker switches attention to the new candidate. Figure 3-6 shows, in white, the candidate found by the detector. However, the elliptical fit of this false positive is less than that of the actual head being tracked, the black ellipse. The false positive is ignored.

Figure 3-6: Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a False Positive. The tracking target is shown in black.

3.4 Head Tracking

The tracker uses a simple constant velocity model to predict the new position of the head. It then searches a region around the predicted position for a best-fit ellipse. The size of the region scales linearly with disparity under the assumption that human head acceleration is limited. A range of ellipse sizes, 14-21cm, is used to allow for changes in depth from one frame to the next and slight variation in curvature due to head rotation. Size is determined from the depth calculation made in the previous frame.

The fact that the detector and tracker use the same measure of elliptical fit provides for an alternate and perhaps simpler model of the detector-tracker synergy. For every new frame, a space is constructed in which an elliptical shape is searched for. The space is the union of a window around the predicted location of the head and a region under the highest coherent line of moving pixels. The best-fit ellipse in this space is taken to be the new position of the head.

3.4.1 Determining Accurate Depth

It is important that the depth of the tracker's final coordinate output be accurate. Small errors in depth may translate to significant pan and tilt errors in HAL's two teleconferencing cameras. As presented in section 3.3.2, normalized correlation alone is insufficient, for it gives discrete results. At a range of 300cm, a one-pixel error in disparity translates to a depth error of 30cm.
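This figure can be checked from (3.3). Differentiating z = fb/d with respect to d gives |Δz| ≈ (z²/fb)|Δd|. The focal length is not stated in pixels, but assuming the 39° field of view quoted in section 3.6 spans the 320-pixel image width gives f ≈ 160/tan 19.5° ≈ 452 pixels. With b = 7cm and z = 300cm, a one-pixel disparity step then corresponds to

$$|\Delta z| \approx \frac{z^2}{fb}\,|\Delta d| = \frac{(300\,\mathrm{cm})^2}{452 \times 7\,\mathrm{cm}} \approx 28\,\mathrm{cm},$$

in agreement with the 30cm figure above. (The pixel focal length here is an estimate derived from the stated field of view, not a calibrated value.)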
After a normalized correlation match is found, a sub-pixel disparity measurement is achieved by calculating a parametric optical flow across the template from the left image and its matching template from the right image. The parametric optical flow is purely translational. Figure 3-7 shows the template of the head as taken from the left image and its corresponding template from the right image.

Figure 3-7: Template of Head From Left and Right Image

Pure translational flow constrains all flow vectors to be equal and provides an elegant least squares solution to the brightness constraint equation

$$E_x u + E_y v + E_{lr} = 0 \tag{3.15}$$

Here, $E_{lr}$ is the change in intensity across the templates from the left and right images; u and v represent the sub-pixel flow from the left to the right templates. Given a disparity (ξ, η), $E_x$, $E_y$, and $E_{lr}$ are calculated as follows.

$$\begin{aligned}
E_x(x,y) = \tfrac{1}{4}\big[\, & I_l(x+1,y) - I_l(x,y) + I_l(x+1,y+1) - I_l(x,y+1) \\
 & + I_r(x+\xi+1,\, y+\eta) - I_r(x+\xi,\, y+\eta) + I_r(x+\xi+1,\, y+\eta+1) - I_r(x+\xi,\, y+\eta+1) \,\big] \\
E_y(x,y) = \tfrac{1}{4}\big[\, & I_l(x,y+1) - I_l(x,y) + I_l(x+1,y+1) - I_l(x+1,y) \\
 & + I_r(x+\xi,\, y+\eta+1) - I_r(x+\xi,\, y+\eta) + I_r(x+\xi+1,\, y+\eta+1) - I_r(x+\xi+1,\, y+\eta) \,\big] \\
E_{lr}(x,y) = \tfrac{1}{4}\big[\, & I_r(x+\xi,\, y+\eta) - I_l(x,y) + I_r(x+\xi,\, y+\eta+1) - I_l(x,y+1) \\
 & + I_r(x+\xi+1,\, y+\eta) - I_l(x+1,y) + I_r(x+\xi+1,\, y+\eta+1) - I_l(x+1,y+1) \,\big]
\end{aligned} \tag{3.16}$$

Each 2 x 2 pixel group in the template image gives a brightness constraint. The result is the over-determined system

$$\begin{bmatrix} \mathbf{E}_x & \mathbf{E}_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\mathbf{E}_{lr} \tag{3.17}$$

$\mathbf{E}_x$, $\mathbf{E}_y$, and $\mathbf{E}_{lr}$ are column vectors containing the gradient calculations from each 2 x 2 pixel group used by (3.16). For example,

$$\mathbf{E}_x = \begin{bmatrix} E_x(x, y) \\ E_x(x, y+1) \\ \vdots \\ E_x(x, y+h-1) \\ E_x(x+1, y) \\ E_x(x+1, y+1) \\ \vdots \\ E_x(x+w-1, y+h-2) \\ E_x(x+w-1, y+h-1) \end{bmatrix} \tag{3.18}$$

To solve for u and v, a pseudo-inverse is used.

$$\begin{bmatrix} u \\ v \end{bmatrix} = -\left(\begin{bmatrix} \mathbf{E}_x^T \\ \mathbf{E}_y^T \end{bmatrix} \begin{bmatrix} \mathbf{E}_x & \mathbf{E}_y \end{bmatrix}\right)^{-1} \begin{bmatrix} \mathbf{E}_x^T \\ \mathbf{E}_y^T \end{bmatrix} \mathbf{E}_{lr} \tag{3.19}$$

The final disparity is (ξ + u, η + v).
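The solve in (3.17)-(3.19) is a two-unknown least-squares problem. The sketch below is a minimal NumPy version, assuming the left template and its integer-disparity match from the right image have already been extracted as equal-sized arrays; `np.linalg.lstsq` solves the normal equations of (3.19) directly.

```python
import numpy as np

def subpixel_shift(tmpl_l, tmpl_r):
    """Translational optical flow between matched templates, equations
    (3.15)-(3.19).  Returns (u, v), the sub-pixel shift to add to the
    integer disparity found by normalized correlation."""
    l, r = tmpl_l.astype(np.float64), tmpl_r.astype(np.float64)
    # Gradients averaged over each 2x2 group and over both images (3.16).
    ex = 0.25 * ((l[:-1, 1:] - l[:-1, :-1] + l[1:, 1:] - l[1:, :-1]) +
                 (r[:-1, 1:] - r[:-1, :-1] + r[1:, 1:] - r[1:, :-1]))
    ey = 0.25 * ((l[1:, :-1] - l[:-1, :-1] + l[1:, 1:] - l[:-1, 1:]) +
                 (r[1:, :-1] - r[:-1, :-1] + r[1:, 1:] - r[:-1, 1:]))
    elr = 0.25 * ((r[:-1, :-1] - l[:-1, :-1]) + (r[1:, :-1] - l[1:, :-1]) +
                  (r[:-1, 1:] - l[:-1, 1:]) + (r[1:, 1:] - l[1:, 1:]))
    # One brightness constraint per 2x2 group, solved in least squares.
    A = np.column_stack([ex.ravel(), ey.ravel()])
    b = -elr.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```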
3.5 Transforming Coordinates

As the head is tracked in the images from the stereo camera and accurate measurements of disparity are made, pan and tilt directives must be calculated to drive the movement of the teleconferencing cameras. This is done in three steps. First, the real-world position of the head, in Cartesian coordinates relative to the stereo camera pair, must be found. This position is then transformed into Cartesian coordinates relative to each of the two teleconferencing cameras, using knowledge of the cameras' relative orientations. Finally, these transformed Cartesian coordinates are mapped to polar pan and tilt values to drive the movement of the teleconferencing cameras.

To transform the location and disparity, (x, y, d), of the head being tracked in the image to Cartesian coordinates relative to the stereo pair, $(x_{sp}, y_{sp}, z_{sp})$, the projection equations are used [6].

$$x_{sp} = \frac{b}{d}x \tag{3.20}$$

$$y_{sp} = \frac{b}{d}y \tag{3.21}$$

$$z_{sp} = \frac{b}{d}f \tag{3.22}$$

The Cartesian coordinates are then multiplied by a rotation/translation matrix for each of the two teleconferencing cameras in HAL, resulting in coordinates $(x_{tc}, y_{tc}, z_{tc})$ relative to each teleconferencing camera.

$$\begin{bmatrix} x_{tc} \\ y_{tc} \\ z_{tc} \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta\sin\phi & -\sin\theta\cos\phi & \cos\theta\,\Delta x - \sin\theta\,\Delta z \\ 0 & \cos\phi & -\sin\phi & \Delta y \\ \sin\theta & \cos\theta\sin\phi & \cos\theta\cos\phi & \sin\theta\,\Delta x + \cos\theta\,\Delta z \end{bmatrix} \begin{bmatrix} x_{sp} \\ y_{sp} \\ z_{sp} \\ 1 \end{bmatrix} \tag{3.23}$$

In the above equation, φ is the downward tilt angle of the stereo pair. Δx, Δy, and Δz represent the distance from the teleconferencing camera to the stereo pair. This translation is described in the tilt-corrected coordinate frame of the stereo pair. θ represents how far the teleconferencing camera is rotated in the xz-plane, again relative to the tilt-corrected stereo pair.

Using inverse tangent relationships, the coordinates of the head relative to the teleconferencing cameras are further transformed into pan and tilt directives. These pan and tilt values are sent over a serial line to drive the movement of the cameras.

$$pan = \tan^{-1}\!\left(\frac{x_{tc}}{z_{tc}}\right) + \frac{\pi}{2}\left[1 - \operatorname{sign}(z_{tc})\right]\operatorname{sign}(x_{tc}) \tag{3.24}$$

$$tilt = \tan^{-1}\!\left(\frac{y_{tc}}{z_{tc}}\right) \tag{3.25}$$

The teleconferencing cameras used in this system have a 180° pan and 90° tilt range.
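A compact version of the full chain from (3.20) through (3.25) is sketched below, with one camera's angles and offsets treated as configuration. This is an illustration under the conventions of (3.23) as reconstructed here (a tilt correction about the x-axis followed by a translation and a rotation about the y-axis); `atan2` folds in the sign correction of (3.24).

```python
import numpy as np

def to_pan_tilt(x, y, d, f, b, phi, theta, dx, dy, dz):
    """Image position (x, y) and disparity d -> (pan, tilt) in radians
    for one steerable camera.  phi: downward tilt of the stereo pair;
    theta, (dx, dy, dz): the camera's rotation and offset in the
    tilt-corrected stereo frame, per equation (3.23)."""
    # (3.20)-(3.22): Cartesian coordinates relative to the stereo pair.
    p_sp = (b / d) * np.array([x, y, f])
    # (3.23): tilt-correct, translate, then rotate into the camera frame.
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(phi), -np.sin(phi)],
                   [0.0, np.sin(phi),  np.cos(phi)]])
    ry = np.array([[np.cos(theta), 0.0, -np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [np.sin(theta), 0.0,  np.cos(theta)]])
    x_tc, y_tc, z_tc = ry @ (rx @ p_sp + np.array([dx, dy, dz]))
    # (3.24)-(3.25): atan2 extends atan(x/z) to the full pan circle.
    return np.arctan2(x_tc, z_tc), np.arctan(y_tc / z_tc)
```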
3.6 Equipment

The stereo camera pair used is the STH-V2 made by Videre Design, shown in Figure 3-8. It can output left and right video signals at a resolution of 320 x 240, or one interlaced signal combining 320 x 120 subsampled images. The latter mode is used for this thesis. The lenses used give the camera a 39° field of view.

Figure 3-8: Stereo Camera Pair

The stereo images are captured using a Matrox Meteor frame grabber and processed on a 600MHz Pentium III. The steerable teleconferencing cameras used are Sony EVI-D30s, shown in Figure 3-9.

Figure 3-9: Steerable Teleconferencing Camera

Chapter 4
Results

4.1 Head Detection and Tracking Results

It is difficult to quantitatively assess the behavior of the head detection and tracking system. Statistics such as deviation from ground truth, rate of false positives, and mean time to detection all depend greatly on the situation. How fast, how often, and in what direction does the subject's head move? Is the head ever severely occluded? What types of background clutter are present? How often does motion occur above the head? Statistics for a system such as this are only meaningful when collected over hundreds of real-world trials. And even then, subjects must be carefully instructed to act naturally, to neither intentionally try to fool the system nor be overly cautious. In light of these difficulties, this section instead provides a qualitative assessment of the system. Situations are presented which illustrate both the system's strengths and weaknesses. And comparisons are drawn to other methods considered during the development of the system.

4.1.1 Continuous Detection

Figure 4-1 illustrates the success and importance of continuous detection. Here, as the head moves downward in the image, the elliptical tracker is held fast by the high-contrast edge between the couch and the back wall and loses the curved bottom edge of the jaw. By the sixth frame, it is following the front of the hairline rather than the perimeter of the face. At this point, the tracker's hold on the head is in jeopardy. Further downward movement will likely cause the tracker to lose the head entirely. In the seventh frame, however, the candidate presented by the continuously running detector is a better fit than the current target, and the system is restabilized.

Another benefit of continuous detection is that it eliminates the need for initialization. Many head tracking systems ignore the issue of initialization. Others have manual procedures. The focus during the design of this system was always kept on usability. Figure 4-2 shows each frame of an initialization sequence. In the first frame of this sequence, the tracker is without a target. By the second frame, enough of the torso has come into the frame to register motion. The detector notices this and presents the torso, as its best-fit candidate, to the tracker. During the next four frames, the tracker, having no better target, follows the torso. When enough of the head comes into the frame, the continuously running detector recognizes that the head is a better-fit target than the torso, and switches the tracker's attention to the head.

4.1.2 Occlusion

The nature of the elliptical tracker makes the system resistant to partial occlusion. Figure 4-3 shows every fourth frame of a sequence in which the tracker is unaffected by arms occluding the stereo camera's view of the head. This resistance is a result of taking the average edge strength around the perimeter of the ellipse. Objects partially occluding the head generally leave enough of the head's edges visible to keep this average high. In fact, human heads are rarely the exact shape for which the ellipse filter is searching. Close examination of the position of the ellipse as the system is tracking reveals that it is usually following just one or two high-contrast curves and not the entire perimeter of the head.

Figure 4-1: Consecutive Frames of a Sequence Illustrating the Benefits of Continuous Detection. The frame order in this sequence, and all others in this document, is left to right, top to bottom.

Template-based trackers, which determine the new position of an image patch by computing a normalized correlation (3.4), can easily fail under situations of occlusion. The occluding object often pulls the template off the target object. To illustrate this, Figures 4-4 and 4-5 show consecutive frames of an occlusion sequence. In the first sequence, an elliptical tracker is used. The occluding object is properly ignored. In the second sequence, a template tracker is used and is pulled off the subject's head by the occluding hand.

4.1.3 Out-of-plane Rotation

The elliptical shape tracker was also chosen for its ability to handle out-of-plane head rotation. Techniques such as template tracking that follow patterns of brightness fail upon object rotation, when the pattern becomes self-occluded. In most scenarios, the assumption that the head will not rotate significantly is invalid. The elliptical tracker instead relies upon the shape of the perimeter of the head as projected onto the image plane. This shape is nearly elliptical, regardless of head rotation. Figure 4-6 shows a rotation sequence through which the tracker remains correctly locked on the head.

4.1.4 Head Decoys

Figure 4-7 shows the results of a situation contrived to demonstrate irrecoverable system failure. This can happen when the tracker locks on to an object whose elliptical shape is a far better fit than a human head. As long as the elliptical object is visible, the tracker will never switch its attention away from it. In this figure, the black ellipse shows the tracking target. The white ellipse shows the detector candidate. A high-contrast drawing of a head-sized ellipse was made as a decoy to steal the attention of the tracker. In the first six frames of the sequence, the tracker correctly follows the head. In the seventh frame, the drawing is moved slightly. The detector notices the decoy and determines that it is a better-fit ellipse than the head being tracked. At this point, the tracker switches its attention to the drawing. In the last three frames, although the detector is finding the true human head, the elliptical fit of the decoy is far stronger and holds the attention of the tracker.

Although this situation is contrived, it illustrates a major weakness. The system detects and tracks ellipses, not heads. Objects in the background that fit ellipses better than human heads do are detriments to the system. If these decoys are moved, or if a tracked head passes directly in front of them, the system can fail.

4.1.5 Accuracy of Depth Calculation

Section 3.4.1 describes a technique for accurately calculating depth to the head. After a discrete disparity value is found via normalized correlation, a sub-pixel shift is calculated across the matching templates from the left and right images. This shift is found using parametric optical flow and is added to the preliminary discrete value to obtain a more accurate disparity. To test the validity of this approach, a sequence was taken of cyclical head movement towards and away from the stereo pair, shown in Figure 4-8. The depth determined by the tracker is plotted in Figure 4-9.
For comparison, the plot also shows, with a dotted line, the discrete output of the normalized correlation calculation alone. Accurate disparity calculations are critical in a teleconferencing scenario. In HAL, the teleconferencing cameras' two views of the couch are nearly orthogonal to that of the stereo pair. A slight error in depth can translate to a significant error in the pan or tilt angle of a teleconferencing camera. At frame 110 of the depth test sequence, Figure 4-9 shows a discrepancy of 15cm between the depth calculated with and without parametric optical flow. If a teleconferencing camera were aiming for a tight shot of the head, an error of this magnitude could result in undesirable cropping of the head. Measuring sub-pixel disparities increases the accuracy of depth calculations by roughly one order of magnitude and is a critical feature of this system.

4.2 Teleconferencing Results

The major drawback to the cameras used in this system is that their maximum drive speed is only 80° per second. When the subject is close and moving laterally, the cameras can take on the order of seconds to move to a new position. Hence, they cannot react quickly to a head moving out of their field of view, despite the fact that real-time tracking data is available. Aside from this caveat, the system works well. In most situations, the cameras provide well-centered close-up shots of the head.

Figure 4-2: Consecutive Frames of an Initialization Sequence

Figure 4-3: Every Fourth Frame of an Occlusion Sequence

Figure 4-4: Consecutive Frames of an Occlusion Sequence Using Ellipse Tracking

Figure 4-5: Consecutive Frames of an Occlusion Sequence Using Template Tracking

Figure 4-6: Every Tenth Frame of a Rotation Sequence

Figure 4-7: Every Tenth Frame of a Decoy Sequence

Figure 4-8: Every Tenth Frame of a Depth Test Sequence. The first image in this sequence is frame 50. The last is frame 130. The image templates used in the correspondence calculation are shown in the bottom left corner of each frame. A plot of depth to the head in this sequence can be found in Figure 4-9.

Figure 4-9: Plot of Depth to Head Over Time. The dotted line plots depth as derived from the discrete output of the normalized correlation calculation alone. The solid line plots depth after the sub-pixel optical flow results have been added to the disparity calculation. [Axes: depth in cm (220-280) versus frame number (40-140).] The video sequence corresponding to this plot can be seen in Figure 4-8.

Chapter 5
Conclusion

A system has been developed which detects and tracks the head of a subject moving about within HAL, an intelligent environment. It monitors activity in the room through a stereo camera pair. The detector works by looking for an elliptical shape in a search space constrained by motion and size cues. The tracker follows the elliptical shape until the detector presents one which is a better fit. Depth to the head is calculated using normalized correlation and refined in accuracy by determining the parametric optical flow across the matched image templates from the left and right cameras. To test and demonstrate the system, an automated teleconferencing application was developed. The three-dimensional coordinates of the subject's head are transformed into polar coordinates to drive the pan and tilt of two steerable cameras. The test was a success.
Although the steerable cameras move slowly and cannot keep up with a quickly moving head, they eventually center on the head when it comes to rest. The system is robust against partial occlusion, rotation, changes in lighting, and variation in hair and skin color.

5.1 Future Work

One clear path for future work is to extend the system to support the detection and tracking of multiple heads. There is nothing in the design of the current system to prevent such an extension. Rather than tracking the best-fit ellipse in the scene, the system could simply track all candidates whose elliptical fits exceeded a threshold. The detector, rather than switching the attention of the tracker, could instead spawn new trackers. Additional logic could be added to retire trackers whose targets do not move for long periods of time. As a side effect, extending the system to support multiple heads may alleviate the problem illustrated in Figure 4-7, whereby a decoy can permanently steal the attention of the tracker.

As was mentioned in section 2, some of the more robust head detection and tracking systems have been the result of multi-modal approaches. And the authors of such systems generally claim that adding more modes increases performance. Two modes which could be added to future versions of this system are skin color detection and pattern recognition. Both of these were considered during the development of the existing system but never implemented. Color was not available from the monochrome stereo pair. And existing pattern recognition systems, such as CMU's [10], failed due to the downward viewing angle of the stereo pair. There are, however, other cameras in the room which could supply skin color and face detection information to the system, narrowing the search for heads in the image from the stereo pair to epipolar regions. Additionally, a pattern recognition module could be specially trained to detect heads from the downward viewing angle of the stereo pair.

One problem with the existing system is the time it takes for the teleconferencing cameras to drive to a new position. It might be advantageous for future versions of the system to account for this delay. The current system, when commanding the cameras to drive, relays the position of the head as of the time the command is issued. The system might instead predict where the head will be by the time the cameras finish the movement and relay this position.

Another solution to the slow camera problem is to digitally crop the head from wider-angle shots. In each frame captured by the teleconferencing cameras, head position could be used to select a cropping region. The resolution of the resulting sequence would, of course, be reduced. But this reduction is already common practice in teleconferencing scenarios due to limited bandwidth.

Bibliography

[1] D. Beymer and K. Konolige. Real-time tracking of multiple people using continuous detection. IEEE Conference on Computer Vision and Pattern Recognition, 1999.

[2] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. IEEE Conference on Computer Vision and Pattern Recognition, pages 232-237, Santa Barbara, CA, June 1998.

[3] M. Coen and K. Wilson. Learning spatial event models from multiple-camera perspectives. Annual Conference of the IEEE Industrial Electronics Society, 1999.

[4] T. Darrell, G. Gordon, M. Harville, and J. Woodfill. Integrated person tracking using stereo, color, and pattern detection. IEEE Conference on Computer Vision and Pattern Recognition, pages 601-609, Santa Barbara, CA, June 1998.
[5] M. Fleck, D. Forsyth, and C. Bregler. Finding naked people. European Conference on Computer Vision, volume 2, pages 592-602, 1996.

[6] B. K. P. Horn. Robot Vision. The MIT Press, Cambridge, Massachusetts, 1986.

[7] K. Konolige. Small vision systems: Hardware and implementation. Eighth International Symposium on Robotics Research, Hayama, Japan, October 1997.

[8] S. McKenna and S. Gong. Tracking faces. International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, October 1996.

[9] C. Morimoto, D. Koons, A. Amir, and M. Flickner. Real-time detection of eyes and faces. Workshop on Perceptual User Interfaces, San Francisco, CA, November 1998.

[10] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, January 1998.

[11] H. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, June 1998.

[12] B. Scassellati. Eye finding via face detection for a foveated, active vision system. National Conference on Artificial Intelligence, Madison, WI, 1999.

[13] J. Shi and C. Tomasi. Good features to track. IEEE Conference on Computer Vision and Pattern Recognition, pages 593-600, 1994.

[14] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-785, July 1997.