Smart Surveillance: Tracking and Classification of Moving Pedestrians

Brigit A. Schroeder
Department of Computer Science
University of Massachusetts, Lowell
1 University Drive, Lowell, MA 01854
bschroed@cs.uml.edu

ABSTRACT
In this paper, a system for pedestrian tracking and classification is introduced, using histograms of oriented gradients (HOG) as feature descriptors and Naïve Bayes for classification. Tracking and target segmentation are performed using a basic blob tracker which relies largely upon frame differencing to perform motion detection. The work in this paper shows that pedestrians can potentially be classified using a summarized version of the original HOG descriptor, with an increased detection rate and a reduced false alarm rate. The classifier in this paper was trained on well-known and established pedestrian datasets from INRIA and MIT. The classification results reveal the need for multi-scale target classification and for similarity between the training and test data.

Author Keywords
Histogram of Oriented Gradients, Pedestrian Tracker, Blob Tracker, Naïve Bayes Classifier, Artificial Intelligence.

INTRODUCTION
For this project, I used computer vision-based object tracking and probabilistic recognition methods to automatically identify pedestrians in fixed surveillance video. Tracking moving objects in surveillance video provides a way to nominate and segment out targets of potential interest, and adding a layer of recognition to identify targets adds degrees of intelligence and automation to an ordinary surveillance system. To build the most robust system, the challenges of detecting and recognizing targets under varying scale (determined by target distance from the camera) and illumination need to be overcome.

Figure 1. Overview of pedestrian tracking and classification system.

PROJECT DESCRIPTION
The problem presented in this paper can be broken into two individual parts, (1) target tracking and (2) target classification, which need to be knit together well to achieve a smarter surveillance system.

Target Tracking
Tracking can be achieved using a computer vision-based approach of blob tracking, which relies heavily upon detecting change in the scene frame-to-frame (e.g. frame differencing) and characterizing each region of change as a moving target (e.g. defining a blob via connected components labeling).

Figure 2. Example of initial image processing steps of the blob tracker: frame differencing to create "blobs". Red boxes are detected targets used for target segmentation in the original surveillance video.

Difference images are simply the absolute difference of pixel values between two successive frames of video, which are thresholded and converted to binary images (pixels have a value of either 0 or 1). The binary images differentiate foreground from background pixels, where the foreground pixels (value = 1) are presumed to be potential moving targets. The binary images undergo image dilation to enlarge region boundaries and fill holes. The resulting image (seen in the figure above) is then "regionalized" into individual blobs using connected components labeling. The connected components algorithm treats the binary image as a graph and checks each pixel for 4- or 8-way connectivity with neighboring pixels of the same value in order to group them, which can be done in one or multiple passes over the image. The method presented here is fairly rudimentary for a tracker, as it is primarily intended for target nomination and segmentation.
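The blob-detection front end just described can be summarized in a few lines of code. The following is a minimal sketch, assuming OpenCV and NumPy; the function name and the threshold and area parameters are illustrative choices, not values from the original system.

```python
import cv2
import numpy as np

def detect_blobs(prev_frame, curr_frame, diff_thresh=25, min_area=50):
    """Nominate moving targets via frame differencing and
    connected components labeling (illustrative sketch)."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Absolute difference of successive frames, thresholded to a binary image
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, binary = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)

    # Dilation enlarges region boundaries and fills holes
    binary = cv2.dilate(binary, np.ones((5, 5), np.uint8), iterations=2)

    # "Regionalize" the binary image into blobs (8-way connectivity)
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(
        binary, connectivity=8)

    # Bounding boxes of sufficiently large blobs; label 0 is the background
    boxes = []
    for i in range(1, n_labels):
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((x, y, w, h))
    return boxes
```

Each returned bounding box can then be used to crop a target chip from the original frame for classification.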
Once a target region has been identified, it can be segmented out from the video frame and passed to the classification system. I experimented with a variation of the blob tracker which used a combination of building a simple background model and then differencing new frames against the model. The background model I tested was a moving average of the incoming frames (a very simplified approach to background modeling), which did a nice job of removing much of the video image noise and making pedestrians in tight groups more distinct. The model worked very well for scenes with mostly pedestrians and low-contrast vehicles, but light vehicles would leave a ghosted footprint that caused a large number of false detections. Applying a single Gaussian background model, which calculates a new mean and standard deviation every frame with which the model is updated (as described in [1]), would probably facilitate cleaner, more robust target segmentation in the future.

Figure 3. Example of final output of the pedestrian tracker, which extracted 64x128 targets for classification. The tracker worked well for visible and somewhat occluded targets, but also generated some false alarms, seen in the street. The tracker was tuned to intentionally produce some false alarms for classification purposes.

Target Classification
Before target classification can occur, a Naïve Bayes classifier for each class (e.g. pedestrian, background) was trained and tested offline using features extracted from class (e.g. pedestrian vs. non-pedestrian) images. The general train-test flow for target classification is shown in Figure 4.

Figure 4. Train-test pipeline for classification.

Feature Extraction
The first stage involves extracting a set of distinctive features from a large set of representative training data (e.g. a collection of pedestrian image chips or background non-pedestrian data) into feature vectors. Histogram of oriented gradients (HOG) descriptors have been shown by Dalal and Triggs [2] to perform well in pedestrian identification, as they robustly define the local geometry and shape of each target. Histograms of oriented gradients are created by calculating the gradient magnitude (1) and orientation (2) of each pixel's intensity:

m(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}    (1)

\theta(x, y) = \arctan\left(G_y(x, y) / G_x(x, y)\right)    (2)

where G_x and G_y are the horizontal and vertical intensity gradients at each pixel. The gradient magnitude is binned based upon its unsigned orientation (0-180 degrees) and the resulting histogram is L2-normalized. The local normalization is applied to make each feature illumination invariant. Each bin in the histogram represents the amount of edge strength (e.g. gradient) over a range of orientations. Stringing a series of normalized histograms together, one for each gridded sub-region in the target image, forms a HOG descriptor for each target image, as seen in Figure 5.

Figure 5. Example of how a HOG descriptor is formed: a histogram of oriented gradients is created for each gridded sub-region of the image using the gradient magnitude (the gridded images are gradient images) and its respective orientation. Note that as the number of cells in the grid increases, the length of the HOG descriptor increases.

More specifically, each 64x128 pixel image chip in this project was sub-divided into an 8x16 grid of 8x8-pixel cells. The cells were then grouped into overlapping blocks of 2x2 cells. Each cell is represented as a 9-bin histogram of oriented gradients, and each 2x2 block is represented by a 4x9 (36-element) vector of local histograms which is L2-normalized.
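As a concrete illustration of the descriptor layout just described, here is a minimal NumPy sketch, written for this paper rather than taken from the project source; the gradient, binning, and normalization details (central differences, hard bin assignment, a small epsilon in the norm) are simplifying assumptions.

```python
import numpy as np

def hog_descriptor(chip, cell=8, bins=9):
    """HOG sketch for a 64x128 grayscale chip: 9-bin unsigned-orientation
    histograms per 8x8 cell, grouped into overlapping 2x2-cell blocks,
    each block L2-normalized."""
    img = chip.astype(np.float64)

    # Per-pixel gradients, magnitude (1) and unsigned orientation (2)
    gy, gx = np.gradient(img)
    mag = np.sqrt(gx**2 + gy**2)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # 0-180 degrees

    # One 9-bin, magnitude-weighted histogram per 8x8 cell (16x8 cells)
    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            hist[cy, cx] = np.bincount(bin_idx[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=bins)

    # Overlapping 2x2-cell blocks, L2-normalized: 15 x 7 x 36 = 3,780
    blocks = []
    for by in range(n_cy - 1):
        for bx in range(n_cx - 1):
            v = hist[by:by + 2, bx:bx + 2].ravel()
            blocks.append(v / (np.linalg.norm(v) + 1e-6))
    return np.concatenate(blocks)
```

For a 64x128 chip this produces exactly the 3,780-element vector described in the next section.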
The final feature vector for a single image chip is quite large: a 9 x 4 x 7 x 15 (= 3,780) element feature vector (9 bins x 4 cells per block x 7 x 15 overlapping block positions), which is formally known as a "HOG descriptor." The large feature vector size can potentially be a challenge in terms of system processing speed and memory, depending upon the platform and optimizations used. In an effort to reduce the dimensionality of the large feature vector, I experimented with summarizing each local 9-bin histogram (pre-normalization) by its local maximum, mode and mode frequency (number of occurrences of the mode); a sketch of this summarization appears at the end of this subsection. The general thought behind this is that these summarized characteristics would represent the dominant local orientation of a given feature (covered by a cell region). This reduced the feature vector to one third of its original size, 1,260 elements. Each histogram summarization (maximum/mode/mode frequency) could by itself be considered a 420-element feature vector. During the process of training and testing feature sets, it is not unusual to test different combinations of feature models to find the one that provides the best target classification. In the end, I extracted the full HOG descriptors for each image chip, in addition to the "summarized" HOG descriptors for histogram maximum/mode/mode frequency, for training and testing (described later in this paper).

There are many published implementations of HOG descriptors where the feature extraction has been simplified by using non-overlapping (e.g. regular) blocks of image chip cells, and where dimensionality reduction is introduced via principal components analysis (PCA) [3, 4]. This is formally called PCA-HOG.
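The histogram summarization might look like the following NumPy sketch. It assumes the per-cell weighted histograms from the previous sketch, along with matching unweighted per-bin pixel counts (computable with the same bincount call without weights); the (maximum, mode, mode frequency) triple reflects one plausible reading of the description above, not the project's exact implementation.

```python
import numpy as np

def summarized_hog(hist, counts):
    """Reduce each pre-normalization 9-bin cell histogram to
    (maximum, mode, mode frequency), then assemble cells into the same
    overlapping 2x2-cell blocks as the full descriptor:
    105 blocks x 4 cells x 3 values = 1,260 elements.
    hist:   (16, 8, 9) magnitude-weighted cell histograms
    counts: (16, 8, 9) unweighted per-bin pixel counts"""
    n_cy, n_cx, _ = hist.shape
    maximum = hist.max(axis=2)            # peak edge strength per cell
    mode = hist.argmax(axis=2)            # dominant orientation bin
    mode_freq = np.take_along_axis(       # pixel count in the dominant bin
        counts, mode[..., None], axis=2)[..., 0]

    summary = np.stack([maximum, mode, mode_freq], axis=2)   # (16, 8, 3)
    blocks = []
    for by in range(n_cy - 1):
        for bx in range(n_cx - 1):
            blocks.append(summary[by:by + 2, bx:bx + 2].ravel())
    return np.concatenate(blocks)         # 105 x 12 = 1,260 elements
```

Keeping only one of the three per-cell values (e.g. the maximum) yields the 420-element variant discussed in the results.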
Training and Testing
The training stage uses the set of feature vectors extracted in the previous step to train a Naïve Bayes classifier for each class type, in this case pedestrian and non-pedestrian (e.g. background). A Naïve Bayes classifier is a simple generative classifier which conveniently (or "naively") assumes conditional independence between all the features in a class. The training stage involves estimating the model parameters for each class C, which include the class priors and a per-feature probability distribution.

The priors for each class can be estimated based upon the percentage of class samples in the training set:

prior(C = class n) = (number of class n samples) / (total number of training samples)

The mean \mu and standard deviation \sigma for each element F of the extracted feature vector are calculated over a large set of class features, assuming a normal (Gaussian) probability distribution, and make up the elements of the classifier.

In the testing phase, each class image sample is then classified by extracting the same HOG features and using the Naïve Bayes classifier to calculate the posterior probability (or likelihood) of each class type. Bayes' theorem is used to calculate the posterior probability of each class type for a given sample based upon the extracted features F_1, ..., F_n:

p(C = class n | F_1, ..., F_n) = p(C = class n) \, p(F_1, ..., F_n | C = class n) / p(F_1, ..., F_n)

where p(C = class n) is the class prior, p(F_1, ..., F_n | C = class n) are the class feature probabilities, and p(F_1, ..., F_n) are the class feature marginal probabilities. The marginal probabilities are often not used in calculating the posterior probability because they do not depend on the class C. The class feature probabilities are first calculated using the classifier data from the training phase, the mean \mu and standard deviation \sigma, and the class feature value v for each extracted test feature:

p(F = v | C) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(v - \mu)^2}{2\sigma^2}\right)

This assumes a continuous probability distribution function. The class with the maximum likelihood is then chosen as the target classification:

classification = \arg\max_C \; p(C) \prod_i p(F_i | C)

For the two-class classification system in this project, two posterior probabilities, P(pedestrian | F_1, ..., F_n) and P(background | F_1, ..., F_n), were calculated for each extracted surveillance video target. The target classification is made by choosing the class with the maximum probability.

For this project, I attempted to write my own implementation of the Naïve Bayes classifier (see project source code). The training portion of the classifier was successful, but the testing portion ran into numeric overflow issues. This problem was due to the product of feature probabilities over the large HOG descriptor (both regular and summarized) growing out of range multiplicatively. This implementation also suffered from a form of overfitting, where any zero or very small feature probabilities would force the posterior to zero or possibly introduce too much noise. Though Laplace smoothing would address the overfitting issue, it still would not address the numerical overflow. As such, I used the Matlab Naïve Bayes implementation, which uses log-likelihoods (instead of raw likelihoods) and fixed the numerical stability issues.
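In the same spirit as that fix, a log-likelihood Gaussian Naïve Bayes classifier can be sketched as follows. This is an illustrative NumPy version, not the project's code or the Matlab implementation; summing per-feature log-densities replaces the product that overflowed.

```python
import numpy as np

class GaussianNaiveBayes:
    """Sketch of a Gaussian Naive Bayes classifier using log-likelihoods
    for numerical stability (illustrative, not the project source)."""

    def fit(self, X, y):
        # X: (n_samples, n_features) descriptors; y: integer class labels
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.sigma = np.array([X[y == c].std(axis=0) + 1e-9
                               for c in self.classes])  # guard zero variance
        return self

    def predict(self, X):
        # Per class: sum of per-feature Gaussian log-densities + log prior
        log_post = []
        for c in range(len(self.classes)):
            ll = (-0.5 * np.log(2 * np.pi * self.sigma[c] ** 2)
                  - (X - self.mu[c]) ** 2 / (2 * self.sigma[c] ** 2))
            log_post.append(ll.sum(axis=1) + np.log(self.prior[c]))
        # Maximum (log-)posterior wins
        return self.classes[np.argmax(log_post, axis=0)]
```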
Data Set
The image datasets that were evaluated for training and testing the target classifiers are from the INRIA dataset repository, a well-known source of large annotated datasets of various kinds (e.g. pedestrians, animals, cars, etc.), and also the MIT Pedestrian Dataset from the MIT Center for Biological and Computational Learning. The datasets contain image chips of pedestrians of varying poses and sizes, in crowded and uncrowded scenarios. All of the chips in a given set are of uniform size. The non-pedestrian examples were randomly sub-sampled from full-sized negative source images provided with the training sets, at a rate of 3 samples per frame to avoid target redundancy. These images were typically urban street and sidewalk scenes with few or no people in them. The process of randomly sampling negative targets was used in the original Dalal and Triggs paper [2]. The datasets were then filtered by hand to remove any human targets that may have been extracted.

For classification, it was extremely important that the content and scale of the training images be reflected in the images to be tested, i.e. the targets extracted from surveillance video. For example, if a classifier is trained on targets of a single scale, targets of varying scale would have a higher likelihood of misclassification. The afore-mentioned datasets are very large, and a cross-section of image chips matching the likely surveillance scenarios needed to be extracted. The video that was analyzed was collected with a surveillance camera viewing a 'typical' surveillance scene, such as an urban scene with pedestrians and non-pedestrian moving targets such as cars.

Some of the challenges I found while using the various datasets and surveillance video were related to scale, image context, and sensor noise. The classifier was trained on a set of 64x128 pixel targets with pedestrians of varying poses and sizes; the targets extracted from the surveillance video were of varying size as well, but the classifier seemed to have trouble classifying segmented targets that were not within a certain scale range of the training chips or were off-center in the image chip. Also, targets extracted from surveillance video with many "confusing" features in the background, such as bright crosswalks, seemed to throw the classifier off. Another issue was the large amount of sensor noise in the low-resolution surveillance video, which may have been picked up when calculating the image gradients for the HOG descriptor. The image datasets appeared to be taken with cameras with higher-resolution CCDs, while the surveillance video was generally very grainy. A Gaussian low-pass filter would help in reducing the image noise; however, in case studies done by Dalal and Triggs [2], they found that a HOG descriptor-based classifier worked best without Gaussian pre-filtering.

Dataset          Positive Samples   Negative Samples
MIT Pedestrian   500                5000
INRIA Person     500                1491

Table 1. Datasets for training and testing: positive samples were 64x128 image chips of pedestrians in various poses. Negative samples of the same size were randomly sub-sampled from urban traffic scenes.

ANALYSIS OF RESULTS

Figure 6. Example of pedestrian classifier training data from the INRIA collection.

The performance of classification can be tested against a labeled set of data containing a mixture of known class types, where the probability of detection and the false alarm rate for each classifier can be calculated from a confusion matrix:

                 predicted p       predicted n
actual p         true positives    false negatives
actual n         false positives   true negatives

In this confusion matrix, p corresponds to positive examples and n corresponds to negative examples, which are pedestrian and non-pedestrian examples for this project. The numbers of primary interest are the numbers of true positives and false positives (also called false alarms), which can be used to calculate the pedestrian classifier's probability of detection and false alarm rate:

P(true detection) = # true positives / (# true positives + # false negatives)

P(false alarm) = # false positives / (# false positives + # true negatives)
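These two rates are straightforward to compute from labeled predictions. The helper below is an illustrative sketch (not project code) that tallies the confusion-matrix entries and returns both rates.

```python
import numpy as np

def detection_rates(y_true, y_pred, positive=1):
    """Probability of detection and false alarm rate, following the
    two formulas above (assumes both classes appear in y_true)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    p_detect = tp / (tp + fn)       # detections over all real pedestrians
    p_false_alarm = fp / (fp + tn)  # false alarms over all real background
    return p_detect, p_false_alarm
```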
For the training and testing of the MIT Pedestrian and INRIA Person datasets, the datasets were combined for each class and then randomized. The pedestrian and non-pedestrian classifiers were trained on 80% of the data set and tested on the remaining 20%. In the original Dalal and Triggs paper [2], it appeared that approximately 10% of the training examples were positive and the remaining 90% negative. In my training and testing set, approximately 15% are positive and the remainder negative. The large bias in the dataset toward the non-pedestrian class allows for better generalization of background targets, which could be any number of objects (e.g. cars, posts, signs, street, etc.).

Initial tests with full HOG descriptors and summarized HOG descriptors (using maximum value/mode/mode frequency) showed that the summarized HOG descriptor generally had better performance. Further refinement revealed that using the maximum value of the histograms alone performed best, over various combinations of the three (or the other two by themselves). The results presented below show that the 420-element summarized HOG descriptor performed better than the 3,780-element HOG descriptor, with a higher true positive rate and a lower false alarm rate for the MIT and INRIA datasets.

The results in Table 2 show the range of values obtained after several rounds of testing to establish a baseline. In general, the performance was much better than initially anticipated:

Feature Type                  True Positive   False Alarm
HOG                           76-80%          4-5%
Summarized HOG (Max Value)    91-94%          1.5-2%

Table 2. Summary of classification rates for the regular HOG descriptor vs. the summarized HOG descriptor (using the max of each histogram) on the MIT and INRIA datasets ("ideal data"). Classifiers were trained on 80% of the randomized data and tested on the remaining 20%, also randomized, for each class type.

The results in Table 3 show classification rates for target chips extracted from the surveillance video processed by the tracker. The classification rates were not as good, due to the issues described previously in the Data Set section (e.g. target scale, noise, etc.), but the summarized HOG descriptor still showed a consistent 15% performance increase over the standard HOG descriptor. The caveat with these results is that the tracker chip set was much smaller, due to the limited number of pedestrians and false alarms that could be extracted from a few videos.

Feature Type                  True Positive   False Alarm
HOG                           0%              50%
Summarized HOG (Max Value)    14%             43%

Table 3. Summary of classification rates for the regular HOG descriptor vs. the summarized HOG descriptor (using the max of each histogram) on a random mixture of targets extracted from the object tracker. Classifiers were trained on 80% of the randomized data from the MIT and INRIA datasets.

The object tracker generally performed as intended, consistently tracking moving objects and noise in the scene. There were false detections (e.g. non-moving objects) in areas of the scene with consistent image noise, which was good for testing purposes.

DISCUSSION
The analysis of classification would have benefited from n-fold cross-validation testing (e.g. "leave one out" testing) to get a better cross-section of performance (a sketch of such a scheme appears at the end of this section), and also from training on one specific dataset and then testing on the other (e.g. train on the INRIA dataset, test on the MIT dataset) rather than mixing them together. This was suggested to me by a co-worker who has a PhD in computer vision.

In general, most pedestrians extracted by the object tracker were classified as background, except in cases where pedestrians were clearly defined against a flat, uncluttered background (e.g. no road stripes) and were centered in the target chip. True background objects were almost never classified as pedestrians. This points to the need for a multi-scale detector, which the time scope of the project didn't permit me to implement. I learned from this portion of the project that classifiers trained on specific target sizes don't have much flexibility when classifying targets a few standard deviations outside of the classifier's target size. I have read in various papers that HOG descriptors for smaller target sizes can be scaled up to the original classifier scale because no geometric information is lost in upsampling. For downsampling this is not the case, as geometric information is lost in a downsampled image (through omitting pixel information in the smaller image).

The independent pieces of the project worked very nicely in their own "stovepipes"; however, I ran into problems when trying to connect the two together. Using a more progressive, integrated testing style would benefit a project like this in the future. Also, using more testing and training data from the source of the classified objects would be helpful.
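As a concrete illustration of the cross-validation suggested above, one n-fold scheme might look like the following sketch, reusing the hypothetical classifier and metrics helpers from earlier sections.

```python
import numpy as np

def kfold_rates(X, y, n_folds=5, seed=0):
    """n-fold cross-validation sketch: rotate which fold is held out
    for testing and average the detection / false alarm rates.
    Builds on the GaussianNaiveBayes and detection_rates sketches."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, n_folds)
    rates = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        clf = GaussianNaiveBayes().fit(X[train], y[train])
        rates.append(detection_rates(y[test], clf.predict(X[test])))
    return np.mean(rates, axis=0)   # (P(detection), P(false alarm))
```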
CONCLUSION
In summary, the most distinct finding of my work is that a summarized version of the original histogram of oriented gradients descriptor, using the maximum value of each local cell histogram, outperforms the original HOG descriptor. I can also conclude that a multi-scale detector (or other measures to address target scale) is necessary and will significantly affect the 'real-time' performance of a classifier. The classifier also benefits from being trained on input image sources with characteristics (e.g. resolution, noise) similar to those of the target images to be classified.

ACKNOWLEDGMENTS
The work described in this paper was conducted as part of a Fall 2011 Artificial Intelligence course, taught in the Computer Science department of the University of Massachusetts Lowell by Prof. Fred Martin. I'd like to thank Fred for his enthusiasm about my project, and Dr. Mark Keck for tirelessly answering my research-related questions and advising me.

REFERENCES
1. Background Subtraction in C++: http://iss.bu.edu/data/jkonrad/reports/HWYY1006buece.pdf
2. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886-893, 2005.
3. Wei-Lwun Lu and J.J. Little. Simultaneous Tracking and Action Recognition using the PCA-HOG Descriptor. In Proceedings of the 3rd Canadian Conference on Computer and Robot Vision (CRV), 2006.
4. Principal Component Analysis: http://en.wikipedia.org/wiki/Principal_component_analysis