Smart Surveillance: Tracking and Classification of Moving
Pedestrians
Brigit A. Schroeder
Department of Computer Science
University of Massachusetts, Lowell
1 University Drive, Lowell, MA 01854
bschroed@cs.uml.edu
ABSTRACT
In this paper, a system for pedestrian tracking and
classification is introduced, using histograms of oriented
gradients (HOG) as feature descriptors and Naïve Bayes for
classification. Tracking and target segmentation are
performed using a basic blob tracker which relies largely
upon frame differencing to perform motion detection. The
work in this paper shows that pedestrians can potentially
be classified using a summarized version of the original
HOG descriptor, with an increased detection rate and a
reduced false alarm rate. The classifier in this paper was
trained on well-known and established pedestrian
datasets from INRIA and MIT. The classification results
reveal the need for multi-scale target classification and for
similarity between the training and test data.
Author Keywords
Histogram of Oriented Gradients, Pedestrian Tracker, Blob
Tracker, Naïve Bayes Classifier, Artificial Intelligence.
INTRODUCTION
For this project, I used computer vision-based object
tracking and probabilistic recognition methods to
automatically identify pedestrians in fixed surveillance
video. Tracking moving objects in surveillance video
provides a way to nominate and segment out targets of
potential interest, and adding a layer of recognition to
identify targets adds degrees of intelligence and automation
to an ordinary surveillance system. To build the most
robust system, the challenges of being able to detect and
recognize targets at varying scale (determined by target
distance from camera) and illumination need to be
overcome.
Figure 1. Overview of pedestrian tracking and classification system.
PROJECT DESCRIPTION
The problem presented in this paper can be broken into
two individual parts, (1) target tracking and (2) target
classification, which need to be knit together well to
achieve a smarter surveillance system.
Target Tracking
Tracking can be achieved using a computer vision-based
approach of blob tracking, which relies heavily upon
detecting change in the scene frame-to-frame (e.g. frame
differencing) and characterizing each region of change as a
moving target (e.g. define a blob via connected components
labeling).
Figure 2. Example of initial image processing steps of Blob
Tracker: frame differencing to create “blobs”. Red boxes are
detected targets used for target segmentation in original
surveillance video.
Difference images are simply the absolute difference of
pixel values between two successive frames of video which
are thresholded and converted to binary images (pixels
either have a value of 0 or 1). The binary images
differentiate foreground from background pixels, where the
foreground pixels (value = 1) are presumed to be potential
moving targets. The binary images undergo image dilation
to enlarge the boundaries and fill holes of regions. The
resulting image (seen in Figure 2 above) is then “regionalized”
into individual blobs using connected components labeling.
The connected components algorithm treats the binary
image as a graph and checks each pixel for 4- or 8-way
connectivity of neighboring pixels with the same value in
order to group them, which can be done in one or multiple
passes (over the image).
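As a concrete illustration of these steps, here is a minimal
sketch assuming OpenCV; the threshold value, dilation amount,
and connectivity are illustrative choices, not the project's
actual settings:

import cv2

def detect_blobs(prev_frame, frame, thresh=25):
    # absolute difference of successive frames (frame differencing)
    diff = cv2.absdiff(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    # threshold into a binary image: foreground pixels = potential movers
    _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    # dilation enlarges region boundaries and fills small holes
    binary = cv2.dilate(binary, None, iterations=2)
    # connected components labeling (8-way) groups pixels into blobs
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(
        binary, connectivity=8)
    # each stats row (skipping background label 0) holds a blob's bounding box
    return [tuple(stats[i, :4]) for i in range(1, n)]  # (x, y, w, h)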
The method presented here is fairly rudimentary for a
tracker, as it is intended primarily to nominate targets rather
than to track them robustly. Once a target region has been
identified, it can be segmented out of the video frame and
passed to the classification system. I experimented with a
variation of the blob tracker which used a combination of
building a simple background model and then differencing
new frames against the model. The background model I
tested was a moving average of the incoming frames (a very
simplified approach to background modeling), which did a
nice job of removing much of the video image noise and
making pedestrians in tight groups more distinct. The model
worked very well for scenes with mostly pedestrians and
low-contrast vehicles, but light vehicles would leave a
ghosted footprint that caused a large number of false
detections. Applying a single Gaussian background model,
which calculates a new mean and standard deviation every
frame with which the model is updated (as described in
[1]), would probably facilitate cleaner, more robust target
segmentation in the future.
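A minimal sketch of the moving-average variant, again
assuming OpenCV; the update rate alpha is an illustrative
choice:

import cv2
import numpy as np

class MovingAverageBackground:
    """Running-average background model; new frames are differenced
    against the model rather than against the previous frame."""
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.model = None

    def apply(self, frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if self.model is None:
            self.model = gray.copy()
        # model = (1 - alpha) * model + alpha * gray
        cv2.accumulateWeighted(gray, self.model, self.alpha)
        return cv2.absdiff(gray, self.model)  # difference frame vs. model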
Figure 3. Example of final output of the pedestrian tracker, which
extracted 64x128 targets for classification. This worked well for visible
and somewhat occluded targets, but also generated some false alarms,
seen in the street. The tracker was intentionally tuned to allow some
false alarms for classification purposes.
Feature Extraction
The first stage involves extracting a set of distinctive
features from a large set of representative training data (e.g.
a collection of pedestrian image chips or background
non-pedestrian data) into feature vectors. Histogram of oriented
gradients (HOG) descriptors have been shown by Dalal and
Triggs [2] to perform well in pedestrian identification, as
they robustly define the local geometry and shape of each
target. Histograms of oriented gradients are created by
calculating the gradient magnitude (1) and orientation (2) of
each pixel’s intensity:

m(x, y) = sqrt(g_x(x, y)^2 + g_y(x, y)^2)      (1)

θ(x, y) = arctan(g_y(x, y) / g_x(x, y))        (2)

where g_x and g_y are the horizontal and vertical intensity
gradients.
The gradient magnitude was binned based upon its
unsigned orientation (0-180 degrees) and then the
histogram was L2-normalized. The local normalization is
applied to make each feature illumination invariant. Each
bin in the histogram represents the amount of edge strength
(e.g. gradient) over a range of orientations. Stringing
together a series of normalized histograms, one for each
gridded sub-region of the target image, forms a HOG
descriptor for that image, as seen in Figure 4:
Figure 4. Example of how the HOG descriptor is formed: a histogram
of oriented gradients is created for each gridded sub-region of the
image using the gradient magnitude (the gridded images are gradient
images) and its respective orientation. Note that as the number of cells
in the grid increases, the length of the HOG descriptor increases.
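To make the per-cell computation concrete, here is a minimal
numpy sketch of one cell's 9-bin histogram following equations
(1) and (2); the gradient operator and per-cell normalization
are simplifications of the block-level scheme described below:

import numpy as np

def cell_histogram(cell, nbins=9):
    gx = np.gradient(cell.astype(np.float32), axis=1)  # horizontal gradient
    gy = np.gradient(cell.astype(np.float32), axis=0)  # vertical gradient
    mag = np.sqrt(gx**2 + gy**2)                       # equation (1)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0       # unsigned orientation (2)
    # bin gradient magnitude by unsigned orientation, 0-180 degrees
    hist, _ = np.histogram(ang, bins=nbins, range=(0, 180), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-6)        # L2 normalization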
Target Classification
Before target classification can occur, a Naïve Bayes
classifier for each class (pedestrian, background) must be
trained and tested offline using features extracted from
labeled class images. The general train-test flow for target
classification is as follows:
Figure 5. Train-test pipeline for classification.
More specifically, each 64x128 pixel image chip in this
project was sub-divided into an 8x16 grid of 8x8-pixel
cells. The cells were then grouped into blocks of 2x2 cells
(four cells per block) with overlapping block regions. Each
cell is represented as a 9-bin histogram of oriented
gradients, and each block is represented by a 4x9 = 36-element
vector of local histograms which are L2-normalized. The
final feature vector for a single image chip is quite large:
with 7 x 15 overlapping block positions, it is a
9 x 4 x 7 x 15 = 3,780-element feature vector, formally
known as a “HOG descriptor.” The large feature vector size
can potentially be a challenge in terms of system processing
speed and memory, depending upon the platform and
optimizations used.
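For reference, OpenCV's default HOGDescriptor uses this same
geometry (64x128 window, 16x16 blocks, 8x8 block stride, 8x8
cells, 9 bins), so a sketch like the following reproduces the
3,780-element descriptor; the blank chip is a placeholder:

import cv2
import numpy as np

hog = cv2.HOGDescriptor()  # defaults match the layout described above

chip = np.zeros((128, 64), dtype=np.uint8)  # placeholder 64x128 image chip
descriptor = hog.compute(chip)              # shape (3780, 1)
print(descriptor.size)                      # 9 * 4 * 7 * 15 = 3780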
In an effort to reduce the dimensionality of the large feature
vector, I experimented with summarizing each local 9-bin
histogram (pre-normalization) by its local maximum, mode,
and mode frequency (number of occurrences of the mode).
The general thought behind this is that these summarized
characteristics would represent the dominant local
orientation of a given feature (covered by a cell region).
This reduced the feature vector size to one third of the
original, or 1,260 elements.
Each histogram summarization (maximum/mode/mode
frequency) could by itself be considered a 420-element
feature vector. During the process of training and testing
feature sets, it is not unusual to test different combinations
of feature models to find the one that provides the best
target classification. In the end, I extracted the full HOG
descriptors for each image chip, in addition to the
“summarized” HOG descriptors for histogram
maximum/mode/mode frequency, for training and testing
(described later in this paper).
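A minimal sketch of the max-value summarization; that the
3,780-element descriptor is laid out as 420 consecutive 9-bin
histograms is an assumption about the implementation:

import numpy as np

def summarize_hog_max(descriptor, nbins=9):
    # 3,780 elements -> 420 histograms of 9 bins each
    hists = np.asarray(descriptor).reshape(-1, nbins)
    # one summary value per histogram: the dominant bin's magnitude
    return hists.max(axis=1)  # 420-element summarized descriptor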
There are many published implementations of HOG
descriptors in which the feature extraction has been
simplified by using non-overlapping (i.e. regular) blocks of
image-chip cells, or in which dimensionality reduction is
introduced via principal component analysis (PCA) [3, 4];
the latter is formally called PCA-HOG.
The mean μ and standard deviation σ for each element F of
the extracted feature vector v are calculated assuming a
normal (Gaussian) probability distribution over a large set
of class features, and these parameters make up the
classifier. In the testing phase, each class image sample is
then classified by extracting the same HOG features and
using the Naïve Bayes classifier to calculate the posterior
probability (or likelihood) of each class type. Bayes’
theorem is used to calculate the posterior probability of
each class type for a given sample based upon the extracted
features F1,…,Fn:

p(C = class n | F1,…,Fn) = p(C = class n) · p(F1,…,Fn | C = class n) / p(F1,…,Fn)
where p(C = class n) is the class prior, p(F1,…,Fn| C =
class n) are the class feature probabilities, and p(F1,…,Fn)
are the class feature marginal probabilities. The marginal
probabilities are often not used in calculating the posterior
probability because they do not depend on the class C. The
class feature probabilities are first calculated using the
classifier data from the training phase, mean mu and
standard deviation sigma, and the class feature value v for
each extracted test features:
This assumes a continuous probability distribution function.
The class with the maximum likelihood is then chosen as
the target classification:

classify(F1,…,Fn) = argmax_c [ p(C = c) · Π_i p(F_i = v_i | C = c) ]
Training and Testing
The training stage uses the set of feature vectors extracted
in the previous step to train a Naïve Bayes classifier for
each class type, in this case pedestrian and non-pedestrian
(e.g. background). A Naïve Bayes classifier is a simple
probabilistic classifier which conveniently (or “naively”)
assumes conditional independence between all the features
in a class. The training stage involves estimating the model
parameters for each class C, which include the class priors
and probability distributions. The prior for each class can be
estimated based upon the percentage of class samples in the
training set:

prior(C = class n) = (number of class n samples) / (total number of training samples)
For the two-class classification system in this project, two
posterior probabilities, P(pedestrian | F1,…,Fn) and
P(background | F1,…,Fn), were calculated for each
extracted surveillance video target. The target classification
is made by choosing the class with the maximum
probability.
For this project, I attempted to write my own
implementation of the Naïve Bayes classifier (see project
source code). The training portion of the classifier was
successful, but the testing portion ran into numeric
overflow issues. This problem was due to the likelihoods
over the large HOG descriptor features (both regular and
summarized) growing too large multiplicatively. The
implementation also suffered from a zero-probability
problem, where any zero or very small feature probabilities
would drive the class posterior to zero or possibly introduce
too much noise. Though Laplace smoothing would address
the zero-probability issue, it would not address the
numerical overflow. As such, I used the Matlab Naïve
Bayes implementation, which uses log-likelihoods (instead
of raw likelihoods) and fixed the numerical stability issues.
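A minimal numpy sketch of the log-likelihood approach
(names and the small epsilon floor are illustrative): summing
Gaussian log-densities avoids the overflow and underflow of
multiplying thousands of raw likelihoods.

import numpy as np

def train(features):                        # features: (n_samples, n_dims)
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-9     # floor avoids division by zero
    return mu, sigma

def log_posterior(x, mu, sigma, log_prior):
    # per-dimension Gaussian log-density, summed instead of multiplied
    log_lik = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return log_prior + log_lik.sum()

# classification: pick the class with the larger log-posterior, e.g.
# label = max(classes, key=lambda c: log_posterior(x, mu[c], sigma[c], log_prior[c]))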
Data Set
The image datasets that were evaluated and used for training
and testing the target classifiers are from the INRIA dataset
repository, a well-known source of large annotated datasets
of various kinds (e.g. pedestrians, animals, cars, etc.), and
also the MIT Pedestrian Dataset from the MIT Center for
Biological and Computational Learning. The datasets
contain image chips of pedestrians in varying poses and
sizes, in both crowded and uncrowded scenarios. All of the
chips in a given set are of uniform size.
The non-pedestrian examples were randomly sub-sampled
from the full-sized negative source images provided with the
training sets, at a rate of 3 samples per frame to avoid target
redundancy. These images were typically urban street and
sidewalk scenes with few or no people in them. The process
of randomly sampling negative targets was used in the
original Dalal and Triggs paper [2]. The datasets were
then filtered by hand to remove any human targets that may
have been extracted.
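A minimal sketch of this random sub-sampling, assuming
numpy-style images larger than the 64x128 chip size:

import random

def sample_negatives(image, n=3, w=64, h=128):
    """Draw n random 64x128 crops from a negative source image."""
    ih, iw = image.shape[:2]
    crops = []
    for _ in range(n):
        x = random.randint(0, iw - w)
        y = random.randint(0, ih - h)
        crops.append(image[y:y+h, x:x+w])
    return crops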
Figure 6. Example of pedestrian classifier training data from the
INRIA collection.
For classification, it was extremely important that the
content and scale of the training images be reflected in the
images to be tested, i.e. the targets extracted from
surveillance video. For example, if a classifier is trained on
targets of a single scale, targets of varying scale will have a
higher likelihood of misclassification. The afore-mentioned
datasets are very large, and a cross-section of image chips
matching the likely surveillance scenarios needs to be
extracted.
The video that was analyzed was collected with a
surveillance camera viewing a ‘typical’ surveillance scene:
an urban scene with pedestrians and non-pedestrian moving
targets such as cars.
Some of the challenges I found while using the various
datasets and surveillance video were related to scale, image
context, and sensor noise. The classifier was trained on a set
of 64x128 pixel targets with pedestrians of varying poses
and sizes. The targets extracted from the surveillance video
were of varying size as well, but the classifier seemed to
have trouble classifying segmented targets that were not
within a certain scale range of the training chips or that
were off-center in the image chip. Targets extracted from
surveillance video with many “confusing” features in the
background, such as bright crosswalks, seemed to throw the
classifier off as well. Another issue was the large amount of
sensor noise in the low-resolution surveillance video, which
may have been picked up when calculating the image
gradients for the HOG descriptor. The image datasets
appeared to be taken with cameras with higher-resolution
CCDs, while the surveillance video was generally very
grainy. A Gaussian low-pass filter would help in reducing
the image noise; however, in case studies done by Dalal
and Triggs [2], they found a HOG descriptor-based
classifier worked best without pre-processing with a
Gaussian filter.
Dataset           Positive Samples   Negative Samples
MIT Pedestrian    500                5000
INRIA Person      500                1491

Table 1. Datasets for training and testing: positive samples
were 64x128 image chips of pedestrians in various poses.
Negative samples of the same size were randomly sub-sampled
from urban traffic scenes.
ANALYSIS OF RESULTS
The performance of classification can be tested against a
labeled set of data containing a mixture of known class
types, where the probability of proper detection and the false
alarm rate for each classifier can be calculated from a
confusion matrix:

              classified p       classified n
actual p      true positives     false negatives
actual n      false positives    true negatives
In this confusion matrix, p corresponds to positive
examples and n corresponds to negative examples, which
would be pedestrian and non-pedestrian examples for this
project. The numbers of primary interest are the numbers of
true positives and false positives (also called false alarms),
which can be used to calculate the pedestrian classifier
probability of detection and false alarm rates:
P(true detection) = # true positives / (# true positives + # false negatives)

P(false alarm) = # false positives / (# false positives + # true negatives)
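In code, the two rates are a direct transcription of the
formulas above:

def detection_rate(tp, fn):
    return tp / (tp + fn)     # fraction of actual positives detected

def false_alarm_rate(fp, tn):
    return fp / (fp + tn)     # fraction of actual negatives misclassified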
For the training and testing of the MIT Pedestrian and
INRIA Person datasets, the datasets were combined for
each class and then randomized. The pedestrian and
non-pedestrian classifiers were trained on 80% of the data set
and tested on the remaining 20%. In the original Dalal and
Triggs paper [2], approximately 10% of the training
examples were positive and the remaining 90% negative. In
my training and testing set, approximately 15% are positive
and the remainder negative. The large bias in the dataset
toward the non-pedestrian class allows for better
generalization of background targets, which could be any
number of objects (e.g. cars, posts, signs, street, etc.).
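A minimal sketch of the randomized 80/20 split; the seed and
array layout are illustrative:

import numpy as np

def split_80_20(features, labels, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))    # randomize the combined set
    cut = int(0.8 * len(features))          # 80% train, 20% test
    train_idx, test_idx = idx[:cut], idx[cut:]
    return (features[train_idx], labels[train_idx],
            features[test_idx], labels[test_idx])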
Initial tests with full HOG descriptors and summarized
HOG descriptors (using maximum value/mode/mode
frequency) showed that the summarized HOG descriptor
generally had better performance. Further refinement
testing revealed that using the maximum value of the
histograms alone performed best, compared with various
combinations of the three (or the other two by themselves).
The results presented below show that the 420 element
summarized HOG descriptor performed better than the
3,780 element HOG descriptor with a higher true positive
rate and lower false alarm rate for the MIT and INRIA
datasets. The results in Table 2 show the range of values
resulting after several rounds of testing to establish a
baseline. In general, the performance was much better than
initially anticipated:
Feature Type                  True Positive   False Alarm
HOG                           76-80%          4-5%
Summarized HOG (Max Value)    91-94%          1.5-2%

Table 2. Summary of classification rates for the regular HOG
descriptor vs. the summarized HOG descriptor (using the max of each
histogram) on the MIT and INRIA datasets (“ideal data”). Classifiers
were trained on 80% of the randomized data and tested on the
remaining 20%, also randomized, for each class type.
The results in Table 3 show classification rates for target
chips extracted from the surveillance video processed by the
tracker. The classification rates were not as good, due to
issues described previously in the Data Set section (e.g.
target scale, noise, etc.), but the summarized HOG
descriptor still showed a consistent 15% performance
increase over the standard HOG descriptor. The caveat with
these results is that the tracker chip set was much smaller,
due to the limited number of pedestrians and false alarms
that could be extracted from a few videos.
Feature Type                  True Positive   False Alarm
HOG                           0%              50%
Summarized HOG (Max Value)    14%             43%

Table 3. Summary of classification rates for the regular HOG
descriptor vs. the summarized HOG descriptor (using the max of each
histogram) on a random mixture of targets extracted by the object
tracker. Classifiers were trained on 80% of the randomized data from
the MIT and INRIA datasets.
The object tracker generally performed exactly as it should,
consistently tracking moving objects and noise in the scene.
There were false detections (e.g. non-moving objects) in
areas of the scene where there was consistent image noise,
which was good for testing purposes.
DISCUSSION
The analysis of classification would have benefitted from
using n-fold cross-validation (e.g. “leave one out” testing)
to get a better cross-section of performance, and also from
training on one specific dataset and then testing on the other
(e.g. train on the INRIA dataset, test on the MIT dataset)
rather than mixing them together. This was suggested to me
by a co-worker who has a PhD in computer vision.
In general, most pedestrians extracted by the object tracker
were classified as background, except where pedestrians
were clearly defined against a flat, uncluttered background
(e.g. no road stripes) and centered in the target chip. True
background objects were almost never classified as
pedestrians. This points to a need for a multi-scale detector,
which the time scope of the project didn’t permit me to
implement. I learned from this portion of the project that
classifiers trained on specific target sizes don’t have a lot of
flexibility when classifying something that is a few standard
deviations outside of the classifier’s target size. I have read
in various papers that HOG descriptors for smaller target
sizes can be scaled up to the original classifier scale because
no geometric information is lost in upsampling. For
downsampling, this is not the case, as geometric information
is lost in a downsampled image (pixel information is
omitted in the smaller image).
The independent pieces of the project worked very nicely in
their own “stovepipes”; however, I ran into problems when
trying to connect the two together. Using a more
progressive, integrated testing style would benefit a project
like this in the future. Also, using more testing and training
data from the source of the classified objects would be
helpful.
CONCLUSION
In summary, the most distinct part of my work was the
finding that a summarized version of the original histogram
of oriented gradients descriptor, using the maximum value
of each local cell histogram, outperforms the original HOG
descriptor. I can also conclude that a multi-scale detector (or
other measures to address target scale) is necessary and will
significantly affect the ‘real time’ performance of a
classifier. The classifier also benefits from being trained on
input image sources that have similar characteristics (e.g.
resolution, noise) to the target images that will be classified.
ACKNOWLEDGMENTS
The work described in this paper was conducted as part of a
Fall 2011 Artificial Intelligence course, taught in the
Computer Science department of the University of
Massachusetts Lowell by Prof. Fred Martin. I’d like to thank
Fred for his enthusiasm about my project, and Dr. Mark
Keck for tirelessly answering and advising me on my
research-related questions.
REFERENCES
1. Background Subtraction in C++:
http://iss.bu.edu/data/jkonrad/reports/HWYY1006buece.pdf
2. N. Dalal and B. Triggs. Histograms of Oriented Gradients
for Human Detection. IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR),
vol. 1, pages 886-893, 2005.
3. Wei-Lwun Lu and J.J. Little. Simultaneous Tracking and
Action Recognition using the PCA-HOG Descriptor. In
Proceedings of the 3rd Canadian Conference on Computer
and Robot Vision (CRV), pp. 6-6, 2006.
4. Principal Component Analysis:
http://en.wikipedia.org/wiki/Principal_component_analysis