An Assessment of the Video Analytics Technology Gap for Transportation Facilities

Jason Thornton, Jeanette Baran-Gale, Aaron Yahr
MIT Lincoln Laboratory, 244 Wood St., Lexington, MA 02420
Abstract
We conduct an assessment of existing video analytic technology as applied to critical infrastructure protection, particularly in the transportation sector. Based on discussions with security personnel at multiple facilities, we assemble a list of desired video analytics functionality, which we group into five categories: low-level activity detection, high-level behavior detection, discrimination, tracking, and content retrieval. We then evaluate the capabilities and deficiencies of current technology to meet these needs, in part by applying representative video analytic tools to a testbed of video data from multiple sources. As part of the evaluation, we examine performance across pixel resolution and degrees of scene clutter. Finally, we identify directions for promising technology development in order to address critical gaps.¹
1. Introduction
In support of security operations, critical infrastructure facilities often deploy comprehensive installations of surveillance video cameras. These surveillance systems serve two functions: they provide a visual record of events for the investigation of past incidents, and they allow security personnel to perform real-time monitoring of their facilities more effectively. The visual cues provided by such systems may be used to detect suspicious or otherwise noteworthy behavior, including some criminal and terrorist acts. Due to the decreasing costs of digital video cameras, digital video recorders, and network capabilities, many facilities have been increasing their investments in this technology (and therefore, their surveillance coverage) [1].

However, it is difficult for humans to monitor multiple video feeds simultaneously, or to inspect large volumes of archived data. This is the motivation for video analytic (or intelligent video) technology, which encompasses a set of techniques for the automatic interpretation of video data. Such tools can serve as aids for security personnel, directing their attention to the most relevant video.

Many commercially available tools offer a standard set of video interpretation capabilities and are meant to work out of the box, with little or no tuning required by the end user [2]. Some critical infrastructure facilities have already deployed such tools within limited capacities, especially for perimeter monitoring and archived content retrieval [3]. However, the performance of these systems is often problematic, due to the difficulties inherent in video analysis. Two of the most common modes of failure are false negatives and false positives. In the first case, the system does not detect an activity-of-interest because it is too complex or subtle for the analytic algorithms to reliably distinguish it from normal background activity. In the second case, the system claims to detect an activity that is not exhibited in the processed video, issuing a false alarm; when frequent enough, these alarms become a distraction to security personnel rather than an aid. In addition, some critical infrastructure facilities have specialized analytic needs which are not well addressed by the commercial market.

In this work, we conduct an assessment of existing video analytic technology as applied to critical infrastructure protection, based partially on experiments using a video data testbed. As part of this assessment, we are surveying the video interpretation needs of different types of critical infrastructure facilities, as well as their past or present experiences with video analytic technology. We are then applying representative commercial and open-source tools to sample video datasets, in order to evaluate the ability of current technology to meet high-priority critical infrastructure needs. Finally, we are identifying critical gaps which might benefit from near-term research and development within the next several years.

We note that our initial focus is on algorithms and information processing, and not on camera hardware or network configuration issues. There have been several recent open evaluations of video interpretation algorithms, including the Performance Evaluation of Tracking and Surveillance (PETS) workshops [4] and the NIST TREC Video Retrieval Evaluation series [5]. To the extent that these evaluations address the capabilities we wish to assess, we take into account their findings and, in the case of PETS, incorporate some of their data into our experiments.
¹ This work was sponsored by the Department of the Air Force under Air Force Contract # FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
2. Desired Capabilities
Before conducting our assessment, we collected
information about video analytic needs from security
representatives at airports, subway stations, train/bus
stations, national monuments, secure office buildings,
and military bases. This information includes desired
functionality (irrespective of technical limitations), relative priorities for these needs, and some input about
operational restrictions. Based on these discussions,
we assembled a list of desired capabilities, which we
group into the following five categories:
• Low-level activity detection: This category includes detection capabilities for specific actions with relatively well-defined descriptions. These activities occur over short time spans ranging from seconds to minutes, and judgment of their occurrence is generally consistent across human observers. Examples include fence-climbing, window-breaking, and item abandonment.
• High-level activity detection: High-level activities are more ambiguously defined than low-level activities, indicated more by patterns of behavior than by specific physical actions. In fact, it is not always clear to human observers what comprises an instance of these activity types. Examples include pre-operational site casing and surveillance and individuals exhibiting abnormally high stress levels.
• Object discrimination: This category includes any sort of discrimination between scene components, including humans, vehicles, animals, and inanimate objects. Since many surveillance tasks revolve around human activity, one of the most useful forms of discrimination is between human motion and non-human motion.
• Coordinated tracking: This category includes prolonged tracking of persons or vehicles, possibly across multiple camera views within the same facility. Some desired capabilities require tracking a person or vehicle backwards in time in order to bring up all video depictions. Other capabilities require tracing an object's path on a common geo-spatial reference frame shared by all cameras (such as a facility floor plan).
• Video content retrieval: Video retrieval capabilities allow security personnel to search through archived footage efficiently. For example, security may wish to execute semantic queries for particular types of content, such as all vehicles appearing in a specified region within the last several days. Such tools must accommodate searching over large video datasets without heavy computational loads (i.e., without reprocessing all archived video).

For the most part, these need categories require different video interpretation techniques, although there are some common sub-components between the categories. For instance, the detection of tailgating may rely upon the ability to track individuals over relatively short time periods. We note that we exclude two more specialized categories of video analytic capabilities from our assessment: video-based biometrics and license plate recognition. Though we do not discuss them here, both of these are active areas of research and development [6], [7], [8].
3. Assessment Methodology
To aid with our assessment of existing video analytic
technology, we are acquiring testbed data from several
sources, described below:
• The first source is publicly available datasets from PETS 2004, 2006, and 2007, which contain several types of staged activities (e.g., abandoned items, loitering, fighting, falling) among backgrounds of crowds in public areas. This video was captured at 30 frames per second and stored using the MPEG4 compression format.
• The second source is an internal dataset collected by clusters of high-resolution cameras at Lincoln Laboratory overlooking a parking lot and several building exteriors. We have video data collected on two different days (one overcast day with snow coverage, and one mild day with no snow coverage). This dataset also includes several staged activities (such as fence-climbing and casing), along with normal background activity depicting person/vehicle motion and interactions. This video was captured at 4 frames per second and stored using the MPEG4 compression format.
• The final source is external data collected at one or more representative critical infrastructure facilities. Collection of this data is ongoing, and the experimental results plots presented in this paper do not incorporate this data source.

Combined, the datasets we use contain a variety of indoor and outdoor environments, with variable camera positions and a range of scene clutter and occlusion effects.

In order to guide our assessment, we applied examples of commercial off-the-shelf and open-source video analytic software to our testbed data. Most of the open-source tools were taken from OpenCV [9], a library of computer vision utilities. We note that we cannot conduct empirical tests for every capability need on our list, since we are limited both by the content of the testbed data and the functionality available in our tools. We use a few tools as case studies which, when combined with the context provided by video interpretation experts and feedback from security personnel, provide some useful indications about the state of the technology. During experimentation, we vary properties of the video data such as frame rate and pixel resolution, in order to gauge the effect on video analytic performance.
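To illustrate how such variation can be produced, the sketch below degrades a clip offline with OpenCV before it is fed to each tool. This is our own illustration rather than part of any evaluated toolkit; the file paths, scale factor, and frame-skip value are hypothetical choices.

    import cv2

    def degrade_clip(src_path, dst_path, scale=0.5, keep_every=7):
        """Write a copy of a clip at reduced resolution and frame rate."""
        cap = cv2.VideoCapture(src_path)
        fps = cap.get(cv2.CAP_PROP_FPS) / keep_every  # e.g., ~30 fps -> ~4 fps
        ok, frame = cap.read()                        # assumes a readable file
        h, w = int(frame.shape[0] * scale), int(frame.shape[1] * scale)
        out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                              fps, (w, h))
        idx = 0
        while ok:
            if idx % keep_every == 0:                 # drop frames to lower the rate
                out.write(cv2.resize(frame, (w, h)))  # shrink to lower pixels-on-target
            ok, frame = cap.read()
            idx += 1
        cap.release()
        out.release()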
4. Low-Level Activity Detection

There is a wide range of specific activities relevant to critical infrastructure security operations. Activities of interest include abandoned items, loitering, perimeter-crossing, fence-climbing, window-breaking, baggage theft, tailgating through restricted access points, falling/collapsing, and physical altercations. Commercial video analytics tools address the detection of some of these directly, incorporating them into the syntax made available to the end user when defining alarm-generation rules. The first three capabilities, in particular, are commonly included in commercial packages. We discuss empirical performance results for one of these sample capabilities below.

4.1. Abandoned items

Item abandonment typically refers to situations in which pedestrians leave behind hand-carried items (e.g., bags, packages, or suitcases) in public spaces. The video analytic objective is to form an association between pedestrians and the items in their possession, and then raise an alarm when a pedestrian moves away
from the item for a sustained period of time (defined by some spatial and temporal criteria encoded in the analytic rule) [10].
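For concreteness, the rule just described might look like the following sketch. The tracker interface, distance threshold, and dwell time are our own hypothetical choices, not the logic of any evaluated product.

    import math

    # Hypothetical spatial/temporal criteria for an abandonment rule.
    DIST_THRESH_M = 5.0    # alarm if the owner moves farther than this from the item
    DWELL_THRESH_S = 30.0  # ...for at least this long

    def abandoned_item_alarm(owner_track, item_track, fps):
        """owner_track, item_track: lists of (x, y) ground-plane positions per frame.
        Returns the frame index at which an alarm would fire, or None."""
        away_frames = 0
        for i, (own, itm) in enumerate(zip(owner_track, item_track)):
            dist = math.hypot(own[0] - itm[0], own[1] - itm[1])
            if dist > DIST_THRESH_M:
                away_frames += 1             # owner is away; accumulate dwell time
                if away_frames / fps >= DWELL_THRESH_S:
                    return i                 # sustained separation -> raise alarm
            else:
                away_frames = 0              # owner returned; reset the timer
        return None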
For our quantitative evaluation, we used 43 video
sequences depicting abandoned item events, along with
background data depicting no abandoned item events.
This data was recorded in five different environments
and contains some video of the same events captured
at different camera angles or post-processed to achieve
different frame resolutions. We applied a representative
COTS video analytics toolkit to process this data, using
built-in functionality to attempt detection of abandoned
item events. As in all of our empirical trials, we left
any configurable algorithm parameters at their default
values. Any system alarm corresponding to an actual
left item event was considered a successful detection,
while any alarm that did not was considered a false
positive. In some cases, the system issued multiple
alarms in rapid succession, apparently in response to
the same activity; we combined these instances into a
single alarm when computing performance statistics.
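For reference, this merging step can be expressed as a simple time-window rule. The sketch below is our own, and the one-minute window is an illustrative assumption rather than the exact criterion we applied.

    def merge_alarms(alarm_times_s, window_s=60.0):
        """Collapse alarms that fire within window_s of the previous kept alarm."""
        merged = []
        for t in sorted(alarm_times_s):
            if not merged or t - merged[-1] > window_s:
                merged.append(t)
        return merged

    # e.g., three alarms in rapid succession count as one event:
    assert merge_alarms([10.0, 12.5, 40.0, 200.0]) == [10.0, 200.0]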
Across all processed video, the probability of detection of real abandoned item events was 67%. To some degree, the performance is a function of data characteristics such as scene crowdedness and pixel resolution, which vary in our dataset. To address these
factors, we ascribe to each video clip a category for
clutter (i.e., the average number of movers in the scene)
and pixels-on-item (i.e., the average number of pixels
depicting the abandoned item). Figure 1 shows the
detector performance across two environment types,
crowded indoor and sparse indoor, for a fixed range
of pixels-on-item. The sparse scenes have approximately 0-10 movers in a typical frame, while the
crowded scenes have approximately 10-30. In this case,
the scenes with less clutter allow for a reduction in
false positives and a significant increase in detections.
In contrast, Figure 2 plots the performance statistics
across pixels-on-item for sparse indoor settings. Note
that probability of detection decreases with resolution,
although the 25-100 pixel range still gives sufficient
cues for 50% detection in the sparse environment.
We observed the following common causes of difficulty for this task:
• False positive cause: Shifting heads or limbs that suddenly come into view are mistaken for abandoned items. In these cases, the skin-colored group of pixels stands out from clothing or scene background as a newly introduced blob.
• False positive cause: Stationary humans, or slow-moving humans who are distant from the camera, are labeled as abandoned objects.
• False positive cause: After establishing an association between person and object, they are disassociated when the person becomes partially occluded, leading to an incorrect abandoned item alarm.
• False negative cause: Motion in front of an abandoned item causes frequent occlusions, preventing the system from recognizing a static object.

Based on these observations, we believe the following would be useful objectives for future algorithm development:
• Incorporation of stronger, more articulated human body models: A model that correctly fits pixels to a detailed human form, including the limbs and head, would be less likely to confuse stationary humans or human segments with inanimate objects.
• Three-dimensional scene reconstructions: Partial occlusion is a frequent obstacle for abandoned item detection. Computing a three-dimensional representation of the scene using multiple overlapping camera views (if available) can help recover some object positions otherwise lost in two-dimensional image plane projections.
[Figure 1. Abandoned item detection performance by moving clutter category: detection rate and false alarms per hour for sparse indoor vs. crowded indoor scenes (100-400 pixels on item).]

[Figure 2. Abandoned item detection performance by pixels-on-item range: detection rate and false alarms per hour in sparse indoor environments for the 100-400, 50-200, and 25-100 pixel ranges.]
4.2. Discussion
Current video analytic technology demonstrates
some success detecting low-level activities, as long as
there is a sufficient number of pixels-on-target and low
clutter in the form of movers. The performance of these
capabilities tends to be a function of both the complexity of interactions which define the activities and the
duration of the activities. Simple-interaction activities
such as perimeter-crossing or falling/collapsing require
a less detailed (and therefore easier) video interpretation than activities such as tailgating, item abandonment, or window-breaking, which require parsing interactions between multiple persons or objects.
In addition, actions that occur over relatively short
time spans facilitate detection more than actions that
occur over extended time spans, which have more
opportunity for algorithm breakdown. For instance,
we would expect five-minute loitering rules to miss
detections more often than 30-second loitering rules.
Overall, it seems plausible that incremental algorithm
improvements could eventually lead to a robust set of
low-level activity characterization abilities in all but
the most challenging settings.
5. High-Level Activity Detection
In contrast to low-level activities, high-level activities are defined by patterns of behavior that exhibit
more subtle visual cues. It is difficult for video interpretation algorithms to parse these behaviors for
several reasons: the individual cues tend to be difficult
to detect, they do not occur in well-defined sequences,
and they may be exhibited over extended periods of
time (from minutes to hours) and across multiple
camera views. For critical infrastructure protection,
high-level activities of interest include the following:
• Site casing and surveillance: Individuals who survey the layout of a site or closely monitor the details of operation might be doing so in preparation for terrorist or criminal acts. Potential cues include systematic movement throughout the facility, in addition to unusual focus on or photographing of critical site components, such as support structures or air intake vents.
• Abnormally high stress or malicious intent: Some security personnel are trained to look for individuals exhibiting signs of high stress or malicious intent, in order to determine when they should initiate an investigative interview [11]. However, visible indicators of these states of mind, such as aggressive body language or particular facial expressions, are usually fairly subtle and require well-fit models of human motion to achieve automatic interpretation. We note that there are efforts to combine video cues with other remote physiological measurements in order to detect mal-intent (for example, the DHS Future Attribute Screening Technology program).
• Possession of concealed weapons or explosives: While concealed weapons or explosives are typically not directly observable through video, the individuals concealing them tend to exhibit observable patterns of body language. These include unusual walking gaits, constant clothing adjustments, and repetitive over-the-clothes patting of the weapon/explosive to make sure it is in position and secure.

To our knowledge, conventional video analytic tools (both commercial and open-source) do not attempt to detect these high-level activities. This may be because of the ambiguous task definitions or because of the difficulty of detecting these activities. Successful algorithm development for these applications would probably have to incorporate highly articulated models of human body and face motion, and would also have to track and catalog an individual's actions long enough to establish a suspicious pattern of behavior.
6. Object Discrimination
In the broadest sense this technique encompasses any effort to distinguish between various object types within a video scene. Its sophistication and robustness is a strong driver of performance for certain detection capabilities, especially in the case of prevalent background clutter within a scene. We focus on the ability to distinguish between human motion and vehicle motion, and to separate instances of either from negligible background motion, since this form of discrimination is a particularly useful component for many scene interpretation tasks.

To test performance on the discrimination task, we again employed a COTS video analytic system, instructing it to alarm whenever a moving object (human or vehicle) crossed over a specified line segment within each video scene. Since the video analytic system reports whether an alarm was triggered by a person or vehicle, we were able to tabulate a confusion matrix associating person or vehicle line-crossing events with person or vehicle alarms. We extract two performance metrics from the confusion matrix: detection rate (i.e., the percentage of human or vehicle events that triggered any alarm), and confusion rate (i.e., the percentage of detection cases for which the system confused humans and vehicles).
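As a concrete illustration of these two metrics, the sketch below tabulates them from per-event tallies; the tally format is our own construction, not the output format of the COTS system.

    def discrimination_metrics(tally):
        """tally[(truth, alarm)] counts events; alarm is 'person', 'vehicle', or None."""
        events = sum(tally.values())
        detected = sum(n for (truth, alarm), n in tally.items() if alarm is not None)
        confused = sum(n for (truth, alarm), n in tally.items()
                       if alarm is not None and alarm != truth)
        detection_rate = detected / events       # any alarm on a true event
        confusion_rate = confused / detected     # wrong class, given detection
        return detection_rate, confusion_rate

    # 10 person events (8 alarmed, 2 of them as 'vehicle'); 5 vehicle events (4 alarmed):
    tally = {('person', 'person'): 6, ('person', 'vehicle'): 2, ('person', None): 2,
             ('vehicle', 'vehicle'): 4, ('vehicle', None): 1}
    print(discrimination_metrics(tally))  # (0.8, 0.1666...)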
[Figure 3. Person/vehicle discrimination performance by moving clutter category: detection rate and confusion rate (given detection) for crowded indoor vs. sparse outdoor scenes (50-500 pixels per person).]

Figure 3 plots discrimination performance metrics for two environments with different moving clutter levels. As in low-level activity detection, we see a strong disparity in detection capability between crowded and sparse environments. The false negative rate increases with a rise in the number of occlusions occurring within the scene. The confusion rate, however, seems to remain relatively constant despite changes in scene crowdedness. This suggests that crossing tracks of movers, though fundamental to detection, may be a lesser driver of discrimination performance.

Similarly, Figure 4 shows that a reduction in the number of pixels-per-person (down to the 125-500 range) does not significantly alter the discrimination capability in the crowded indoor environment. Both detection rate and confusion rate show only slight reductions within crowded scenes as the number of pixels-per-person decreases, although we would expect performance to drop off significantly at some point if we examined even smaller pixel-per-person ranges.

[Figure 4. Person/vehicle discrimination performance by pixels-on-target range: detection rate and confusion rate (given detection) in crowded indoor environments for the 500-2000, 250-1000, and 125-500 pixels-per-person ranges.]

We observed the following common causes of difficulty for this task:
• Confusion cause: The primary metric for discrimination between vehicles and humans was the aspect ratio of the mover. Thus the successful differentiation between the two groups was driven in part by whether or not the look angle of the camera distorted the aspect ratio of the movers within the scene.
• Confusion cause: When segmentation is imperfect, groups of moving humans and persons toting luggage sometimes mirror the aspect ratio or contour shape of vehicles, leading to algorithm confusion.
• False negative cause: Poor contrast with the background or a high frequency of occlusions seemed to cause difficulty extracting movers from the background.
• False positive cause: Mechanical motion (such as automatic doors sliding shut) was occasionally declared human motion by the algorithm.

Based on these observations, we believe the following would be useful objectives for future algorithm development:
• Automatic correction for perspective distortion may improve performance, since this seems to affect some of the primary discrimination cues (aspect ratios, contours, etc.).
• Incorporation of spatial-temporal texture cues: The assumption underlying this technique is that vehicles maintain more structural rigidity while in motion, while humans exhibit more variation due to moving limbs and changing pose. Furthermore, movement texture (as characterized by the spatial-temporal frequency domain) can more effectively distinguish human/vehicle motion from background motion with unique frequency-domain signatures, such as crashing waves or trees blowing in the wind (see the sketch below).
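As an illustration of this last objective, one could summarize the temporal frequency content of a pixel patch around a mover: a rigid vehicle concentrates temporal energy in few frequencies, while limb motion spreads it. The feature below is our own sketch of the idea, not an implemented tool.

    import numpy as np

    def temporal_spectral_spread(clip):
        """clip: (T, H, W) grayscale window around a mover, T frames long.
        Returns the entropy of the average temporal power spectrum; higher
        values suggest non-rigid (e.g., human) motion texture."""
        spectrum = np.abs(np.fft.rfft(clip, axis=0)) ** 2   # per-pixel temporal FFT
        power = spectrum[1:].mean(axis=(1, 2))              # drop DC, average over pixels
        p = power / power.sum()
        return float(-(p * np.log(p + 1e-12)).sum())        # spectral entropy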
7. Tracking

Based upon desired capabilities for critical infrastructure, we divide the tracking problem into three subproblems of escalating difficulty: tracking within a single camera view, tracking and retrieval across multiple camera views, and full-site path reconstruction on a geo-spatial reference map. Existing video analytic systems mostly address the first subproblem, since relatively short-term tracking within a single camera view drives the interpretation of some specific activities like loitering and item abandonment.
7.1. Single camera tracking
We conducted some tests for single-camera tracking, applying two algorithm implementations from the open-source community to sample testbed videos. The first tracking tool is an OpenCV implementation of the Continuously Adaptive Meanshift (Camshift) algorithm [12]. The second tracking tool is an OpenCV implementation of a Markov Chain Monte Carlo (MCMC) technique [13]. For testing, we selected 145 targets (both persons and vehicles) from four video scenes and applied both tracking algorithms, specifying the initial position of each target within a particular frame as input. If the tracker successfully started the tracking process without immediately transitioning to background confusers, we counted it as an initiated track. If it tracked the target through to the end of the test sequence, we counted it as a successfully completed track. Track lengths in our test videos varied from a few seconds to a few minutes.
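For reference, the sketch below shows a minimal Camshift loop of the sort OpenCV supports, seeded with a manually specified box as in our trials. It illustrates the technique only; it is not our exact test harness or parameter set.

    import cv2

    def run_camshift(frames, init_box):
        """frames: list of BGR images; init_box: (x, y, w, h) around the target.
        Yields the tracked window in each subsequent frame."""
        x, y, w, h = init_box
        roi = frames[0][y:y + h, x:x + w]
        hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])  # hue histogram
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        window = init_box
        for frame in frames[1:]:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
            _, window = cv2.CamShift(backproj, window, term)  # adapt window to target
            yield window

Note that the color-histogram seeding above is exactly what fails for the low-intensity targets discussed below.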
Figure 5 displays the breakdown of tracking results for the Camshift and MCMC trackers, including several categories of observed causes for track loss.

[Figure 5. Tracking results for each technique: breakdown of outcomes (uninitiated track; track loss due to target occlusion, confuser, or environmental change; track completed) for the Camshift and MCMC trackers.]

In addition, Figure 6 plots performance metrics for the more effective tracker (MCMC) on the four test video environments. Other than in the snow-covered environment, performance remains somewhat consistent. However, track completion rates experience some degradation due to scene crowdedness in the indoor setting. The results in outdoor winter conditions are significantly worse because of markedly lower contrast (due to reduced levels of sunlight).

[Figure 6. Tracking performance of the MCMC algorithm across different environments: initiated and completed track rates for indoor crowded, indoor sparse, outdoor sparse (clear conditions), and outdoor sparse (winter conditions) scenes.]

Commonly observed tracking failure modes included:
• Unsuccessful track acquisition: Targets with low-intensity color profiles, often caused by dark clothing, proved problematic for the Camshift technique. It could not achieve target acquisition when it lacked the color to build a stable color histogram.
• Target occlusion: The trackers sometimes lost targets undergoing heavy occlusion by other components in the scene. In these failure cases, the tracker could not recover, terminating the track prematurely.
• Track switch to a confuser: The trackers would sometimes switch over and track confusers, or other parts of the scene with sufficiently similar appearance to the target. Confusers included other moving people and vehicles as well as stationary background components.
• Environmental changes: Changes in outdoor lighting due to clouds or weather, as well as indoor illumination changes, caused some track losses for the Camshift technique.

In general, tracking algorithms search for a consensus between motion models and appearance models. The tracking algorithms we tested experienced a significant number of track losses to confusers, even when the target maintained consistent motion. As a result, we may expect an improvement in performance if we use trackers with more of an emphasis on strong motion models rather than adaptive appearance models that fit to other scene components too easily.

7.2. Multi-camera tracking and path reconstruction

The problem of multi-camera tracking introduces a considerable number of additional difficulties beyond those in the single-camera tracking problem. For one thing, there may be significant variation of human or vehicle appearance from one camera view to another, due to potential differences in resolution, viewing angle, and imaging properties between cameras. In addition, if the cameras do not overlap, the tracker must disregard motion models or use only weak ones to associate tracks between views. Finally, tracking a person or vehicle across an entire critical infrastructure site requires maintaining tracks over larger spatial areas and periods of time, increasing the volume of potential confusers.

Path reconstruction requires not only successful multi-camera tracking but also geo-registration and path interpolation. Geo-registration requires each camera to map the coordinate system of its two-dimensional image plane into a common coordinate system in the three-dimensional world, so that the target path can be traced on a shared reference map. This mapping typically requires prior knowledge about the location of the ground plane relative to the camera and becomes unstable at far-field viewing angles. In addition, when camera coverage is not complete, parts of the target path will be missing; therefore, complete path reconstruction requires that the video analytic system fill in the gaps with some sort of path estimation algorithm. Although commercial systems are beginning to address these trickier tracking problems, the capabilities of this technology remain less than robust in operational settings.
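When targets move on a single ground plane, the geo-registration mapping described above reduces to an image-to-floor-plan homography per camera. The sketch below illustrates this with OpenCV; the four calibration correspondences are hypothetical placeholders for a real survey of the scene.

    import cv2
    import numpy as np

    # Four image points (pixels) and their surveyed floor-plan positions (meters).
    img_pts = np.float32([[100, 600], [1180, 620], [900, 300], [300, 310]])
    map_pts = np.float32([[0, 0], [40, 0], [35, 25], [5, 24]])

    H = cv2.getPerspectiveTransform(img_pts, map_pts)  # image plane -> floor plan

    def to_floor_plan(track_px):
        """Map an (N, 2) array of image-plane track points onto the floor plan."""
        pts = np.float32(track_px).reshape(-1, 1, 2)
        return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

    print(to_floor_plan([[640, 460]]))  # one tracked foot position, in meters

As the text notes, this mapping assumes the tracked point lies on the ground plane and degrades at far-field viewing angles.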
8. Video Content Retrieval
The objective for this class of capabilities is the
intelligent retrieval of video samples from archived
video based on user queries. Because it is computationally impractical to re-process large sets of archived
video every time a user query is issued, these systems
must instead process video as it is captured and store
useful meta-data, or content tags, on which to perform
searches. Consequently, content retrieval systems must
address three fundamental problems: how to interpret
video content as it arrives, how to form meta-data for
storage of this content, and how to process (preferably
semantic) user queries about content.
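To make these three problems concrete, the sketch below stores toy content tags in a relational table and answers a query for all vehicles appearing in a specified region within the last several days. The schema and names are our own illustration, not a deployed system's design.

    import sqlite3, time

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE tags (
        ts REAL,            -- capture time (epoch seconds)
        camera TEXT,        -- camera identifier
        obj_class TEXT,     -- e.g., 'person', 'vehicle'
        x REAL, y REAL      -- image-plane centroid of the detection
    )""")

    # Tags are written once, as video arrives (here, one synthetic detection).
    db.execute("INSERT INTO tags VALUES (?, ?, ?, ?, ?)",
               (time.time(), "cam3", "vehicle", 412.0, 266.0))

    # Query: vehicles in a spatial window on cam3 within the last three days,
    # answered from meta-data alone -- no archived video is reprocessed.
    rows = db.execute(
        "SELECT ts, x, y FROM tags WHERE obj_class = 'vehicle' AND camera = 'cam3' "
        "AND ts > ? AND x BETWEEN 300 AND 500 AND y BETWEEN 200 AND 400",
        (time.time() - 3 * 86400,)).fetchall()
    print(rows)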
Solutions to the first problem are limited by the capabilities of real-time video analytic processing. Techniques for meta-data formation must deal with a trade-off between the size of the extracted data and its expressiveness: systems that store too many low-level details require more storage space and incur slower search times, while systems that store more compact descriptors discard potentially valuable information. Finally, the query format is important because queries must be user-friendly and capable of expressing the desired search criteria.
Some systems also store meta-data related to the
appearance of standard object classes (e.g., persons,
vehicles, boats, or faces) in support of queries requesting all instances of one of these classes within
a spatial or temporal window. While video indexing
based on object classes is not perfect, this capability
has improved significantly within the last decade due
to active research, as evidenced by the TRECVID and
Video Analysis and Content Extraction (VACE)[14]
evaluations. In summary, existing systems can handle
basic query structures using a limited set of pre-defined
object classes and modifiers.
9. Summary
Existing video analytic technology offers some useful tools for critical infrastructure security, although it does not yet achieve all of the capabilities desired by security personnel. Video analytics software can reliably recognize certain types of low-level activities when the scene characteristics facilitate successful operation. However, specific activity detection becomes less than reliable when the scene contains a high volume of moving clutter or when the activity-of-interest is defined by more complicated interactions between persons, limbs, vehicles, or objects (e.g., a person breaking a car window). Current systems can also achieve some degree of discrimination and tracking capabilities, although these are error-prone. For security purposes, particularly useful developments in video analytics technology would include the following: basic interpretation of behavioral cues, robust discrimination between humans and non-humans (especially in far-field views), and facility-wide multi-camera tracking and path reconstruction.
References
[1] J. Vlahos, "Surveillance society: New high-tech cameras are watching you," Popular Mechanics, January 2008.
[2] N. Gagvani, "Challenges in video analytics," in Advances in Pattern Recognition: Embedded Computer Vision, 2008, ch. 12, pp. 237–256.
[3] A. Smith, "TI, others see potential for smart video security systems."
[4] J. Ferryman, Ed., Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2007. [Online]. Available: http://pets2007.net/
[5] A. Smeaton, P. Over, and W. Kraaij, "High-level feature detection from video in TRECVid: A 5-year retrospective of achievements," in Multimedia Content Analysis, Theory and Applications, 2009, pp. 151–174.
[6] C. Anagnostopoulos, I. Anagnostopoulos, I. Psoroulas, and V. Loumos, "License plate recognition from still images and video sequences: A survey," IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 3, pp. 377–391, 2008.
[7] D. Tao, X. Li, X. Wu, and S. Maybank, "General tensor discriminant analysis and Gabor features for gait recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1700–1715, 2007.
[8] P. Phillips, "Overview of the multiple biometric grand challenge," in MBGC Workshop, 2009. [Online]. Available: http://face.nist.gov/mbgc/
[9] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, 2008.
[10] D. Thirde, L. Li, and J. Ferryman, "An overview of the PETS 2006 dataset," in Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2006), 2006, pp. 47–50.
[11] S. Donnelly, "A chastened airport watches for suspects," Time Magazine, August 2003.
[12] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2000.
[13] S. Zhou, R. Chellappa, and B. Moghaddam, "Visual tracking and recognition using appearance-adaptive models in particle filters," IEEE Transactions on Image Processing, vol. 13, no. 11, pp. 1491–1506, 2004.
[14] V. Manohar, M. Boonstra, V. Korzhova, P. Soundararajan, D. Goldgof, R. Kasturi, S. Prasad, H. Raju, R. Bowers, and J. Garofolo, "PETS vs. VACE evaluation programs: A comparative study," in Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2006.