An Assessment of the Video Analytics Technology Gap for Transportation Facilities

Jason Thornton, Jeanette Baran-Gale, Aaron Yahr
MIT Lincoln Laboratory, 244 Wood St., Lexington, MA 02420

Abstract

We conduct an assessment of existing video analytic technology as applied to critical infrastructure protection, particularly in the transportation sector.
Based on discussions with security personnel at multiple facilities, we assemble a list of desired video analytics functionality, which we group into five categories: low-level activity detection, high-level behavior detection, discrimination, tracking, and content retrieval. We then evaluate the capabilities and deficiencies of current technology to meet these needs, in part by applying representative video analytic tools to a testbed of video data from multiple sources. As part of the evaluation, we examine performance across pixel resolution and degree of scene clutter. Finally, we identify directions for promising technology development in order to address critical gaps.

This work was sponsored by the Department of the Air Force under Air Force Contract # FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

1. Introduction

In support of security operations, critical infrastructure facilities often deploy comprehensive installations of surveillance video cameras. These surveillance systems serve two functions: they provide a visual record of events for the investigation of past incidents, and they allow security personnel to perform real-time monitoring of their facilities more effectively. The visual cues provided by such systems may be used to detect suspicious or otherwise noteworthy behavior, including some criminal and terrorist acts. Due to the decreasing costs of digital video cameras, digital video recorders, and network capabilities, many facilities have been increasing their investments in this technology (and therefore, their surveillance coverage)[1]. However, it is difficult for humans to monitor multiple video feeds simultaneously, or to inspect large volumes of archived data. This is the motivation for video analytic (or intelligent video) technology, which encompasses a set of techniques for the automatic interpretation of video data. Such tools can serve as aids for security personnel, directing their attention to the most relevant video.

Many commercially available tools offer a standard set of video interpretation capabilities and are meant to work out of the box, with little or no tuning required by the end user[2]. Some critical infrastructure facilities have already deployed such tools within limited capacities, especially for perimeter monitoring and archived content retrieval[3]. However, the performance of these systems is often problematic, due to the difficulties inherent in video analysis. Two of the most common modes of failure are false negatives and false positives. In the first case, the system does not detect an activity-of-interest because it is too complex or subtle for the analytic algorithms to reliably distinguish it from normal background activity. In the second case, the system claims to detect an activity that is not exhibited in the processed video, issuing a false alarm; when frequent enough, these alarms become a distraction to security personnel rather than an aid. In addition, some critical infrastructure facilities have specialized analytic needs which are not well-addressed by the commercial market.

In this work, we conduct an assessment of existing video analytic technology as applied to critical infrastructure protection, based partially on experiments using a video data testbed. As part of this assessment, we are surveying the video interpretation needs of different types of critical infrastructure facilities, as well as their past or present experiences with video analytic technology. We are then applying representative commercial and open-source tools to sample video datasets, in order to evaluate the ability of current technology to meet high-priority critical infrastructure needs. Finally, we are identifying critical gaps which might benefit from near-term research and development within the next several years. We note that our initial focus is on algorithms and information processing, and not on camera hardware or network configuration issues.

There have been several recent open evaluations of video interpretation algorithms, including the Performance Evaluation of Tracking and Surveillance (PETS) workshops[4] and the NIST TREC Video Retrieval Evaluation series[5]. To the extent that these evaluations address the capabilities we wish to assess, we take into account their findings and, in the case of PETS, incorporate some of their data into our experiments.

2. Desired Capabilities

Before conducting our assessment, we collected information about video analytic needs from security representatives at airports, subway stations, train/bus stations, national monuments, secure office buildings, and military bases. This information includes desired functionality (irrespective of technical limitations), relative priorities for these needs, and some input about operational restrictions. Based on these discussions, we assembled a list of desired capabilities, which we group into the following five categories:
• Low-level activity detection: This category includes detection capabilities for specific actions with relatively well-defined descriptions. These activities occur over short time spans ranging from seconds to minutes, and judgment of their occurrence is generally consistent across human observers. Examples include fence-climbing, window-breaking, and item abandonment.

• High-level activity detection: High-level activities are more ambiguously defined than low-level activities, indicated more by patterns of behavior than by specific physical actions. In fact, it is not always clear to human observers what comprises an instance of these activity types. Examples include pre-operational site casing and surveillance, and individuals exhibiting abnormally high stress levels.

• Object discrimination: This category includes any sort of discrimination between scene components, including humans, vehicles, animals, and inanimate objects. Since many surveillance tasks revolve around human activity, one of the most useful forms of discrimination is between human motion and non-human motion.

• Coordinated tracking: This category includes prolonged tracking of persons or vehicles, possibly across multiple camera views within the same facility. Some desired capabilities require tracking a person or vehicle backwards in time in order to bring up all video depictions. Other capabilities require tracing an object's path on a common geo-spatial reference frame shared by all cameras (such as a facility floor plan).

• Video content retrieval: Video retrieval capabilities allow security personnel to search through archived footage efficiently. For example, security may wish to execute semantic queries for particular types of content, such as all vehicles appearing in a specified region within the last several days. Such tools must accommodate searching over large video datasets without heavy computational loads (i.e., without reprocessing all archived video).

For the most part, these need categories require different video interpretation techniques, although there are some common sub-components between the categories. For instance, the detection of tailgating may rely upon the ability to track individuals over relatively short time periods. We note that we exclude two more specialized categories of video analytic capabilities from our assessment: video-based biometrics and license plate recognition. Though we do not discuss them here, both of these are active areas of research and development[6], [7], [8].

3. Assessment Methodology

To aid with our assessment of existing video analytic technology, we are acquiring testbed data from several sources, described below:

• The first source is publicly available datasets from PETS 2004, 2006, and 2007, which contain several types of staged activities (e.g., abandoned items, loitering, fighting, falling) among backgrounds of crowds in public areas. This video was captured at 30 frames per second and stored using the MPEG4 compression format.

• The second source is an internal dataset collected by clusters of high-resolution cameras at Lincoln Laboratory overlooking a parking lot and several building exteriors. We have video data collected on two different days (one overcast day with snow coverage, and one mild day with no snow coverage). This dataset also includes several staged activities (such as fence-climbing and casing), along with normal background activity depicting person/vehicle motion and interactions. This video was captured at 4 frames per second and stored using the MPEG4 compression format.
• The final source is external data collected at one or more representative critical infrastructure facilities. Collection of this data is ongoing, and the experimental results plots presented in this paper do not incorporate this data source.

Combined, the datasets we use contain a variety of indoor and outdoor environments, with variable camera positions and a range of scene clutter and occlusion effects. In order to guide our assessment, we applied examples of commercial off-the-shelf and open-source video analytic software to our testbed data. Most of the open-source tools were taken from OpenCV[9], a library of computer vision utilities. We note that we cannot conduct empirical tests for every capability need on our list, since we are limited both by the content of the testbed data and the functionality available in our tools. We use a few tools as case studies which, when combined with the context provided by video interpretation experts and feedback from security personnel, provide some useful indications about the state of the technology. During experimentation, we vary properties of the video data such as frame rate and pixel resolution, in order to gauge the effect on video analytic performance.
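To make this kind of data variation concrete, the following minimal sketch (Python with OpenCV, the library cited above) down-samples a clip in both resolution and frame rate before it is re-run through an analytic tool. The file names and reduction factors are illustrative assumptions, not values from our experiments.

```python
import cv2

def degrade_clip(src_path, dst_path, scale=0.5, keep_every_nth=2):
    """Write a reduced-resolution, reduced-frame-rate copy of a video."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * scale)
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * scale)
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                          fps / keep_every_nth, (w, h))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % keep_every_nth == 0:             # drop frames: lower frame rate
            out.write(cv2.resize(frame, (w, h)))  # shrink frames: fewer pixels-on-target
        idx += 1
    cap.release()
    out.release()

# e.g., halve the resolution and frame rate of a 30 fps clip before re-testing
degrade_clip("pets_clip.avi", "pets_clip_degraded.mp4", scale=0.5, keep_every_nth=2)
```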
4. Low-Level Activity Detection

There is a wide range of specific activities relevant to critical infrastructure security operations. Activities of interest include abandoned items, loitering, perimeter-crossing, fence-climbing, window-breaking, baggage theft, tailgating through restricted access points, falling/collapsing, and physical altercations. Commercial video analytics tools address the detection of some of these directly, incorporating them into the syntax made available to the end-user when defining alarm-generation rules. The first three capabilities, in particular, are commonly included in commercial packages. We discuss empirical performance results for one of these sample capabilities below.

4.1. Abandoned items

Item abandonment typically refers to situations in which pedestrians leave behind hand-carried items (e.g., bags, packages, or suitcases) in public spaces. The video analytic objective is to form an association between pedestrians and the items in their possession, and then raise an alarm when a pedestrian moves away from the item for a sustained period of time (defined by some spatial and temporal criteria encoded in the analytic rule)[10].

For our quantitative evaluation, we used 43 video sequences depicting abandoned item events, along with background data depicting no abandoned item events. This data was recorded in five different environments and contains some video of the same events captured at different camera angles or post-processed to achieve different frame resolutions. We applied a representative COTS video analytics toolkit to process this data, using built-in functionality to attempt detection of abandoned item events. As in all of our empirical trials, we left any configurable algorithm parameters at their default values. Any system alarm corresponding to an actual left item event was considered a successful detection, while any alarm that did not was considered a false positive. In some cases, the system issued multiple alarms in rapid succession, apparently in response to the same activity; we combined these instances into a single alarm when computing performance statistics.

Across all processed video, the probability of detection of real abandoned item events was 67%. To some degree, the performance is a function of data characteristics such as scene crowdedness and pixel resolution, which vary in our dataset. To address these factors, we ascribe to each video clip a category for clutter (i.e., the average number of movers in the scene) and pixels-on-item (i.e., the average number of pixels depicting the abandoned item). Figure 1 shows the detector performance across two environment types, crowded indoor and sparse indoor, for a fixed range of pixels-on-item. The sparse scenes have approximately 0-10 movers in a typical frame, while the crowded scenes have approximately 10-30. In this case, the scenes with less clutter allow for a reduction in false positives and a significant increase in detections. In contrast, Figure 2 plots the performance statistics across pixels-on-item for sparse indoor settings. Note that probability of detection decreases with resolution, although the 25-100 pixel range still gives sufficient cues for 50% detection in the sparse environment.

Figure 1. Abandoned item detection performance by moving clutter category. [Chart: detection rate and false alarms per hour, sparse indoor vs. crowded indoor scene types, 100-400 pixels on item.]

Figure 2. Abandoned item detection performance by pixels-on-item range. [Chart: detection rate and false alarms per hour across the 100-400, 50-200, and 25-100 pixels-on-item ranges, sparse indoor environments.]

We observed the following common causes of difficulty for this task:

• False positive cause: Shifting heads or limbs that suddenly come into view are mistaken for abandoned items. In these cases, the skin-colored group of pixels stands out from clothing or scene background as a newly introduced blob.

• False positive cause: Stationary humans, or slow-moving humans who are distant from the camera, are labeled as abandoned objects.

• False positive cause: After establishing an association between person and object, they are disassociated when the person becomes partially occluded, leading to an incorrect abandoned item alarm.

• False negative cause: Motion in front of an abandoned item causes frequent occlusions, preventing the system from recognizing a static object.

Based on these observations, we believe the following would be useful objectives for future algorithm development:

• Incorporation of stronger, more articulated human body models: A model that correctly fits pixels to a detailed human form, including the limbs and head, would be less likely to confuse stationary humans or human segments with inanimate objects.

• Three-dimensional scene reconstructions: Partial occlusion is a frequent obstacle for abandoned item detection. Computing a three-dimensional representation of the scene using multiple overlapping camera views (if available) can help recover some object positions otherwise lost in two-dimensional image plane projections.
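For concreteness, the following is a minimal sketch of the spatio-temporal persistence portion of an abandoned-item rule, written in Python with OpenCV. It flags any sufficiently large foreground blob that stays nearly stationary for a sustained dwell time; it omits the pedestrian-item association step described above, and the clip name and all thresholds are illustrative assumptions rather than the COTS tool's actual parameters.

```python
import cv2

DWELL_FRAMES = 240      # assumed dwell rule: ~60 s of persistence at 4 fps
GRID = 16               # coarse grid (pixels) used as a cheap stationary-blob identity
MIN_AREA = 25           # ignore blobs smaller than ~25 pixels on item

bg = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
dwell = {}              # grid cell -> consecutive frames containing a static blob

cap = cv2.VideoCapture("station_cam.mp4")       # hypothetical input clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = bg.apply(frame, learningRate=1e-4)     # slow update keeps a left item in the mask
    fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)[1]  # drop shadow pixels (value 127)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    seen = set()
    for c in contours:
        if cv2.contourArea(c) < MIN_AREA:
            continue
        x, y, w, h = cv2.boundingRect(c)
        cell = (x // GRID, y // GRID)           # same cell across frames ~ stationary blob
        dwell[cell] = dwell.get(cell, 0) + 1
        seen.add(cell)
        if dwell[cell] == DWELL_FRAMES:
            print(f"abandoned-item alarm: box ({x},{y},{w},{h})")
    dwell = {c: n for c, n in dwell.items() if c in seen}   # reset vanished blobs
cap.release()
```

Note how directly this simple rule exhibits the failure modes listed above: any stationary, well-contrasted blob (including a motionless person) accumulates dwell time, and frequent occlusions in front of a genuinely abandoned item repeatedly reset its count.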
4.2. Discussion

Current video analytic technology demonstrates some success detecting low-level activities, as long as there is a sufficient number of pixels-on-target and low clutter in the form of movers. The performance of these capabilities tends to be a function of both the complexity of interactions which define the activities and the duration of the activities. Simple-interaction activities such as perimeter-crossing or falling/collapsing require a less detailed (and therefore easier) video interpretation than activities such as tailgating, item abandonment, or window-breaking, which require parsing interactions between multiple persons or objects. In addition, actions that occur over relatively short time spans facilitate detection more than actions that occur over extended time spans, which have more opportunity for algorithm breakdown. For instance, we would expect five-minute loitering rules to miss detections more often than 30-second loitering rules. Overall, it seems plausible that incremental algorithm improvements could eventually lead to a robust set of low-level activity characterization abilities in all but the most challenging settings.

5. High-Level Activity Detection

In contrast to low-level activities, high-level activities are defined by patterns of behavior that exhibit more subtle visual cues. It is difficult for video interpretation algorithms to parse these behaviors for several reasons: the individual cues tend to be difficult to detect, they do not occur in well-defined sequences, and they may be exhibited over extended periods of time (from minutes to hours) and across multiple camera views. For critical infrastructure protection, high-level activities of interest include the following:

• Site casing and surveillance: Individuals who survey the layout of a site or closely monitor the details of operation might be doing so in preparation for terrorist or criminal acts. Potential cues include systematic movement throughout the facility, in addition to unusual focus on or photographing of critical site components, such as support structures or air intake vents.

• Abnormally high stress or malicious intent: Some security personnel are trained to look for individuals exhibiting signs of high stress or malicious intent, in order to determine when they should initiate an investigative interview[11]. However, visible indicators of these states of mind, such as aggressive body language or particular facial expressions, are usually fairly subtle and require well-fit models of human motion to achieve automatic interpretation. We note that there are efforts to combine video cues with other remote physiological measurements in order to detect mal-intent (for example, the DHS Future Attribute Screening Technology program).

• Possession of concealed weapons or explosives: While concealed weapons or explosives are typically not directly observable through video, the individuals concealing them tend to exhibit observable patterns of body language. These include unusual walking gaits, constant clothing adjustments, and repetitive over-the-clothes patting of the weapon/explosive to make sure it is in position and secure.

To our knowledge, conventional video analytic tools (both commercial and open-source) do not attempt to detect these high-level activities. This may be because of the ambiguous task definitions or because of the difficulty of detecting these activities. Successful algorithm development for these applications would probably have to incorporate highly articulated models of human body and face motion, and would also have to track and catalog an individual's actions long enough to establish a suspicious pattern of behavior.
6. Object Discrimination

In the broadest sense, this technique encompasses any effort to distinguish between various object types within a video scene. Its sophistication and robustness is a strong driver of performance for certain detection capabilities, especially in the case of prevalent background clutter within a scene. We focus on the ability to distinguish between human motion and vehicle motion, and to separate instances of either from negligible background motion, since this form of discrimination is a particularly useful component for many scene interpretation tasks.

To test performance on the discrimination task, we again employed a COTS video analytic system, instructing it to alarm whenever a moving object (human or vehicle) crossed over a specified line segment within each video scene. Since the video analytic system reports whether an alarm was triggered by a person or vehicle, we were able to tabulate a confusion matrix associating person or vehicle line-crossing events with person or vehicle alarms. We extract two performance metrics from the confusion matrix: detection rate (i.e., the percentage of human or vehicle events that triggered any alarm), and confusion rate (i.e., the percentage of detection cases for which the system confused humans and vehicles).

Figure 3 plots discrimination performance metrics for two environments with different moving clutter levels. As in low-level activity detection, we see a strong disparity in detection capability between crowded and sparse environments. The false negative rate increases with a rise in the number of occlusions occurring within the scene. The confusion rate, however, seems to remain relatively constant despite changes in scene crowdedness. This suggests that crossing tracks of movers, though fundamental to detection, may be a lesser driver of discrimination performance. Similarly, Figure 4 shows that a reduction in the number of pixels-per-person (down to the 125-500 range) does not significantly alter the discrimination capability in the crowded indoor environment. Both detection rate and confusion rate show only slight reductions within crowded scenes as the number of pixels-per-person decreases, although we would expect performance to drop off significantly at some point if we examined even smaller pixel-per-person ranges.

Figure 3. Person/vehicle discrimination performance by moving clutter category. [Chart: detection rate and confusion rate (given detection), crowded indoor vs. sparse outdoor scene types, 50-500 pixels per person.]

Figure 4. Person/vehicle discrimination performance by pixels-on-target range. [Chart: detection rate and confusion rate (given detection) across the 500-2000, 250-1000, and 125-500 pixels-per-person ranges, crowded indoor environments.]

We observed the following common causes of difficulty for this task:

• Confusion cause: The primary metric for discrimination between vehicles and humans was the aspect ratio of the mover. Thus, successful differentiation between the two groups was driven in part by whether or not the look angle of the camera distorted the aspect ratio of the movers within the scene. (A minimal sketch of this cue appears at the end of this section.)

• Confusion cause: When segmentation is imperfect, groups of moving humans and persons toting luggage sometimes mirror the aspect ratio or contour shape of vehicles, leading to algorithm confusion.

• False negative cause: Poor contrast with the background or a high frequency of occlusions seemed to cause difficulty extracting movers from the background.

• False positive cause: Mechanical motion (such as automatic doors sliding shut) was occasionally declared human motion by the algorithm.

Based on these observations, we believe the following would be useful objectives for future algorithm development:

• Automatic correction for perspective distortion: This may improve performance, since perspective distortion seems to affect some of the primary discrimination cues (aspect ratios, contours, etc.).

• Incorporation of spatial-temporal texture cues: The assumption underlying this technique is that vehicles maintain more structural rigidity while in motion, while humans exhibit more variation due to moving limbs and changing pose. Furthermore, movement texture (as characterized by the spatial-temporal frequency domain) can more effectively distinguish human/vehicle motion from background motion with unique frequency-domain signatures, such as crashing waves or trees blowing in the wind.
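The following minimal sketch illustrates the aspect-ratio cue identified in the first confusion cause above, along with the two metrics extracted from the confusion matrix. It assumes foreground contours produced by some background-subtraction stage; the threshold value is an illustrative assumption, not the COTS system's actual decision rule.

```python
import cv2

ASPECT_THRESHOLD = 1.2    # illustrative cutoff between "tall" and "wide" movers

def classify_mover(contour):
    """Label a foreground contour as person or vehicle from bounding-box shape."""
    x, y, w, h = cv2.boundingRect(contour)
    aspect = h / float(w)
    # Upright pedestrians are taller than wide (aspect well above 1); vehicles
    # are wider than tall (aspect below 1). An oblique or far-field look angle
    # distorts this ratio, which is exactly the confusion cause noted above.
    return "person" if aspect > ASPECT_THRESHOLD else "vehicle"

def rates(correct_person, correct_vehicle, confused, missed):
    """Detection and confusion rates, as defined above, from event counts."""
    detected = correct_person + correct_vehicle + confused
    detection_rate = detected / float(detected + missed)
    confusion_rate = confused / float(detected)
    return detection_rate, confusion_rate
```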
7. Tracking

Based upon desired capabilities for critical infrastructure, we divide the tracking problem into three subproblems of escalating difficulty: tracking within a single camera view, tracking and retrieval across multiple camera views, and full-site path reconstruction on a geo-spatial reference map. Existing video analytic systems mostly address the first subproblem, since relatively short-term tracking within a single camera view drives the interpretation of some specific activities like loitering and item abandonment.

7.1. Single camera tracking

We conducted some tests for single-camera tracking, applying two algorithm implementations from the open-source community to sample testbed videos. The first tracking tool is an OpenCV implementation of the Continuously Adaptive Meanshift (Camshift) algorithm[12]. The second tracking tool is an OpenCV implementation of a Markov Chain Monte Carlo (MCMC) technique[13].

For testing, we selected 145 targets (both persons and vehicles) from four video scenes and applied both tracking algorithms described above. We specified the initial position of each target within a particular frame as input to the tracking algorithms. If the tracker successfully started the tracking process without immediately transitioning to background confusers, we counted it as an initiated track. If it tracked the target through to the end of the test sequence, we counted it as a successfully completed track. Track lengths in our test videos varied from a few seconds to a few minutes.

Figure 5 displays the breakdown of tracking results for the Camshift and MCMC trackers, including several categories of observed causes for track loss. In addition, Figure 6 plots performance metrics for the more effective tracker (MCMC) on the four test video environments. Other than in the snow-covered environment, performance remains somewhat consistent. However, track completion rates experience some degradation due to scene crowdedness in the indoor setting. The results in outdoor winter conditions are significantly worse because of markedly lower contrast (due to reduced levels of sunlight).

Figure 5. Tracking results for each technique. [Chart: outcome breakdowns for the Camshift and MCMC trackers across the categories uninitiated track; track loss due to target occlusion, confuser, or environmental change; and track completed.]

Figure 6. Tracking performance of MCMC algorithm across different environments. [Chart: initiated and completed track success rates for indoor crowded, indoor sparse, outdoor sparse (clear conditions), and outdoor sparse (winter conditions) scenes.]

Commonly observed tracking failure modes included:

• Unsuccessful track acquisition: Targets with low-intensity color profiles, often caused by dark clothing, proved problematic for the Camshift technique. It could not achieve target acquisition when it lacked the color to build a stable color histogram.

• Target occlusion: The trackers sometimes lost targets undergoing heavy occlusion by other components in the scene. In these failure cases, the tracker could not recover, terminating the track prematurely.

• Track switch to a confuser: The trackers would sometimes switch over and track confusers, or other parts of the scene with sufficiently similar appearance to the target. Confusers included other moving people and vehicles as well as stationary background components.

• Environmental changes: Changes in outdoor lighting due to clouds or weather, as well as indoor illumination changes, caused some track losses for the Camshift technique.

In general, tracking algorithms search for a consensus between motion models and appearance models. The tracking algorithms we tested experienced a significant number of track losses to confusers, even when the target maintained consistent motion. As a result, we may expect an improvement in performance if we use trackers with more of an emphasis on strong motion models rather than adaptive appearance models that fit to other scene components too easily. A minimal sketch of the Camshift loop appears below.
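Our Camshift trials followed the standard OpenCV recipe, sketched minimally here: build a hue histogram over the analyst-specified initial box, then back-project it on each new frame and let CamShift re-fit the search window. The clip name and initial box are hypothetical placeholders.

```python
import cv2

cap = cv2.VideoCapture("lot_cam.mp4")            # hypothetical test clip
ok, frame = cap.read()
x, y, w, h = 300, 200, 40, 90                    # analyst-specified initial target box
track_window = (x, y, w, h)

hsv_roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
# Mask out dim, low-saturation pixels: dark clothing yields the unstable hue
# histogram behind the acquisition failures noted above.
mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))
hist = cv2.calcHist([hsv_roi], [0], mask, [16], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    rot_box, track_window = cv2.CamShift(backproj, track_window, term)
    pts = cv2.boxPoints(rot_box).astype(int)     # rotated box fitted to the target
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
cap.release()
```

Because the appearance model here is a single adaptive color histogram, this loop exhibits the confuser and illumination failure modes listed above almost by construction.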
7.2. Multi-camera tracking and path reconstruction

The problem of multi-camera tracking introduces a considerable number of additional difficulties beyond those in the single-camera tracking problem. For one thing, there may be significant variation of human or vehicle appearance from one camera view to another. This is due to potential differences in resolution, viewing angle, and imaging properties between cameras. In addition, if the cameras do not overlap, the tracker must either disregard motion models or use only weak ones to associate tracks between views. Finally, tracking a person or vehicle across an entire critical infrastructure site requires maintaining tracks over larger spatial areas and periods of time, increasing the volume of potential confusers.

Path reconstruction requires not only successful multi-camera tracking but also geo-registration and path interpolation. Geo-registration requires each camera to map the coordinate system of its two-dimensional image plane into a common coordinate system in the three-dimensional world, so that the target path can be traced on a shared reference map. This mapping typically requires prior knowledge about the location of the ground plane relative to the camera and becomes unstable at far-field viewing angles. In addition, when camera coverage is not complete, parts of the target path will be missing; therefore, complete path reconstruction requires that the video analytic system fill in the gaps with some sort of path estimation algorithm. Although commercial systems are beginning to address these trickier tracking problems, the capabilities of this technology remain less than robust in operational settings.
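When targets move on a known ground plane, the geo-registration step reduces to a planar homography. The following minimal sketch assumes at least four surveyed correspondences between image coordinates and floor-plan coordinates; the specific point values are made-up placeholders.

```python
import cv2
import numpy as np

# Pixel locations of surveyed landmarks in this camera's view ...
img_pts = np.float32([[102, 540], [618, 512], [433, 281], [76, 300]])
# ... and the same landmarks in floor-plan (e.g., meter) coordinates.
map_pts = np.float32([[0, 0], [12, 0], [12, 18], [0, 18]])

H, _ = cv2.findHomography(img_pts, map_pts)

def to_floor_plan(track_point):
    """Project an image-plane track point (e.g., a foot location) onto the map."""
    p = np.float32([[track_point]])              # shape (1, 1, 2) as OpenCV expects
    return cv2.perspectiveTransform(p, H)[0, 0]

print(to_floor_plan((350, 420)))   # map position of a tracked person, in meters
```

The instability at far-field viewing angles noted above shows up here directly: as the camera's look angle flattens, small pixel errors in img_pts or in the track point map to large displacements on the floor plan.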
8. Video Content Retrieval

The objective for this class of capabilities is the intelligent retrieval of video samples from archived video based on user queries. Because it is computationally impractical to re-process large sets of archived video every time a user query is issued, these systems must instead process video as it is captured and store useful meta-data, or content tags, on which to perform searches. Consequently, content retrieval systems must address three fundamental problems: how to interpret video content as it arrives, how to form meta-data for storage of this content, and how to process (preferably semantic) user queries about content.

Solutions to the first problem are limited by the capabilities of real-time video analytic processing. Techniques for meta-data formation must deal with a trade-off between the size of the extracted data and its expressiveness; systems which store too many low-level details require more storage space and slower search times, while systems that store more compact descriptors discard potentially valuable information. Finally, the format for queries is important because queries must be user-friendly and capable of expressing the desired search criteria.

Some systems also store meta-data related to the appearance of standard object classes (e.g., persons, vehicles, boats, or faces) in support of queries requesting all instances of one of these classes within a spatial or temporal window. While video indexing based on object classes is not perfect, this capability has improved significantly within the last decade due to active research, as evidenced by the TRECVID and Video Analysis and Content Extraction (VACE)[14] evaluations. In summary, existing systems can handle basic query structures using a limited set of pre-defined object classes and modifiers.
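As an illustration of this indexing pattern, the sketch below stores compact per-detection tags in a relational index at ingest time, then answers the example query from Section 2 (all vehicles appearing in a specified region within the last several days) without reprocessing archived video. The schema, file names, and helper function are illustrative assumptions, not a description of any particular commercial system.

```python
import sqlite3

db = sqlite3.connect("video_index.db")
db.execute("""CREATE TABLE IF NOT EXISTS detections (
                camera_id TEXT, t_utc REAL,                  -- when and where captured
                obj_class TEXT,                              -- e.g., 'person', 'vehicle'
                x INTEGER, y INTEGER, w INTEGER, h INTEGER,  -- image-plane bounding box
                clip_path TEXT, frame_no INTEGER)            -- pointer back into archive
           """)
# At ingest time, the analytic front end would insert one row per detection.

def vehicles_in_region(camera_id, region, since_utc):
    """All vehicle detections whose box overlaps a query region since a given time."""
    rx, ry, rw, rh = region
    return db.execute(
        """SELECT clip_path, frame_no FROM detections
           WHERE camera_id = ? AND obj_class = 'vehicle' AND t_utc >= ?
             AND x < ? AND x + w > ? AND y < ? AND y + h > ?""",
        (camera_id, since_utc, rx + rw, rx, ry + rh, ry)).fetchall()
```

The rows here are deliberately compact, reflecting the trade-off described above: a richer descriptor per detection would support more expressive queries at the cost of storage and search speed.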
9. Summary

Existing video analytic technology offers some useful tools for critical infrastructure security, although it does not yet achieve all of the capabilities desired by security personnel. Video analytics software can reliably recognize certain types of low-level activities when the scene characteristics facilitate successful operation. However, specific activity detection becomes less than reliable when the scene contains a high volume of moving clutter or when the activity-of-interest is defined by more complicated interactions between persons, limbs, vehicles, or objects (e.g., a person breaking a car window). Current systems can also achieve some degree of discrimination and tracking capabilities, although these are error-prone. For security purposes, particularly useful developments in video analytics technology would include the following: basic interpretation of behavioral cues, robust discrimination between humans and non-humans (especially in far-field views), and facility-wide multi-camera tracking and path reconstruction.

References

[1] J. Vlahos, "Surveillance society: New high-tech cameras are watching you," Popular Mechanics, January 2008.

[2] N. Gagvani, "Challenges in video analytics," in Advances in Pattern Recognition: Embedded Computer Vision, 2008, ch. 12, pp. 237-256.

[3] A. Smith, "TI, others see potential for smart video security systems."

[4] M. Ferryman, Ed., Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2007. [Online]. Available: http://pets2007.net/

[5] A. Smeaton, P. Over, and W. Kraaij, "High-level feature detection from video in TRECVid: A 5-year retrospective of achievements," in Multimedia Content Analysis, Theory and Applications, 2009, pp. 151-174.

[6] C. Anagnostopoulos, I. Anagnostopoulos, I. Psoroulas, and V. Loumos, "License plate recognition from still images and video sequences: A survey," IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 3, pp. 377-391, 2008.

[7] D. Tao, X. Li, X. Wu, and S. Maybank, "General tensor discriminant analysis and Gabor features for gait recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1700-1715, 2007.

[8] P. Phillips, "Overview of the multiple biometric grand challenge," in MBGC Workshop, 2009. [Online]. Available: http://face.nist.gov/mbgc/

[9] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Press, 2008.

[10] D. Thirde, L. Li, and J. Ferryman, "An overview of the PETS 2006 dataset," in Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2006), 2006, pp. 47-50.

[11] S. Donnelly, "A chastened airport watches for suspects," Time Magazine, August 2003.

[12] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2000.

[13] S. Zhou, R. Chellappa, and B. Moghaddam, "Visual tracking and recognition using appearance-adaptive models in particle filters," IEEE Transactions on Image Processing, vol. 11, pp. 1434-1456, 2004.

[14] V. Manohar, M. Boonstra, V. Korzhova, P. Soundararajan, D. Goldgof, R. Kasturi, S. Prasad, H. Raju, R. Bowers, and J. Garofolo, "PETS vs. VACE evaluation programs: A comparative study," in Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2006.