Research Journal of Applied Sciences, Engineering and Technology 4(20): 4171-4177, 2012
ISSN: 2040-7467
© Maxwell Scientific Organization, 2012
Submitted: May 08, 2012    Accepted: May 29, 2012    Published: October 15, 2012

Detecting Abnormal Behaviors in Crowded Scenes

1Oluwatoyin P. Popoola and 2Hui Ma
1Systems Engineering Department, Faculty of Engineering, University of Lagos, Nigeria
2College of Electronic Engineering, Heilongjiang University, Harbin 150080, PR China

Abstract: Situational awareness is a basic function of the human visual system that is attracting considerable research attention in machine vision and related research communities. There is an increasing demand for smarter video surveillance of public and private spaces using intelligent vision systems that can distinguish what is semantically meaningful to a human observer as 'normal' and 'abnormal' behavior. In this study we propose a novel, robust behavior descriptor for encoding the intrinsic local and global behavior signatures in crowded scenes. Crowd scenes transitioning from normal to abnormal behaviors such as "rush", "scatter" and "herding" were modeled and detected. The descriptor uses features that encode both local and global signatures of crowd interactions. Bayesian topic modeling is used to capture the intrinsic structure of atomic activities in the video frames and to detect the transition from normal to abnormal behavior. Experimental results and analysis of the proposed framework on two publicly available crowd behavior datasets show the effectiveness of this method compared with other methods for anomaly detection in crowds, with very good detection accuracy.

Keywords: Abnormal behavior detection, intelligent video surveillance, situation awareness

INTRODUCTION

Situational awareness is a very robust function of the human visual system.
The visual system can quickly and efficiently scan large quantities of time-varying, low-level imagery data and pick out information that 'stands out' as semantically meaningful and useful for situational awareness. This ability of the human visual system has inspired a great deal of research into computational modeling techniques for the visual analysis of human behavior in the computer vision and machine learning communities. Intelligent visual surveillance systems should be able to answer queries such as 'what is going on in the scene?', 'what type of event or activity is taking place?' or 'where is this action taking place in the scene?'. Visual phenomena, although very difficult to model analytically, are generally approached more effectively through statistical machine learning. The ever-increasing demand from governments and commercial businesses for tools to effectively monitor public and private spaces, towards crime detection and prevention, efficient public transport management and personalized healthcare delivery, has fuelled interest and research funding for machine-based visual analysis of behavior. Some recent application areas include:

Healthcare delivery: Social robots perform automated visual analysis of human behavior in elderly care facilities. These personalized assistants learn from visual information to mimic and interact with the elderly. They provide routine assessment of a person's behavior and real-time assistance in the event of sickness, accident or fall.

Efficient video content search, summarization and indexing: Searching for certain behaviors in a video database would be tedious, and sometimes impossible, without meta-data such as names or times of events. With automated visual search by object behavior category, such tasks can be done very rapidly.
Human Computer Interaction (HCI): Ever more user-friendly HCI systems are being developed, such as the game console applications produced by the gaming industry and teleconferencing systems, in which visual sensors capture and analyze users' behavior for intelligent and meaningful communication and interaction.

Intelligent visual surveillance systems: These electronically perceive and understand the behaviors of people (or of objects moved by humans, such as vehicles) in image sequences (the frames of video clips). This is by far the most popular motivation for situational awareness research.

Corresponding Author: Hui Ma, College of Electronic Engineering, Heilongjiang University, Harbin 150080, PR China

'Behavior' is a generic term that usually refers to the observable actions of agents such as persons or other moving objects in a scene. Behaviors become salient when they differ from the regular patterns in their context and do not conform either temporally or spatially; that is, they 'stand out' relative to their surroundings in space and time. Such unusual activities are by nature rare, unexpected and out of the ordinary. These characteristics make modeling behavior for the purpose of detecting abnormality a non-trivial challenge. The objective of this study is the early detection of contextually abnormal crowd behaviors using a robust behavior representation.

LITERATURE REVIEW

Shape dynamics has been used in Vaswani et al. (2003) and Vaswani et al. (2005) to detect abnormal activity in a group of moving and interacting objects by modeling the changing configuration as a moving and deforming "shape", with continuous Hidden Markov Models (HMMs) capturing the landmark shape dynamics.
Tracking, however, requires effective background subtraction and is limited by factors such as occlusion and shadows, which restricts its utility for encoding complex behavior in real-world occurrences of anomalies. It is also sensitive to tracking errors, even when they occur in only a few frames, and it fails when modeling crowded and complicated scenes.

PROPOSED METHODOLOGY

In this study we propose a new motion descriptor for modeling crowd flows based on local pixel motion from frame to frame. This descriptor is made up of optical flow, potential and Lyapunov exponent computations on the sequence of moving frames. Low-level local and global spatio-temporal features are used to describe crowd behaviors instead of tracking-based approaches. The feature descriptor is a vector of optical flow, potential field and Lyapunov exponent components in the flow sequences.

Optical flow: This approximates the two-dimensional flow field from the image intensities. The difference in position of the same pixel content between two images is called the flow vector. Details can be found in Brox et al. (2004).

Approaches vary from 'holistic' methods, which try to obtain coarser-level (global) information from the main crowd flows for use in detecting anomalies, to bottom-up approaches that start at the pixel level. Davies et al. (1995) measure motion features at the pixel or pixel-neighborhood level, which are then aggregated to obtain motion properties for larger regions of an image; the aggregated results establish the overall dominant crowd direction and magnitude (velocities). Boghossian and Velastin (1999) worked on detecting, in Hough space, emergencies including circular flow around the exits, which could indicate traffic jams; multi-directional divergence of crowd flow, indicating potential danger such as a fire outbreak; and obstacles in the flow paths that might correspond to injured pedestrians. Andrade et al.
(2006) analyzed crowd motion using optical flow and Hidden Markov Models (HMMs) to characterize "usual behavior" in the crowd. Ali and Shah (2007) proposed a method of segmenting crowd flow which reveals the crowd's instabilities, as observed from the coherent structures that appear when Lagrangian particle dynamics is applied to the optical flow fields; instability is detected when the number of segments in an analyzed sequence changes. Mehran et al. (2009) used the social force model (Helbing and Molnár, 1995) to model crowds for abnormal crowd behavior detection. Particle advection according to the computed optical flow is performed to estimate the social forces. Normal patterns of forces over time are modeled by mapping the magnitude of the interaction force vectors onto the image plane, and the likelihood of the force flow is estimated to classify each frame as normal or abnormal.

In this study, we adopt an approach that uses the probability of occurrence of feature vectors as an indicator of the presence or absence of an anomaly. The features used to encode behavior can be "global" or "local", and either spatial, temporal or both.

Potential functions: Drawing on flow concepts from fluid dynamics, we posit that the incompressible and irrotational flow properties of the optical flow field are constituents of the behavior signature that characterizes the motion types. A simplified mathematical model of fluids assumes, based on the conservation properties of fluids, that the fluid is incompressible and irrotational. These assumptions lead to potential functions, which are scalar functions that characterize the flow in a unique way. The optical flow F(r) denotes a planar vector field. According to the Helmholtz decomposition theorem, any continuous vector field F(r) with continuous first partial derivatives can be uniquely expressed in terms of the negative gradient of a scalar potential φ(r) and the curl of a vector potential a(r):

F(r) = −∇φ(r) + ∇ × a(r)    (1)
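As a numerical illustration of this decomposition (our own sketch, not code from the paper; the function name and test grid are assumptions), a 2D flow field sampled on a periodic grid can be split in the Fourier domain: projecting the transform of (u, v) onto the wave vector k isolates the irrotational (curl-free) part, and the remainder is the incompressible (divergence-free) part:

```python
import numpy as np

def helmholtz_decompose(u, v):
    """Split a 2D flow field (u, v) sampled on a periodic grid into an
    irrotational (curl-free) part and an incompressible (divergence-free)
    part, by projecting the Fourier transform onto the wave vector k."""
    ny, nx = u.shape
    kx = np.fft.fftfreq(nx).reshape(1, nx)   # frequencies along x (columns)
    ky = np.fft.fftfreq(ny).reshape(ny, 1)   # frequencies along y (rows)
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                           # avoid 0/0 at the mean (k = 0) mode
    u_hat, v_hat = np.fft.fft2(u), np.fft.fft2(v)
    # longitudinal projection (k . F_hat) k / |k|^2 -> curl-free component
    proj = (kx * u_hat + ky * v_hat) / k2
    u_irr = np.real(np.fft.ifft2(proj * kx))
    v_irr = np.real(np.fft.ifft2(proj * ky))
    # the remainder is the divergence-free (incompressible) component
    return (u_irr, v_irr), (u - u_irr, v - v_irr)
```

On a field that is a pure gradient of a scalar, the incompressible remainder vanishes to numerical precision, which provides a quick sanity check of the projection.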
Here the scalar potential φ(r) and the vector potential a(r) capture the irrotational and incompressible parts of the vector field, respectively. An incompressible vector field is divergence free, ∇ · F = 0, and an irrotational vector field is curl free, ∇ × F = 0. In two dimensions the corresponding potential functions are known as the stream function ψ and the velocity potential φ, respectively. Following Shandong et al. (2010), we use Fourier transforms to decompose the vector field into its incompressible and irrotational parts and to estimate the potentials. With û(k) and v̂(k) the transforms of the flow components and k = (k_x, k_y) the spatial frequency, |k|² = k_x² + k_y², the potentials are recovered (up to the zero-frequency component) by inverse transform of:

φ̂(k) = i (k_x û + k_y v̂) / |k|²    (2)

ψ̂(k) = −i (k_y û − k_x v̂) / |k|²    (3)

Spatio-temporal features have shown particular promise in motion understanding due to their rich descriptive power (Niebles and Li, 2007) and are therefore widely used as feature descriptors.

Finite Time Lyapunov Exponents (FTLE): The FTLE is a scalar field that measures how much particles separate after a given interval of time; it describes mass transport in the system defined by the vector fields. Any two particles starting in the same region of the exponent field will tend to flow together in time as the dynamical system evolves, whereas two particles straddling a ridge in the field will tend to diverge exponentially in time (Shadden et al., 2005). For example, suppose a point is located at x ∈ D at time t₀. When this point is advected by the flow, it moves to Φ_{t₀}^{t₀+T}(x) after the time interval T. The amount of stretching about this trajectory can be seen by considering a perturbed point x + δx(0), where δx(0) is infinitesimally small and arbitrarily oriented. After the time interval T, the perturbation becomes:

δx(t₀ + T) = (dΦ_{t₀}^{t₀+T}(x)/dx) δx(0) + O(‖δx(0)‖²)

Dropping the O(‖δx(0)‖²) term gives the growth of the linearized perturbation. Using the Euclidean norm, the magnitude of the perturbation is:

‖δx(t₀ + T)‖ = √⟨(dΦ/dx) δx(0), (dΦ/dx) δx(0)⟩ = √⟨δx(0), Δ δx(0)⟩    (4)

where ⟨·, ·⟩ denotes the inner product, * denotes the transpose and Δ is the symmetric matrix
This can be written as:

Δ = (dΦ_{t₀}^{t₀+T}(x)/dx)* (dΦ_{t₀}^{t₀+T}(x)/dx)    (5)

Δ is a finite-time version of the Cauchy-Green deformation tensor. Although Δ is a function of x, t₀ and T, maximum stretching occurs between two very close particles when δx(0) is aligned with the eigenvector associated with the maximum eigenvalue of Δ. Thus, if λ_max(Δ) is the maximum eigenvalue of Δ, then:

max ‖δx(t₀ + T)‖ = √λ_max(Δ) ‖δx(0)‖

σ_{t₀}^{T}(x) = (1/|T|) ln √λ_max(Δ)    (6)

where Eq. (6) represents the largest finite-time Lyapunov exponent with a finite integration time T, corresponding to the point x ∈ D at time t₀.

The stream function provides information about the steady, non-divergent part of the flow, while the velocity potential encodes the local changes in the non-curling motions.

The semantic descriptive label assigned to each crowd scene as it evolves at the global (frame) level is considered an ensemble of behavior dynamics aggregated across the local spatio-temporal patches that make up the frames. This implies that the singular behavior of one or a minority of the moving targets does not suffice to infer the global behavior until similar abnormal behavior is observed across the scene. The feature vector computed across the patches for all frames therefore consists of statistics of low-level features of the flow vectors, namely magnitude, potentials (incompressible and irrotational) and Lyapunov exponent values:

f = [u, v, ψ, φ, ε]    (7)

where u, v capture the motion magnitude in the velocity field from optical flow; ψ and φ are the incompressible and irrotational flow components from the flow potentials; and ε is the Lyapunov exponent value, indicating the amount of local divergence or convergence.

The feature vectors are clustered using the bio-inspired clustering method proposed in Wang and Popoola (2010) to form a codebook of visual words. With this representation the original video frames can be discarded, and each video clip is represented by the visual words from the codebook.
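The FTLE quantity in Eq. (5)-(6) can be sketched numerically by approximating the deformation gradient of the flow map with central differences and taking the largest eigenvalue of the Cauchy-Green tensor. This toy example is our own (not the paper's implementation); it uses an analytic saddle flow map whose FTLE is known in closed form:

```python
import numpy as np

def ftle(flow_map, x, y, T, h=1e-4):
    """Largest finite-time Lyapunov exponent at (x, y): approximate the
    deformation gradient dPhi/dx of the flow map by central differences,
    form the Cauchy-Green tensor Delta = J^T J, and return
    (1/|T|) * ln sqrt(lambda_max)."""
    xpx, ypx = flow_map(x + h, y)
    xmx, ymx = flow_map(x - h, y)
    xpy, ypy = flow_map(x, y + h)
    xmy, ymy = flow_map(x, y - h)
    J = np.array([[xpx - xmx, xpy - xmy],
                  [ypx - ymx, ypy - ymy]]) / (2.0 * h)
    delta = J.T @ J                          # Cauchy-Green deformation tensor
    lam_max = np.linalg.eigvalsh(delta)[-1]  # eigvalsh returns ascending eigenvalues
    return np.log(np.sqrt(lam_max)) / abs(T)
```

For the linear saddle map Φ(x, y) = (x e^{aT}, y e^{−aT}), the deformation gradient is diagonal and the FTLE evaluates to exactly a, so the finite-difference estimate can be checked against that value.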
Modeling with Bayesian topic models: Variants of the basic topic model shown in Fig. 1, namely Latent Dirichlet Allocation (LDA) (Blei et al., 2010; Blei et al., 2003), are widely used in machine learning. They provide a simple way to analyze large volumes of unlabeled text, such as a huge database of written documents. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.

Fig. 1: LDA topic model (Blei et al., 2003)

The marginal likelihood of the words of a document under this model is:

p(w|α, β) = (Γ(Σ_k α_k) / Π_k Γ(α_k)) ∫ (Π_{k=1}^{K} θ_k^{α_k − 1}) Π_{n=1}^{N_j} Σ_{k=1}^{K} θ_k β_{k, w_n} dθ    (12)

where N_j is the number of words in document j. This marginal likelihood of the words given the hyperparameters, p(w|α, β), is intractable, and the posterior of the hidden variables given the hyperparameters, p(θ, z|w, α, β), is therefore also intractable. Variational inference (Blei et al., 2003) is used, introducing free variational parameters γ and φ.

Data is assumed to be observed from a generative probabilistic process with hidden variables, namely the thematic structure. This generative model for documents is based on probabilistic sampling rules that describe how the words in documents might be generated from latent (random) variables. The generative process operates under the bag-of-words assumption, i.e., it makes no assumptions about the order in which words appear in documents.
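To make the generative process concrete, the following small sketch (ours, with illustrative variable names; in practice β would be the learned topic-word matrix rather than a random stand-in) samples one clip's bag of codebook words exactly as described: draw a topic mixture from the Dirichlet prior, then a topic and a word for each token:

```python
import numpy as np

def generate_clip(alpha, beta, n_words, rng):
    """Sample one 'document' (a video clip as a bag of codebook words)
    from the LDA generative process: draw a topic mixture theta from the
    Dirichlet prior alpha, then for every word draw a topic z ~ theta
    and a codebook word w ~ beta[z]."""
    n_topics, vocab_size = beta.shape
    theta = rng.dirichlet(alpha)              # per-clip mixture of atomic activities
    counts = np.zeros(vocab_size, dtype=int)  # bag-of-words representation
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)     # latent topic for this token
        w = rng.choice(vocab_size, p=beta[z]) # observed codebook word
        counts[w] += 1
    return counts
```

Inference runs this process in reverse: given the observed counts, it recovers the per-clip topic mixture and the topic-word distributions.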
The LDA model for each frame has the following parameters:

α: A K-dimensional vector governing the Dirichlet distributions of the topics in a training video dataset:

α = [α_1, …, α_K]    (8)

β: A K×W matrix representing the multinomial distributions of the codebook words, which form the vocabulary, for all learned topics:

β = [β_1, …, β_K]    (9)

For any given document, the probability of the distribution over topics is given by:

p(θ|α) = (Γ(Σ_{k=1}^{K} α_k) / Π_{k=1}^{K} Γ(α_k)) Π_{k=1}^{K} θ_k^{α_k − 1}    (10)

and each word is drawn from a topic-conditional multinomial:

p(w_n|θ, β) = Σ_{k=1}^{K} p(w_n|z_n = k, β) p(z_n = k|θ)    (11)

The variational posterior factorizes over the free parameters γ and φ:

q(θ, z|γ, φ) = q(θ|γ) Π_{n=1}^{N} q(z_n|φ_n)    (13)

An optimal (γ*, φ*) is found by maximizing the lower bound L(γ, φ; α, β) on the log likelihood log p(w|α, β). Applying this to visual data, document words correspond to feature vectors, while 'topics' correspond to atomic activities that frequently occur together in the video clips. The log likelihood of observing a previously unseen clip w* is approximated by its maximized lower bound L(γ*, φ*; α, β). γ* is the Dirichlet parameter that represents the video sequence in the topic simplex; φ* holds the probabilities of the topics that show up in w*; and α, β are the model parameters computed in the M-step of the expectation-maximization algorithm. Given the model, L(γ*, φ*; α, β) is evaluated by holding α and β constant and updating (γ*, φ*) until L(γ*, φ*; α, β) is maximized. The normality score for a test clip is defined as:

score(w*) = L(γ*, φ*; α, β)    (14)

Scores below a defined threshold are considered an indicator that an anomaly is present in that frame.

EXPERIMENTS AND RESULTS

The crowd behavior data used in our experiments are extracted from the UMN (Mehran and Shah, 2011) and PETS (University of Reading, 2010) surveillance video databases for monitoring crowd activity. Clips with the following behavior types were used for training:

Type 1: Sudden accelerated motion, "rush".
Crowd motion that gradually changes into a panic situation, but with motion towards one direction.

Type 2: A crowd moving at a normal pace suddenly disperses in different directions, i.e., "scatter".

Type 3: Instantaneous movement towards an exit: "herding".

Fig. 2: Video segmentation into clips and feature extraction from patches (7×7 pixels)

Fig. 3: Frame classification for 'rush' anomaly

Fig. 4: Frame classification for 'scatter' anomaly

Fig. 5: Frame classification for 'herding' anomaly

The descriptor uses a bio-inspired ant-clustering algorithm rather than the commonly used k-means algorithm. The effectiveness of this descriptor in modeling crowd behavior was demonstrated using a Bayesian topic model based on the bag-of-features framework, detecting transitions from contextually normal to abnormal crowd behavior, such as running, scattering and herding, in three different crowd scenes. Such early detection of the transition between the normal phase and the abnormal phase is of crucial importance in surveillance applications.

Two of the scenes are outdoors, while one is in an enclosed environment. There are typical outdoor issues such as lighting variation, occlusion and noisy image capture. From a random sample of clips, we selected portions containing transitions for use as a validation or test set; these were not used in the training stage. Atomic behaviors from patches were modeled using low-level visual pixel features in the flow fields. These form the building blocks for modeling the global behaviors, as described in the preceding sections. Long video sequences of the three scenes were first divided into 2-3 sec clips and features were extracted (Fig. 2). Feature quantization was done using ant clustering and the visual codebook was built, through which the clips
are represented as counts of the codebook vector types that appear in the clips.

A 20-topic Bayesian LDA model was fitted to build the models for the three different crowd scenes, using only normal frames. The likelihood of a clip containing frames with abnormal behavior is determined given the model parameters of the corresponding normal-behavior model. Frames whose negative log-likelihood falls below a defined threshold are classified as abnormal, as shown in Fig. 3, 4 and 5. The figures show a sample scene of normal behavior and an abnormal one.

Table 1: Detection accuracy for two pixel patch sizes of feature descriptors (20-topic LDA model)
Patch size    Rush     Scatter    Herding
7×7           0.935    0.915      0.945
9×9           0.861    0.970      0.901

Fig. 6: ROC curve (true positive rate vs. false positive rate) for classifier performance on the three behavior types (Type 1-RUSH, Type 2-SCATTER, Type 3-HERDING, against a baseline) using a 7×7 patch size

Further study will be done on the automatic recognition of emerging anomalies in crowds and on indicating the local regions of the scenes where such anomalies occur.

ACKNOWLEDGMENT

The authors thank Prof. Wang Kejun (Harbin Engineering University) for his helpful comments.

REFERENCES

Ali, S. and M. Shah, 2007. A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, June 17-22, pp: 1-6.
Andrade, E.L., R.B. Fisher and S. Blunsden, 2006. Detection of emergency events in crowded scenes. Proceeding of the Institution of Engineering and Technology Conference on Crime and Security, Savoy Place, London, UK, June 13-14, pp: 528-533.
Blei, D.M., A.Y. Ng and M.I. Jordan, 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3: 993-1022.
Blei, D., L. Carin and D. Dunson, 2010. Probabilistic topic models. Signal Process. Mag., IEEE, 27(6): 55-65.
Each figure also shows the ground truth (as viewed by a human observer) for the frames being classified: green indicates normal behavior, while red indicates that an anomaly is present in the frame.

The effect of feature patch size on detection accuracy (Table 1) shows that a 7×7 pixel patch gives consistently high detection rates across the three behavior types (Fig. 6). Of the three scenes, the best performance is achieved for the 'herding' scene.

CONCLUSION

This study proposes a novel behavior descriptor for compact representation of crowd behavior, towards abnormal behavior detection. It uses an aggregation of features from optical flow, the rotational and irrotational components of the flow-field potentials and the Lyapunov exponent to capture both local and global characteristics of crowd behavior. This descriptor is not based on motion tracking. It also uses a bio-inspired ant-clustering algorithm in the codebook development process instead of the commonly used k-means algorithm.

Boghossian, B.A. and S.A. Velastin, 1999. Motion-based machine vision techniques for the management of large crowds. Proceeding of the IEEE 6th International Conference on Electronics, Circuits and Systems, Cyprus, September 5-8, pp: 961-964.
Brox, T., A. Bruhn, N. Papenberg and J. Weickert, 2004. High accuracy optical flow estimation based on a theory for warping. Proceeding of the European Conference on Computer Vision (ECCV), pp: 25-36.
Davies, A.C., J.H. Yin and S.A. Velastin, 1995. Crowd monitoring using image processing. Elec. Commun. Eng. J., 7(1): 37-47.
Helbing, D. and P. Molnár, 1995. Social force model for pedestrian dynamics. Phys. Rev. E, 51(5): 4282-4286.
Mehran, R., A. Oyama and M. Shah, 2009. Abnormal crowd behavior detection using social force model. Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, Miami Beach, Florida, USA, June 19-26, pp: 935-942.
Mehran, R.
and M. Shah, 2011. UMN Dataset. Retrieved from: http://www.cs.ucf.edu/~ramin/?page_id=24.
Niebles, J.C. and F.F. Li, 2007. A hierarchical model of shape and appearance for human action classification. Proceeding of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, United States, June 17-22, pp: 1-8.
Shadden, S.C., F. Lekien and J.E. Marsden, 2005. Definition and properties of Lagrangian coherent structures from finite-time Lyapunov exponents in two-dimensional aperiodic flows. Phys. D: Nonlin. Phenom., 212(3-4): 271-304.
Shandong, W., B.E. Moore and M. Shah, 2010. Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes. Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, California, June 13-18, pp: 2054-2060.
University of Reading, 2010. PETS 2010 Benchmark Data. Retrieved from: http://www.cvg.rdg.ac.uk/PETS2010/a.html.
Vaswani, N., A. Roy Chowdhury and R. Chellappa, 2003. Statistical shape theory for activity modeling. Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong, April 6-10, pp: 493-496.
Vaswani, N., A.K. Roy-Chowdhury and R. Chellappa, 2005. Shape activity: A continuous-state HMM for moving/deforming shapes with application to abnormal activity detection. IEEE Trans. Image Process., 14(10): 1603-1616.
Wang, K. and O.P. Popoola, 2010. Ant-based clustering of visual-words for unsupervised human action recognition. Proceeding of the Second World Congress on Nature and Biologically Inspired Computing, Kitakyushu, Japan, December 15-17, pp: 654-659.