Research Journal of Applied Sciences, Engineering and Technology 4(20): 4171-4177, 2012
ISSN: 2040-7467
© Maxwell Scientific Organization, 2012
Submitted: May 08, 2012
Accepted: May 29, 2012
Published: October 15, 2012
Detecting Abnormal Behaviors in Crowded Scenes

1Oluwatoyin P. Popoola and 2Hui Ma
1Systems Engineering Department, Faculty of Engineering, University of Lagos, Nigeria
2College of Electronic Engineering, Heilongjiang University, Harbin 150080, PR China
Abstract: Situational awareness is a basic function of the human visual system which is attracting a lot of research attention in machine vision and related research communities. There is an increasing demand for smarter video surveillance of public and private spaces using intelligent vision systems that can distinguish what is semantically meaningful to the human observer as 'normal' and 'abnormal' behavior. In this study we propose a novel, robust behavior descriptor for encoding the intrinsic local and global behavior signatures in crowded scenes. Crowd scenes transitioning from normal to abnormal behaviors such as "rush", "scatter" and "herding" were modeled and detected. The descriptor uses features that encode both local and global signatures of crowd interactions. Bayesian topic modeling is used to capture the intrinsic structure of atomic activities in the video frames and to detect the transition from normal to abnormal behavior. Experimental results and analysis of the proposed framework on two publicly available crowd behavior datasets show the effectiveness of this method compared to other methods for anomaly detection in crowds, with very good detection accuracy.
Keywords: Abnormal behavior detection, intelligent video surveillance, situation awareness

INTRODUCTION
Situational awareness is a very robust function of the human visual system, which can quickly and efficiently scan through large quantities of time-varying, low-level imagery data and pick out information that 'stands out' as semantically meaningful and useful for situational awareness. This ability of the human visual system has inspired a great deal of research into computational modeling techniques for visual analysis of human behavior in the computer vision and machine learning communities. Intelligent visual surveillance systems should be able to provide answers to queries such as 'what is going on in the scene?', 'what type of event or activity is taking place?', or 'where is this action taking place in the scene?'. Visual phenomena, although very difficult to model analytically, are generally more effectively approached through statistical machine learning.
The ever-increasing demand from governments and commercial businesses for tools to effectively monitor public and private spaces, towards crime detection and prevention, efficient public transport management, as well as personalized healthcare delivery, has fuelled interest and research funding for machine-based visual analysis of behavior. Some of the recent application areas include:

•	Healthcare delivery: Social robots perform automated visual analysis of human behavior in elderly people's home care facilities. These personalized assistants learn from visual information to mimic and interact with the elderly. They provide routine assessment of each person's behavior and real-time assistance under conditions of sickness, accident or fall.
•	Efficient video content search, summarization and indexing: Searching for certain behaviors in a video database would be tedious, and sometimes impossible, without meta-data like names, time of event, etc. With automated visual search by object behavior categories, such tasks can be done very rapidly.
•	Human Computer Interaction (HCI): Ever more user-friendly human-computer interaction systems, such as the game console applications produced by the gaming industry and teleconferencing systems, use visual sensors to capture and perform visual analysis of users' behaviors for intelligent and meaningful communication and interaction.
•	Intelligent visual surveillance systems: These electronically perceive and understand the behaviors of people (or objects moved by humans, such as vehicles) in image sequences (frames of video clips). This is by far the most popular motivation for situational awareness. 'Behavior' is a generic term often referring to the observable actions of agents such as persons or other moving objects in the scene. Behaviors become salient when they differ from the regular patterns in that context and do not conform either temporally or spatially; that is, they 'stand out' as different relative to the context of their surroundings in space and time.

Corresponding Author: Hui Ma, College of Electronic Engineering, Heilongjiang University, Harbin 150080, PR China
Such unusual activities are by nature rare, unexpected and out of the ordinary. These characteristics make modeling behavior for the purpose of detecting abnormal behavior a non-trivial challenge. The objective of this study is the early detection of contextually abnormal crowd behaviors using a robust behavior representation.
LITERATURE REVIEW

Approaches vary from 'holistic' methods, which try to obtain coarser-level (global) information from the main crowd flows for use in detecting anomalies, to bottom-up approaches that start from the pixel level. Davies et al. (1995) measure motion features at pixel or pixel-neighborhood levels, which are then aggregated to obtain motion properties for larger regions in an image. The aggregated results establish the overall dominant crowd direction and magnitude (velocities). Boghossian and Velastin (1999) worked on how to detect emergency situations, including circular flow around the exits, detected in the Hough space, which could indicate traffic jams; multi-directional divergence of crowd flow, indicating potential danger such as a fire outbreak; and obstacles in the flow paths that might correspond to injured pedestrians. Andrade et al. (2006) analyzed crowd motion using optical flow and Hidden Markov Models (HMMs) to characterize "usual behavior" in the crowd. Ali and Shah (2007) proposed a method of segmenting crowd flow, which revealed the crowd's instabilities as observed from the coherent structures which appear by applying Lagrangian particle dynamics to the optical flow fields. Instability is detected when the number of segments of some analyzed sequence changes. Mehran et al. (2009) used the social force model (Helbing and Molnár, 1995) to model crowds for abnormal crowd behavior detection. Particle advection according to the computed optical flow is done to estimate the social forces. Normal patterns of forces over time are modeled by mapping the magnitude of the interaction force vectors to the image plane. The likelihood force flow is estimated to classify each frame as normal or abnormal.

In this study, we adopt an approach that uses the probability of occurrence of feature vectors as an indicator of the presence or absence of an anomaly. The features used to encode behavior can be "global" or "local", either spatial or temporal or both. Spatio-temporal features have shown particular promise in motion understanding due to their rich descriptive power (Niebles and Li, 2007) and are therefore widely used as feature descriptors. Shape dynamics has also been used in Vaswani et al. (2003) and Vaswani et al. (2005) to detect abnormal activity in a group of moving and interacting objects by modeling the changing configuration as a moving and deforming "shape", using continuous Hidden Markov Models (HMMs) to capture the landmark shape dynamics. Tracking, however, requires effective background subtraction and is limited by factors like occlusion and shadows, restricting its utility for encoding the complex behavior found in real-world occurrences of anomalies. It is also sensitive to tracking errors, even if they occur in only a few frames. This approach also fails when modeling crowded and complicated scenes.

PROPOSED METHODOLOGY

In this study we propose a new motion descriptor for modeling crowd flows based on local pixel motion from frame to frame. This descriptor is made up of the optical flow, potentials and Lyapunov exponent computations on the sequence of moving frames. Low-level local and global spatio-temporal features are used to describe crowd behaviors instead of tracking-based approaches. The feature descriptor is a vector of optical flow, potential field and Lyapunov exponent components in the flow sequences.

Optical flow: This approximates the two-dimensional flow field from the image intensities. Each vector is the difference in position of the same pixel content between two images and is called a flow vector. Details can be found in Brox et al. (2004).

Potential functions: Drawing from flow concepts in fluid dynamics, we posit that the incompressible and irrotational flow properties of the optical flow field are constituents of the behavior signature that characterizes the motion types. A simplified mathematical model of fluids assumes that the fluid is incompressible and irrotational, based on conservation properties of fluids. These assumptions lead to potential functions, which are scalar functions that characterize the flow in a unique way. The optical flow F(x, y) denotes a planar vector field.

According to the Helmholtz decomposition theorem: 'Let F(r) be any continuous vector field with continuous first partial derivatives. Then F(r) can be uniquely expressed in terms of the negative gradient of a scalar potential φ(r) and the curl of a vector potential a(r)':

	F(r) = −∇φ(r) + ∇ × a(r)	(1)

where φ(r) is the scalar potential and a(r) is the vector potential, which give the irrotational and incompressible parts of the vector field, respectively. An incompressible vector field is divergence free (∇ · F = 0) and an irrotational vector field is curl free (∇ × F = 0). The two associated scalar functions are known as the stream function ψ and the velocity potential φ, respectively. Following Shandong et al. (2010), we use Fourier transforms to decompose the incompressible and irrotational parts of the vector field and estimate the potential functions from the Fourier coefficients (û, v̂) of the flow components (u, v), setting both potentials to zero at the zero frequency:

	ψ̂(kx, ky) = i(ky û − kx v̂)/(kx² + ky²),	ψ̂(0, 0) = 0	(2)

	φ̂(kx, ky) = −i(kx û + ky v̂)/(kx² + ky²),	φ̂(0, 0) = 0	(3)
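The Fourier-domain decomposition behind Eq. (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name is mine, the sign convention used here is F = ∇φ + F_inc, and the potentials are fixed to zero at the zero frequency as in the equations above.

```python
import numpy as np

def helmholtz_decompose(u, v):
    """Split a 2-D flow field (u, v) into its irrotational (curl-free)
    and incompressible (divergence-free) parts via the FFT.
    Returns (u_irr, v_irr, u_inc, v_inc)."""
    h, w = u.shape
    kx, ky = np.meshgrid(np.fft.fftfreq(w) * 2 * np.pi,   # wavenumbers along x
                         np.fft.fftfreq(h) * 2 * np.pi)   # wavenumbers along y
    k2 = kx ** 2 + ky ** 2
    k2[0, 0] = 1.0                               # avoid division by zero at DC

    U, V = np.fft.fft2(u), np.fft.fft2(v)
    div_hat = 1j * kx * U + 1j * ky * V          # Fourier transform of div F
    phi_hat = div_hat / (-k2)                    # solves Laplacian(phi) = div F
    phi_hat[0, 0] = 0.0                          # potential defined up to a constant
    # Irrotational part is the gradient of the scalar potential
    u_irr = np.real(np.fft.ifft2(1j * kx * phi_hat))
    v_irr = np.real(np.fft.ifft2(1j * ky * phi_hat))
    # Incompressible (divergence-free) part is the remainder
    return u_irr, v_irr, u - u_irr, v - v_irr
```

Applied to the optical flow field of a frame pair, this yields the irrotational and incompressible components whose potentials feed the behavior descriptor.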
The stream function provides information regarding the steady and non-divergent parts of the flow, while the velocity potential encodes information regarding the local changes in the non-curling motions.

Finite Time Lyapunov Exponents (FTLE): The FTLE is a scalar field that measures how much particles separate after a given interval of time. It describes mass transport in a system based on the vector fields. Any two particles starting in the same region of the exponent field will tend to flow together in time as the dynamical system evolves, whereas two particles straddling a ridge in the field will tend to diverge exponentially in time (Shadden et al., 2005). For example, suppose a point is located at x ∈ D at time t₀. When this point is advected by the flow, it moves to φ(t₀+T; t₀, x) after the time interval T. The amount of stretching about this trajectory can be seen by considering a perturbed point x + δx(0), where δx(0) is infinitesimally small and arbitrarily oriented. After the time interval T, the perturbation becomes:

	δx(T) = φ(t₀+T; t₀, x + δx(0)) − φ(t₀+T; t₀, x) = [dφ(t₀+T; t₀, x)/dx] δx(0) + O(‖δx(0)‖²)

By dropping the O(‖δx(0)‖²) terms, the growth of the linearized perturbation is obtained. Using the Euclidean norm, the magnitude of the perturbation is given by:

	‖δx(T)‖ = √⟨δx(0), [dφ/dx]* [dφ/dx] δx(0)⟩	(4)

where * denotes the transpose. The symmetric matrix:

	Δ = [dφ(t₀+T; t₀, x)/dx]* [dφ(t₀+T; t₀, x)/dx]	(5)

is a finite-time version of the Cauchy-Green deformation tensor; Δ is a function of x, t₀ and T. Maximum stretching occurs between two very close particles when δx(0) is aligned with the eigenvector associated with the maximum eigenvalue of Δ. Thus, if λmax(Δ) is the maximum eigenvalue of Δ, then:

	max ‖δx(T)‖ = √(λmax(Δ)) ‖δx(0)‖  and  σ = (1/|T|) ln √(λmax(Δ))	(6)

where σ in Eq. (6) is the largest finite-time Lyapunov exponent with a finite integration time T, corresponding to point x ∈ D at time t₀.
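As a concrete illustration, the FTLE computation just described (advect particles, form the flow-map gradient, take the largest eigenvalue of the Cauchy-Green tensor, Eq. 4-6) can be sketched as follows. This is a minimal sketch under simplifying assumptions of my own: a steady analytic velocity field, forward-Euler advection and central-difference gradients; the function name is hypothetical.

```python
import numpy as np

def ftle_field(velocity, xs, ys, T, n_steps=50):
    """Finite-Time Lyapunov Exponent field for a 2-D velocity function
    velocity(x, y) -> (u, v), evaluated on the grid xs x ys."""
    X, Y = np.meshgrid(xs, ys)
    px, py = X.copy(), Y.copy()
    dt = T / n_steps
    for _ in range(n_steps):                     # advect the particle grid
        u, v = velocity(px, py)
        px, py = px + dt * u, py + dt * v
    # Flow-map gradient dphi/dx estimated by central differences on the grid
    dxdx = np.gradient(px, xs, axis=1); dxdy = np.gradient(px, ys, axis=0)
    dydx = np.gradient(py, xs, axis=1); dydy = np.gradient(py, ys, axis=0)
    # Largest eigenvalue of the Cauchy-Green tensor Delta = J^T J (Eq. 5)
    a = dxdx ** 2 + dydx ** 2
    b = dxdx * dxdy + dydx * dydy
    c = dxdy ** 2 + dydy ** 2
    lam_max = 0.5 * (a + c) + np.sqrt(0.25 * (a - c) ** 2 + b ** 2)
    # Eq. (6): sigma = (1/|T|) ln sqrt(lambda_max)
    return np.log(np.sqrt(np.maximum(lam_max, 1e-12))) / abs(T)
```

Ridges of high FTLE values mark regions where nearby particles separate fastest, which is what the ε component of the descriptor captures.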
The semantic descriptive label assigned to each of the crowd scenes as it evolves at the global (frame) level is considered an ensemble of behavior dynamics aggregated across the local spatio-temporal patches that make up the frames. This implies that the singular behavior of one or a minority of the moving targets does not suffice to infer the global behavior until similar abnormal behavior can be observed across the scene. Thus the feature vector computed across the patches for all the frames consists of the statistics of low-level features of the flow vectors, such as magnitude, potentials (incompressible and irrotational) and the Lyapunov exponent values:

	f = [u, v, ψ, φ, ε]	(7)

where u, v capture the motion magnitude in the velocity field from optical flow; ψ, φ are the incompressible and irrotational flow components from the potentials; and ε is the Lyapunov exponent value indicating the amount of local divergence or convergence.
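The per-patch statistics behind Eq. (7) can be assembled as in the following sketch. It assumes the five low-level fields (the optical-flow components, the two potentials and the FTLE values) have already been computed as arrays of the same shape; averaging over each patch is one plausible aggregation, since the text does not spell out the exact statistic used.

```python
import numpy as np

def patch_descriptors(u, v, psi, phi, eps, patch=7):
    """Build per-patch feature vectors f = [u, v, psi, phi, eps] (Eq. 7)
    by averaging each low-level field over non-overlapping patch x patch
    pixel blocks. All inputs are 2-D arrays of identical shape."""
    h, w = u.shape
    feats = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            block = np.s_[r:r + patch, c:c + patch]
            feats.append([u[block].mean(), v[block].mean(),
                          psi[block].mean(), phi[block].mean(),
                          eps[block].mean()])
    return np.asarray(feats)       # shape: (n_patches, 5)
```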
The feature vectors are clustered using the bio-inspired clustering method proposed in Wang and Popoola (2010) to form a codebook of visual words. With this representation, the original video frames can be discarded and each video clip represented by the visual words from the codebook.
Modeling with Bayesian topic models: Variants of the basic topic model shown in Fig. 1, namely the Latent Dirichlet Allocation (LDA) (Blei et al., 2010; Blei et al., 2003), are used in machine learning. They provide a simple way to analyze large volumes of unlabeled text, such as a huge database of written documents. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.
Fig. 1: LDA topic model (Blei et al., 2003)

Data is assumed to be observed from a generative probabilistic process that has hidden variables, namely the thematic structure. This generative model for documents is based on probabilistic sampling rules that describe how words in documents might be generated on the basis of latent (random) variables. The generative process operates on the bag-of-words assumption, i.e., it does not make any assumptions about the order of words as they appear in documents.

The LDA model for each frame has the following parameters:

•	α: A K-dimensional vector governing the Dirichlet distributions of the topics in a training video dataset: θ = (θ₁, …, θ_K)	(8)
•	β: A K×W-dimensional matrix representing the multinomial distributions of the codebook words which form the vocabulary for all learned topics, where β = [β₁, …, β_K]	(9)

For any given document, the probability of the distribution over topics is given by:

	p(θ|α) = [Γ(Σᵢ αᵢ)/Πᵢ Γ(αᵢ)] θ₁^(α₁−1) ⋯ θ_K^(α_K−1)	(10)

so that the joint distribution of a topic mixture θ, a set of topics z and a set of words w is:

	p(θ, z, w|α, β) = p(θ|α) Πₙ p(zₙ|θ) p(wₙ|zₙ, β)	(11)

and marginalizing over θ and z gives the likelihood of the words of document j:

	p(w|α, β) = [Γ(Σᵢ αᵢ)/Πᵢ Γ(αᵢ)] ∫ (Πᵢ θᵢ^(αᵢ−1)) (Πₙ Σᵢ Πᵥ (θᵢ βᵢᵥ)^(wₙᵛ)) dθ,  n = 1, …, N_j	(12)

where N_j is the number of words in document j. The marginal likelihood of the words given the hyperparameters, p(w|α, β), is intractable and therefore the marginal likelihood of the hidden variables given the hyperparameters, p(θ, z|w, α, β), is also intractable. Variational inference (Blei et al., 2003) is used to approximate it with a simpler distribution with free parameters γ and φ:

	q(θ, z|γ, φ) = q(θ|γ) Πₙ q(zₙ|φₙ)	(13)

An optimal (γ*, φ*) is found by maximizing the lower bound L(γ, φ; α, β) on the log likelihood, log p(w|α, β) ≥ L(γ, φ; α, β).

Applying this to visual data, document words correspond to feature vectors while 'topics' correspond to atomic activities that frequently occur together in the video clips. The log likelihood of observing a previously unseen clip w* is approximated by its maximized lower bound L(γ*, φ*; α, β). γ* is the Dirichlet parameter that represents the video sequence w* in the topic simplex, φ* are the probabilities of the topics that show up in w*, while α, β are the model parameters computed in the M-step of the expectation-maximization algorithm. Given the model, L(γ*, φ*; α, β) is evaluated by setting α, β as constants and updating γ*, φ* until L(γ*, φ*; α, β) is maximized. The normality score for a test clip is defined as this maximized lower bound:

	score(w*) = L(γ*, φ*; α, β)	(14)

Scores below a defined threshold are considered an indicator that an anomaly is present in that frame.

EXPERIMENTS AND RESULTS
The crowd behavior data used in our experiments are extracted from the UMN (Mehran and Shah, 2011) and PETS (University of Reading, 2010) surveillance video databases for monitoring crowd activity. Clips with the following behavior types were used for training:

•	Type 1: Sudden accelerated motion, "rush". Crowd motion that gradually changes into a panic situation, but the motion is towards one direction.
•	Type 2: A crowd moving at a normal pace suddenly disperses in different directions, i.e., "scatter".
•	Type 3: Instantaneous movement towards an exit: "herding".
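The training-and-scoring pipeline described in the preceding section (fit an LDA model on bag-of-visual-word counts from normal clips only, score new clips by their variational likelihood lower bound, threshold the score) can be sketched with scikit-learn's variational LDA. This is an illustrative stand-in, not the authors' implementation; the function names and threshold are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_normal_model(normal_counts, n_topics=20, seed=0):
    """Fit an LDA model on bag-of-visual-word counts (one row per clip)
    drawn from clips of normal behavior only."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    lda.fit(normal_counts)
    return lda

def normality_scores(lda, test_counts):
    """Variational lower bound on the log likelihood of each test clip
    under the normal-behavior model; low scores suggest an anomaly."""
    return np.array([lda.score(test_counts[i:i + 1])
                     for i in range(test_counts.shape[0])])

def classify(scores, threshold):
    """True where the clip's score falls below the anomaly threshold."""
    return scores < threshold
```

In practice the threshold would be chosen on a validation set containing known normal-to-abnormal transitions, as done for the ROC analysis below.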
Fig. 2: Video segmentation into clips and feature extraction from patches (7×7 pixels)

Fig. 3: Frame classification for 'rush' anomaly

Fig. 4: Frame classification for 'scatter' anomaly

Fig. 5: Frame classification for 'herding' anomaly

Two of the scenes are outdoors while one is in an enclosed environment. There are typical outdoor issues like lighting variation, occlusion and noisy image capture. After a random sample of clips, we select portions that contain transitions for use as the validation or test set. These were not used in the training stage. Atomic behaviors from patches were modeled using low-level visual pixel features in the flow fields. These form the building blocks for modeling the global behaviors as described in the preceding sections. Long video sequences of the three scenes were first divided into 2-3 sec clips and features were extracted (Fig. 2). Feature quantization was done using ant clustering and the visual codebook was built, through which the clips
are represented as counts of the codebook vector types that appear in the clips.

A 20-topic Bayesian LDA model was fitted to build the models for the three different crowd scenes using only normal frames. The likelihood of a clip containing frames with abnormal behavior is determined given the model parameters of the corresponding normal behavior model. Frames whose negative log-likelihood falls below a defined threshold are classified as abnormal, as shown in Fig. 3, 4 and 5. The figures show a sample scene of normal behavior and an abnormal one. Each figure also shows the ground truth (as viewed by a human observer) about the frames being classified by the classifier (green indicates normal behavior while red indicates an anomaly is present in the frame).

The effect of feature patch size on detection accuracy (Table 1) shows that a 7×7 pixel patch seems to give consistently high detection rates across the three behavior types (Fig. 6). Of the three scenes, the best performance is achieved for the 'herding' scene.

Table 1: Detection accuracy for two pixel patch sizes of feature descriptors (20-topic LDA model)

Patch size	Rush	Scatter	Herding
7×7	0.935	0.915	0.945
9×9	0.861	0.970	0.901

Fig. 6: ROC curve (true positive rate vs. false positive rate) for classifier performance on the three behavior types (baseline, Type 1-rush, Type 2-scatter, Type 3-herding) using 7×7 patch size

CONCLUSION

This study proposes a novel behavior descriptor for compact representation of crowd behavior, towards abnormal behavior detection. It uses an aggregation of features from the optical flow, the incompressible and irrotational components of the flow field potentials and the Lyapunov exponent to capture both local and global characteristics of crowd behavior. This descriptor is not based on motion tracking. Also, it uses a bio-inspired ant-clustering algorithm in the codebook development process instead of the commonly used k-means algorithm. The effectiveness of this descriptor in modeling crowd behavior was demonstrated using a Bayesian topic model based on the bag-of-features framework to detect transitions from contextually normal to abnormal crowd behavior, such as running, scattering and herding, in three different crowd scenes. Such early detection of the transition between the normal phase and the abnormal phase is of crucial importance in surveillance applications. Further study will be done on automatic recognition of emerging anomalies in crowds and on indicating the local regions in the scenes where such anomalies occur.

ACKNOWLEDGMENT

The authors thank Prof. Wang Kejun (Harbin Engineering University) for his helpful comments.

REFERENCES

Ali, S. and M. Shah, 2007. A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, June 17-22, pp: 1-6.
Andrade, E.L., R.B. Fisher and S. Blunsden, 2006. Detection of emergency events in crowded scenes. Proceeding of the Institution of Engineering and Technology Conference on Crime and Security, Savoy Place, London, UK, June 13-14, pp: 528-533.
Blei, D.M., A.Y. Ng and M.I. Jordan, 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3: 993-1022.
Blei, D., L. Carin and D. Dunson, 2010. Probabilistic topic models. IEEE Signal Process. Mag., 27(6): 55-65.
Boghossian, B.A. and S.A. Velastin, 1999. Motion-based machine vision techniques for the management of large crowds. Proceeding of the IEEE 6th International Conference on Electronics, Circuits and Systems, Cyprus, September 5-8, pp: 961-964.
Brox, T., A. Bruhn, N. Papenberg and J. Weickert, 2004. High accuracy optical flow estimation based on a theory for warping. Proceeding of the European Conference on Computer Vision (ECCV), pp: 25-36.
Davies, A.C., J.H. Yin and S.A. Velastin, 1995. Crowd monitoring using image processing. Elec. Commun. Eng. J., 7(1): 37-47.
Helbing, D. and P. Molnár, 1995. Social force model for pedestrian dynamics. Phys. Rev. E, 51(5): 4282-4286.
Mehran, R., A. Oyama and M. Shah, 2009. Abnormal crowd behavior detection using social force model. Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, Miami Beach, Florida, USA, June 19-26, pp: 935-942.
Mehran, R. and M. Shah, 2011. UMN Dataset. Retrieved from: http://www.cs.ucf.edu/~ramin/?page_id=24#1.UMN-Dataset.
Niebles, J.C. and F.F. Li, 2007. A hierarchical model of
shape and appearance for human action
classification. Proceeding of the IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition, Minneapolis, MN, United States, June
17-22, pp: 1-8.
Shadden, S.C., F. Lekien and J.E. Marsden, 2005.
Definition and properties of Lagrangian coherent
structures from finite-time Lyapunov exponents in
two-dimensional aperiodic flows. Phys. D: Nonlin.
Phenom., 212(3-4): 271-304.
Shandong, W., B.E. Moore and M. Shah, 2010. Chaotic
invariants of Lagrangian particle trajectories for
anomaly detection in crowded scenes. Proceeding of
the IEEE Conference on Computer Vision and
Pattern Recognition San Francisco California, June
13-18, pp: 2054-2060.
University of Reading, 2010. PETS 2010 Benchmark Data. Retrieved from: http://www.cvg.rdg.ac.uk/PETS2010/a.html.
Vaswani, N., A. Roy Chowdhury and R. Chellappa, 2003. Statistical shape theory for activity modeling. Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong, April 6-10, pp: 493-496.
Vaswani, N., A.K. Roy-Chowdhury and R. Chellappa,
2005. Shape activity: A continuous-state HMM for
moving/deforming shapes with application to
abnormal activity detection. IEEE Trans. Image
Process., 14(10): 1603-1616.
Wang, K. and O.P. Popoola, 2010. Ant-based clustering
of visual-words for unsupervised human action
recognition. Proceeding of the Second World
Congress on Nature and Biologically Inspired
Computing Kitakyushu, Japan, December 15-17, pp:
654-659.