Human Body/Head Orientation Estimation and
Crowd Motion Flow Analysis from a Single Camera
Ovgu Ozturk
June 2010, Tokyo
The University of Tokyo
Supervisor: Kiyoharu Aizawa
Co-supervisor: Toshihiko Yamasaki
Acknowledgments
It has been a long way and there were really tough times, but I had my friends beside
me whenever I needed them. So the deepest appreciation goes to those special heroes in
various parts of my life in Japan.
I would like to express my deepest sense of gratitude to my supervisor Prof. Kiyoharu Aizawa for his patient guidance, insightful comments and valuable advice throughout this study. He provided a peaceful atmosphere, and his considerateness helped us to overcome obstacles and succeed in our goals.
My sincere thanks and appreciation go to my co-supervisor Dr. Toshihiko Yamasaki for his diligence, his extensive knowledge, and for creating a gentle atmosphere by saying “good morning” every day. This thesis would not have been possible without his support in a number of ways.
Dr. Chaminda de Silva was there from the beginning till the end with his continuous
support. I am heartily thankful to him for sharing his experiences about life and research
with me.
I am indebted to many of my lab colleagues for their support during the projects and experiments, and for helping me deal with life in Japan. I enjoyed the time spent with them.
My appreciation and gratitude also go to the Japanese Government for providing me with the MEXT scholarship to pursue my education in Japan.
With deepest love and gratitude, I would like to thank my family, the most precious people in my life.
Table of Contents
List of Figures
List of Tables
Abstract
List of Publications
Introduction
1.1 Motivation
1.2 Objectives
1.3 Organization of the Thesis
Human Tracking and Body/Head Orientation Estimation
2.1 Introduction
2.1.1 Related Work
2.2 System Overview
2.3 Human Tracking by using Particle Filters
2.4 Human Body/Head Orientation Estimation Algorithm
2.6 Experimental Results
2.7 Conclusions and Future Work
Dominant Motion Flow Analysis In Crowds
3.1 Introduction
3.2 System Overview
3.3 SIFT Motion Flow Vector Generation
3.4 Hierarchical Clustering of Local Motion Flow Vectors
3.5 Constructing Global Dominant Motion Flows from Local Motion Flows
3.6 Experimental Results
3.7 Discussion and Conclusions
Future Footsteps: An intelligent interactive system for public entertainment
4.1 Introduction
4.1.1 Related Work
4.2 System Overview
4.2.1 System Architecture
4.2.2 Calibration
4.3 Real-time Tracking of Multiple Humans
4.3.1 Background Subtraction and Blob Extraction
4.3.2 Association of Blobs
4.4 Analysis of Tracking Results and Visualization of Footsteps
4.4.1 Prediction of Future Footsteps
4.4.2 Visualization of Future Footsteps
4.5 Experimental Results
4.5.1 Results from Various Situations
4.5.2 User Study
4.7 Conclusions
4.8 Future Work
Conclusions
Discussions and Future Work
References
List of Figures
Figure 1. An example view from a market place.
Figure 2. An example view from a street.
Figure 3. Visual focus of attention of people in a market place.
Figure 4. Overview of Body/Head Orientation Estimation System.
Figure 5. Edge and color orientation histograms.
Figure 6. Grouping of head-shoulder contours.
Figure 7. SIFT motion flow vectors around head region.
Figure 8. Addition of SIFT motion flow vectors.
Figure 9. Tracking of a person and estimation of head/body orientation.
Figure 10. Image patches of various head-shoulder regions.
Figure 11. Experimental results.
Figure 12. Experimental results of challenging cases.
Figure 13. Structured/unstructured crowd scene examples.
Figure 14. System Overview.
Figure 15. SIFT motion flow vector.
Figure 16. SIFT motion flow vectors in a given image region.
Figure 17. SIFT motion flows for 100 frames and 400 frames.
Figure 18. Motion flow map and local regions for the entire scene.
Figure 19. Hierarchical clustering of motion flow vectors.
Figure 20. Dividing into local regions and creating an orientation histogram for each.
Figure 21. Hierarchical Clustering Steps.
Figure 22. Local regions and motion flow maps.
Figure 23. Local dominant motion flows.
Figure 24. Neighborhood schema for local dominant flow vectors.
Figure 25. Connecting local flows to obtain global flows.
Figure 26. Input data.
Figure 27. Global dominant motion flows.
Figure 28. Combining global flows one step more.
Figure 29. Ground truth.
Figure 30. Depicted future footstep of a girl.
Figure 31. General view of the area from the camera.
Figure 32. Placement of the system in the airport.
Figure 33. Inside of the box.
Figure 34. The placement of the camera and the box.
Figure 35. System architecture.
Figure 36. The calibration step is displayed.
Figure 37. Blob extraction: Example input scenes and enlarged view of a partial area in the input scene.
Figure 38. Blob extraction for a child.
Figure 39. Example extracted regions of adults.
Figure 40. Example extracted regions of children.
Figure 41. Blob tracking during three consecutive frames.
Figure 42. Association of blobs stored in the data structure.
Figure 43. Gradually disappearing images of a foot.
Figure 44. Various foot shapes used in the system.
Figure 45. Experimental results: visualization of future footsteps for various people.
Figure 46. Experimental results: various future footsteps.
Figure 47. Experimental results: various future footsteps.
Figure 48. Visualization of mostly followed paths from top-view.
Figure 49. User Reaction: a woman is jumping right and left to play with the displayed footsteps.
Figure 50. User Reaction: a little girl is exploring and trying to step on the footsteps.
List of Tables
Abstract
In the last few decades, automation of descriptive and statistical analysis of human behavior has become a very significant research topic. Owing to advances in video technology, many researchers have focused on the detection and analysis of human motion from video cameras. They have tried to develop intelligent systems that contribute to automatic control and alarm systems, and to automatic data evaluation and processing. In this respect, analysis of human behavior is important for many different applications such as marketing, social behavior modeling, security systems, and human-robot interaction. To build such systems, the main tasks are detecting humans in a given scene, counting them, tracking their motion and analyzing their motion trajectories. In addition, recent research tries to understand human face and body gestures, such as smiling, walking, jogging or waving a hand. Until now, there has been significant progress in detecting humans in public spaces, tracking their motion and understanding their gestures.
To analyze a given scene in more detail, the next step is to measure people’s focus of attention, which remains an unsolved problem. The visual focus of attention of humans is defined as the direction they are heading to or the direction they are looking at during their motion. Humans show their attention by walking towards a direction or by turning their head to it. The paths they walk can give us information about their interests in the environment. Hence, the orientation of their body and head can give us a hint about their visual focus of attention. In crowd scenes, the most common motion paths can give us information about the popularity of the places in the environment.
Currently, a large number of studies try to solve the body/head orientation estimation problem by using multiple cameras or multiple sensors, or by placing markers on the body. These approaches are often too impractical or expensive to deploy in common public places. Our aim is to extract as much useful information as possible for human motion analysis in a given public scene from a single camera. This is very challenging due to the articulations of the human body and pose and the limited data available. On the other hand, by using only a single camera, we can build portable, low-cost systems with less complexity.
In our research we focus on two major problems. First, we studied the estimation of the visual focus of attention of people. We have developed a system that tracks people and estimates their body and head orientation. Second, we have analyzed crowd scenes and proposed a method to calculate dominant motion flows that can handle very complex situations. Finally, as an application of our work, we present an interactive human tracking system developed for a digital art project that was exhibited at Haneda Airport in Tokyo for one month.
The “Tracking of Humans and Estimation of Body/Head Orientation” part addresses the problem of determining the body and head orientation of a person while tracking the person in an indoor environment monitored by a single top-view camera. We capture the top view of the scene from a very high position, and the resolution of a person in the data is low. By analyzing the head-shoulder contour of a person and by using the motion flows of distinctive image features (SIFT features) around the head region, we estimate the body/head orientation of the person. Experimental results show the successful application of our method with at most five degrees of error.
Detecting dominant motion flows in crowd scenes is one of the major problems in understanding the content of a crowded scene. In our work, we focus on analyzing crowd motion for structured and unstructured crowds, where the motion of the people is very complex and unpredictable. We propose a hierarchical clustering of instantaneous motion flow vectors accumulated over a long time to find the most frequently followed patterns in the scene. Experimental results demonstrate the successful extraction of dominant motion flows in challenging real-world scenes.
An intelligent interactive public entertainment system, which employs multiple-human tracking from a single camera, has been developed. The proposed system robustly tracks people in an indoor environment and displays their predicted future footsteps in front of them in real time. To evaluate its performance, the proposed system was exhibited during a public art exhibition in an airport. The system successfully tracked multiple people in the environment and displayed the footstep animations in front of them. Many people participated in the exhibition; they showed surprise, excitement and curiosity, and tried to control the display of the footsteps by making various movements.
List of Publications
Book Chapter
1. 未来の足跡「Footprint of Your Future」, in 空気の港「Digital Public Art in Haneda Airport」, O. Ozturk, T. Matsunami, M. Togashi, K. Sawada, T. Ohtani, Y. Suzuki, T. Yamasaki, K. Aizawa, published by 美術出版社 (Bijutsu Shuppan-sha)
Journal Articles
1. "Real-time tracking of multiple humans and visualization of their future footsteps in
public indoor environments", O. Ozturk, Y. Suzuki, T. Yamasaki, K. Aizawa,
Multimedia Tools and Applications, Special Issue on Intelligent Interactive
Multimedia Systems and Services, Springer. (submitted on March 11)
2. “Human Tracking and Visual Focus of Attention Estimation From a Single Camera
In Indoor Environments”, O. Ozturk, T. Yamasaki, K. Aizawa, EURASIP Journal on
Image and Video Processing (to be submitted)
Reviewed Conference Papers
1. “Detecting Dominant Motion Flows In Unstructured/structured Crowd Scenes”, O. Ozturk, T. Yamasaki, K. Aizawa, International Conference on Pattern Recognition, ICPR2010, Istanbul, Turkey.
2. “Can you SEE your ‘FUTURE FOOTSTEPS’?”, O. Ozturk, T. Matsunami, Y. Suzuki, T. Yamasaki, K. Aizawa, Proceedings of the International Conference on Virtual Reality, VRIC2010, April 7-11, 2009, Laval, France.
3. “Tracking of Humans and Estimation of Body/Head Orientation from Top-view
Single Camera for Visual Focus of Attention Analysis”, O. Ozturk, T. Yamasaki, K.
Aizawa, THEMIS2009 Workshop held within ICCV 2009, Sept 27-Oct 4.
4. “Content-aware Control for Efficient Video Transmission of Wireless Multi-camera
Surveillance Systems”, O. Ozturk, T. Yamasaki, K. Aizawa, PhD Forum,
ICDSC2007, September, Vienna.
5. “Human Visual Focus of Attention Analysis by Tracking Head Orientation with
Particle Filters”, O. Ozturk, T. Yamasaki, K. Aizawa, International Conference on
Advanced Video and Signal-based Surveillance, AVSS2010, Boston, USA.(to be
submitted March 26)
Non-reviewed Conference Papers
1. “Multiple Human Tracking and Body Orientation Estimation by using Cascaded
Particle Filter from a Single Camera”, O. Ozturk, T. Yamasaki, K. Aizawa, P-3-23,
PCSJ/IMPS2009, Oct. 7- Oct. 9, 2009, Shizuoka.
2. “Content-aware Video Transmission for Wireless Multi-camera Surveillance”, O.
Ozturk, T. Hayashi, T. Yamasaki, K. Aizawa, P-5-02, PCSJ/IMPS2007, Oct. 31- Nov.
2, 2007, Shizuoka.
3. “Content-aware Spatio-Temporal Rate Control of Video Transmission for Wireless
Multi-camera Surveillance”, O. Ozturk, T. Yamasaki, K. Aizawa, IEICE2008,
D-11-20, March 18-21, Kita-Kyushu.
Chapter 1
Introduction
Analysis of human behavior in public places is an important topic which has
attracted much attention from many researchers, designers, companies and
organizations. It is critical to know how humans move in public spaces, how they react to their surroundings, and how they show attention to objects of interest. For example, recognizing pedestrian behavior at bus stops, crossings or stations is useful for security reasons. People looking at bulletin boards or commercial screens, or customers walking around market stands, can provide information about recent trends, marketing strategies and effective advertisement methods. Intelligent human-computer interfaces can be developed by analyzing the user’s body motions. Studying the general characteristics of the behavior of humans or crowds can provide a measure to distinguish normal from abnormal situations and to detect emergencies.
In this respect, analyzing human motion can give us useful feedback for building autonomous intelligent environments and systems that improve the quality of life in many different ways. One of the most useful and efficient ways to understand human motion is to utilize image and video processing methods. In our era, video cameras are installed everywhere and have become a part of everyday life. There are various security cameras indoors and outdoors, and many people use web cameras with their computers. It is now very simple and cheap to obtain image/video data of a person or an environment. Hence, tracking humans and understanding their motion via computer vision techniques is of great service to many.
1.1
Motivation
Recently, there has been significant progress in the field of computer vision in analyzing various scenes captured from video cameras and extracting useful information for automatic decision making. Video cameras are installed in markets, stations and shops, and various cameras are mounted on computers, automatic machines and robots. It is possible to capture an image of a scene or record video data of it, and then process the data to acquire meaningful information.
There has been considerable research in this area. It is now possible to detect humans in an image and to recognize different objects. Furthermore, there are algorithms to count humans and even to track their motion under various circumstances. Given an image of a scene, the existence of people and their identity can be extracted, and their body pose or the type of their motion can be analyzed to some extent. There are systems that detect simple actions, such as walking, running, raising hands or kneeling down. Other systems track moving people and find their motion paths.
As the next step in scene analysis, more detailed analysis of human motion remains unsolved. For example, measuring people’s focus of attention is still an open problem. Regarding this, our research focuses on two major problems. The first is to estimate the visual focus of attention of people while tracking them. The second is to detect the dominant motion flows in crowd scenes.
The visual focus of attention of humans is defined as the direction they are heading to or the direction they are looking at during their motion. Depending on the camera view, this definition can vary. Considering the general view of a shopping area, we can say that humans are attracted by their destination or by the objects around them. They show their attention by walking towards that direction or by turning their head to it. The paths they walk can give us information about their interests in the environment. Hence, the orientation of their body and head can give us a hint about their visual focus of attention. In crowd scenes, it is impossible to investigate each person’s movement in detail or to acquire enough information from each person to evaluate their body/head orientation. For crowd scenes, the most common motion paths can give us information about the popularity of the places in the environment.
In our research, we state our problem as estimating the visual focus of attention of people from a single camera. Currently, a large number of studies try to solve the same problem by using multiple cameras or multiple sensors, or by placing markers on the body. These approaches are often too impractical or expensive to deploy in common public places. On the other hand, by using only a single camera, we can build portable, low-cost systems with less complexity. Figures 1 and 2 give example images of scenes that represent our problem settings.
1.2
Objectives
In our work, we focused on two major problems. First, we studied the estimation of the visual focus of attention of people and developed a system that tracks people and estimates their body and head orientation. Second, we analyzed crowd scenes and proposed a method to calculate dominant motion flows that can handle very complex situations. Finally, as an application of our work, we participated in a digital art project that was exhibited at Haneda Airport in Tokyo for one month, where we developed a real-time multiple-human tracking system that visualizes people’s predicted future footsteps during their motion. Below, the main characteristics and major contributions of each part are introduced briefly.
Tracking of Humans and Estimation of Body/Head Orientation: This part addresses the problem of determining the body and head orientation of a person while tracking the person in an indoor environment monitored by a single top-view camera. The challenging part of this problem is that there is a wide range of human appearances depending on the position relative to the camera and on pose articulations. In this work, a two-level cascaded particle filter approach is introduced to track humans. Color cues are used at the first level, and edge-orientation histograms are utilized to support the tracking at the second level. To determine the body and head orientation, a combination of Shape Context and SIFT features is proposed. The body orientation is calculated by matching the upper region of the body with predefined shape templates, finding the orientation within ranges of π/8. Then, the optical flow vectors of the SIFT features around the head region are calculated to evaluate the direction and type of the motion of the body and head. We demonstrate experimental results showing that the body and head orientation are successfully estimated. A discussion of various motion patterns and of future improvements for more complicated situations is given.
Detecting dominant motion flows in crowds: Detecting dominant motion flows
in crowd scenes is one of the major problems in video surveillance. This is particularly
difficult in unstructured crowd scenes, where the participants move randomly in various
directions. In our work we present a novel method which utilizes SIFT features’ flow
vectors to calculate the dominant motion flows in both unstructured and structured
crowd scenes. SIFT features can represent the characteristic parts of objects, allowing
robust tracking under non-rigid motion. First, flow vectors of SIFT features are
calculated at certain intervals to form a motion flow map of the video. Next, this map is
divided into equally sized square regions and in each region dominant motion flows are
estimated by clustering the flow vectors. Then, local dominant motion flows are
combined to obtain the global dominant motion flows. Experimental results demonstrate
the successful application of the proposed method to challenging real-world scenes.
An intelligent interactive system for public entertainment: In this work, an
interactive entertainment system which employs multiple-human tracking from a single
camera is presented. The proposed system robustly tracks people in an indoor
environment and displays their predicted future footsteps in front of them in real-time.
The system is composed of a video camera, a computer and a projector. There are three
main modules: tracking, analysis and visualization. The tracking module extracts people
as moving blobs by using an adaptive background subtraction algorithm. Then, the
location and orientation of their next footsteps are predicted. The future footsteps are
visualized by a high-paced continuous display of foot images in the predicted location
to simulate the natural stepping of a person. To evaluate its performance, the proposed system was exhibited during a public art exhibition in an airport. People showed surprise, excitement and curiosity, and tried to control the display of the footsteps by making various movements.
Figure 1. An example view from a market place.
Figure 2. An example view from a street
1.3
Organization of the Thesis
The rest of this thesis is organized as follows. Chapter 2 describes the developed
algorithms to track humans and estimate their body/head orientation. Chapter 3 explains
our algorithm to detect dominant motion flows. Chapter 4 introduces an interactive
application project, “Future Footsteps”, which utilizes real-time multiple-human
tracking. In each chapter, the recent related advances in the field are reviewed in detail, the developed algorithms are explained, and experimental results for various situations are presented. Additionally, Chapter 4 introduces user studies about the interactive system. Chapter 5 concludes the work presented in the thesis, followed by the discussions and future improvements summarized in Chapter 6.
Chapter 2
Human Tracking and Body/Head
Orientation Estimation
2.1 Introduction
Analysis of human behavior in public places is an important topic which has
attracted much attention from many researchers, designers, companies and
organizations. It is critical to know how humans move in public spaces. Until now, there has been a huge amount of research on detecting, counting and tracking humans. However, few works have gone one step further to detect the visual focus of attention. In this work, we address a relatively untouched problem and focus on the tracking of humans and the estimation of their visual focus of attention in indoor environments. More specifically, we would like to know which direction a person is looking at while wandering in an indoor environment. We present a tracking system that keeps track of a person under random motions, and propose a new method to find the orientation of the body, and moreover of the head, while tracking the person.

Figure 3. Visual focus of attention of people in a market place
Our contribution is two-fold. First, we have developed a two-level cascaded particle filter tracking system to track the body motion. The human appearance in the image is modeled as an elliptic region, which is very effective and adaptable for tracking the target in any pose seen from any angle. The appearance model of the target is constructed by two methods: the first uses a random-point color histogram, and the second an edge-orientation histogram. The color histogram forms the basis for tracking and is calculated at each iteration. The edge-orientation histogram is used at certain intervals and is only updated when necessary. Our second contribution concerns how to calculate the direction of the head movements and track the head motion throughout the video frames. We use the Shape Context approach \cite{Seemann, ShapeContext} to detect the basic body orientation and propose an optical flow approach using distinctive features, SIFT features \cite{Sift}, around the head region. The displacement of distinctive features around the head region gives an idea about the local motion. It is then combined with the change in the center of mass to evaluate the overall motion of the body and head.
This chapter describes our approach and gives initial experimental results. We discuss various cases, including both successful and insufficient situations, and seek ways to further improve the algorithm. Our work is part of a project in an airport in Tokyo, and our human tracking and orientation estimation system will be used to measure the visual focus of attention of the audience during an art exhibition. For the time being, occlusions are excluded and humans in the images are assumed not to be carrying big bags on their shoulders, which would distort the head-shoulder triangle.
The rest of the chapter is organized as follows. The next section introduces the overall system, explaining our tracking and orientation estimation algorithms. Experimental results and the related discussion are presented in Section 2.6, followed by conclusions and future work in Section 2.7.
2.1.1 Related Work
In the area of computer vision, considerable progress has been made in automatic detection \cite{Wu,Zhang,Sabzmeydani,Zhao,Wojek}, counting \cite{VideoSecurity,Counting} and tracking of humans \cite{Yang,YuanLi,Han}. However, most of the previous efforts have relied on two main groups of constraints: the first on the motion path and the second on the body shape. Some researchers \cite{Seemann,VideoSecurity,Counting} used gates or passages where human motion can be predicted and is linear most of the time. Besides, the majority of tracking algorithms used constant shape features of the human body \cite{Wu,Sabzmeydani,Counting,Zhao}, such as the $\Omega$ shape of the head-shoulders or a combination of body parts (head, torso, legs) \cite{Wu,Zhao}. However, in general indoor public places such as market places or exhibition areas, where people walk randomly, there is an extensive range of human poses seen from the camera and unpredicted changes in motion direction. Although tracking has been achieved to some extent, estimation of the human body and head orientation while tracking still remains unsolved.
To the best of our knowledge, this work is the first attempt to extract the orientation of both the body and the head of wandering humans in a public space by using only video data. Glas et al. \cite{LaserTracking} study a similar problem by combining video data and laser scanner data. Their work extracts the position of the two arms and the head from a top-view appearance and finds the orientation of the human body in the scene. Using laser scanners increases the cost and limits the portability of the system. In our work, we only use a single camera, and we aim to detect not only the body orientation but also the head orientation, which would not be possible with a laser scanner system as in \cite{LaserTracking}. O. Ba and J. Odobez \cite{Odobez02} explore the head orientation detection problem with single and multiple cameras. They train on various poses of the human head seen from the front and combine the detection results from multiple cameras to estimate the head orientation. Matsumoto et al. \cite{Matsumoto} focus on gaze detection from a single face image to find the focus of attention. More advanced research was conducted by Smith et al. \cite{TrackingVFOA}, who use a single camera and computer vision techniques to track the head orientation of humans while they pass in front of an outdoor advertisement.
For our case, the situation is more complicated, as we target a general case with motion of the body and head in any direction. We work with data captured from a single top-view camera in an indoor public environment. Since privacy is a critical issue, only top-view capturing is permitted, yielding less personal data acquisition and hence higher privacy protection. As a result of top-view capturing, the appearance of the human body can change depending on the position relative to the camera and the orientation of the body. There is little distinctive data and a wide range of human poses. We define estimating the visual focus of attention as detecting the orientation of the body and following the head movements to find where the person is looking.
2.2 System Overview
We intend to automatically track humans wandering in an indoor environment and analyze their behavior. The environment selected for this work is a marketplace inside an airport, monitored by a single camera mounted sufficiently high to provide a top view of the scene. Figure 3 shows an example scene from our experiments in a market place at an airport. The data used in the experiments is captured in full HD mode, providing a resolution of 1440x1080 pixels for each frame; an average human appearance has a resolution of about 70x90 pixels. In this work, we focus on people who are walking around or standing still and looking around. It is assumed that people do not carry big bags on their shoulders. Occlusion handling and generalization of the approach to various behaviors are left as future work.

Figure 4. Overview of Body/Head Orientation Estimation System
Our system is composed of two main modules and an initialization step, as illustrated in Figure 4. The initialization step is composed of four parts. First, blobs indicating the human regions in the scene are extracted by a background subtraction algorithm. For each blob, a color histogram and an edge-orientation histogram model are generated to describe the appearance of the target human during the tracking process. Then, the initial body orientation is calculated using shape context matching. After this, the tracking process starts, and at each frame the target is tracked by a particle filter based on the color-histogram model. At intervals of 10 frames, the displacement of the blob's center of mass, the change in its area and the optical flow of the SIFT features are observed to evaluate the change in the orientation of the body and head. If an orientation change is detected, the new orientation is estimated, the edge-orientation histogram is updated, and the tracking process starts again for another 10-frame interval. During each tracking interval, the particle filter is run with the color histogram at every iteration, while the edge-orientation histogram is used once every three iterations to validate the tracking process.
The Blob Detection, Color Histogram Generation and Edge-orientation Histogram Generation parts of the initialization are described below. The steps of Blob Detection are given in Table 1.
Blob Detection:
1. Take the difference between the input and background images in the H, S and V color components
2. Apply thresholding
3. Apply majority voting over the H, S and V color values
4. Apply median filtering
5. Apply a morphological opening process
6. Find blobs with an area larger than a threshold
7. For each blob, assign an ellipse (center, orientation, major axis length, minor axis length)
Table 1. Blob Detection
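As a rough illustration of the steps in Table 1, a minimal OpenCV/Python sketch is given below. The difference threshold, kernel size and minimum blob area are assumed values for illustration, not the parameters used in this work.

import cv2
import numpy as np

def detect_blobs(frame_bgr, background_bgr, diff_thresh=30, min_area=2000):
    """Sketch of Table 1: HSV differencing, majority voting, filtering,
    morphology and ellipse fitting.  Parameter values are illustrative."""
    # 1-2. Per-channel HSV difference against the background, then thresholding
    frame_hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    bg_hsv = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV)
    diff = cv2.absdiff(frame_hsv, bg_hsv)
    channel_masks = [(diff[:, :, c] > diff_thresh).astype(np.uint8) for c in range(3)]

    # 3. Majority voting over the H, S, V channels (at least 2 of 3 agree)
    votes = channel_masks[0] + channel_masks[1] + channel_masks[2]
    mask = np.where(votes >= 2, 255, 0).astype(np.uint8)

    # 4-5. Median filtering and morphological opening to remove speckle noise
    mask = cv2.medianBlur(mask, 5)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # 6-7. Keep blobs larger than the area threshold and fit an ellipse to each
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ellipses = []
    for cnt in contours:
        if cv2.contourArea(cnt) >= min_area and len(cnt) >= 5:
            # (center, (major axis, minor axis), orientation in degrees)
            ellipses.append(cv2.fitEllipse(cnt))
    return mask, ellipses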
Color Histogram Generation:
Figure 5 shows the color histogram and edge-orientation histogram generation. We utilize a color histogram of 18 bins in total, where 6 bins are used for each color component of the RGB color space. For each component, the 0-255 color range is divided equally into 6 bins. From a predetermined number (N = 120) of randomly chosen points inside the elliptic region, each color component of each point contributes to the corresponding bin with its value.

Figure 5. Edge and color orientation histograms.
Edge-orientation Histogram Generation:
Sobel edge detection is used. The orientation and strength of the edges are calculated similarly to the method in [Yang]. Defining an edge-orientation histogram where the [-π/2, π/2] range is divided into 8 bins, each edge contributes to its corresponding bin with the amount of its strength.
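The two appearance histograms can be sketched as follows. The sampling scheme inside the ellipse and the normalization are assumptions of this sketch; only the bin counts (18 color bins, 8 orientation bins over [-π/2, π/2]) and the 120 random points follow the description above.

import numpy as np
import cv2

def color_histogram(img_bgr, center, axes, angle_rad, n_points=120, bins_per_ch=6):
    """18-bin color histogram (6 bins per channel) from random points sampled
    inside the elliptic region.  The sampling scheme is an assumption."""
    cx, cy = center
    a, b = axes  # semi-axis lengths of the ellipse
    # Sample uniformly inside a unit disk, then scale and rotate into the ellipse
    r = np.sqrt(np.random.rand(n_points))
    phi = 2 * np.pi * np.random.rand(n_points)
    ex, ey = a * r * np.cos(phi), b * r * np.sin(phi)
    xs = cx + ex * np.cos(angle_rad) - ey * np.sin(angle_rad)
    ys = cy + ex * np.sin(angle_rad) + ey * np.cos(angle_rad)
    xs = np.clip(xs, 0, img_bgr.shape[1] - 1).astype(int)
    ys = np.clip(ys, 0, img_bgr.shape[0] - 1).astype(int)
    hist = np.zeros(3 * bins_per_ch)
    for c in range(3):  # the three color channels, each divided into 6 equal bins
        vals = img_bgr[ys, xs, c].astype(int)
        idx = np.minimum(vals * bins_per_ch // 256, bins_per_ch - 1)
        np.add.at(hist, c * bins_per_ch + idx, 1)
    return hist / hist.sum()

def edge_orientation_histogram(gray_patch, n_bins=8):
    """8-bin edge-orientation histogram over [-pi/2, pi/2], each pixel
    contributing its Sobel gradient magnitude (edge strength) to its bin."""
    gx = cv2.Sobel(gray_patch, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray_patch, cv2.CV_32F, 0, 1, ksize=3)
    strength = np.hypot(gx, gy)
    theta = np.arctan(np.divide(gy, gx, out=np.zeros_like(gy), where=gx != 0))
    idx = np.minimum(((theta + np.pi / 2) / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, idx.ravel(), strength.ravel())
    return hist / (hist.sum() + 1e-9)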
2.3 Human Tracking by using Particle Filters
In our system, we employ the particle filter approach, one of the most popular tracking methods in computer vision. The particle filter, also known as the condensation algorithm, is a Bayesian sequential importance sampling technique which recursively approximates the posterior probability density function (pdf) of the state space using a finite set of weighted samples. Target objects are defined by their observation models, which are used to measure the observation likelihood of the samples in the candidate region.
The particle filter tracking approach basically consists of two steps: prediction and update. Given all available observations $Z_{1:t-1} = \{Z_1, \ldots, Z_{t-1}\}$ up to time $t-1$, the prediction stage uses the probabilistic system transition model $p(X_t \mid X_{t-1})$ to predict the posterior at time $t$ as:

$p(X_t \mid Z_{1:t-1}) = \int p(X_t \mid X_{t-1}) \, p(X_{t-1} \mid Z_{1:t-1}) \, dX_{t-1}$   (1)

At time $t$, the observation $Z_t$ becomes available and the state can be updated using Bayes's rule:

$p(X_t \mid Z_{1:t}) = \dfrac{p(Z_t \mid X_t) \, p(X_t \mid Z_{1:t-1})}{p(Z_t \mid Z_{1:t-1})}$   (2)

where $p(Z_t \mid X_t)$ is described by the observation equation. The posterior $p(X_t \mid Z_{1:t})$ is approximated by a finite set of $N$ samples $\{X_t^i\}_{i=1,\ldots,N}$ with importance weights $w_t^i$. As shown in \cite{Monte}, the weights become the observation likelihood $p(Z_t \mid X_t^i)$. After some iterations of the prediction and update steps, when the number of samples decreases below a certain threshold, the samples are re-sampled with equal weights.
In our work, we combine two types of observation models, a color histogram and an edge-orientation histogram, in a two-level cascaded approach. The object to be tracked is assigned an elliptic region, similar to the work in \cite{Sebastian}. At each iteration, N = 300 particles are sampled as candidates; for each particle, a color histogram is calculated from 120 randomly sampled points inside the elliptic region around the particle, as shown in Figure 5. Then, after a certain number of iterations (n = 3 was found to be the most appropriate), the edge-orientation histogram of the head region is generated and used to support tracking.
The state space is defined as $X = \{x, \dot{x}, y, \dot{y}, h_x, h_y, \theta\}$, where the states are given as follows:

x, y: center of the ellipse
$\dot{x}, \dot{y}$: velocity of the center in two dimensions
$h_x, h_y$: axis lengths of the ellipse
$\theta$: orientation of the ellipse

The state equations are:

$x_t = x_{t-1} + k \cdot \dot{x}_{t-1}$   (3)
$\dot{x}_t = \dot{x}_{t-1}$   (4)
$y_t = y_{t-1} + k \cdot \dot{y}_{t-1}$   (5)
$\dot{y}_t = \dot{y}_{t-1}$   (6)
$h_{x,t} = h_{x,t-1}$   (7)
$h_{y,t} = h_{y,t-1}$   (8)
$\theta_t = \theta_{t-1}$   (9)

The state variables are assumed to be affected by Gaussian noise, where appropriate covariance values are assigned empirically considering the kinematics of the ellipse. The Bhattacharyya distance is used to compare the histogram of the sample points with the histogram of the object.
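A minimal Python sketch of one prediction/update cycle of this color-histogram particle filter is given below. The constant-velocity prediction follows Eqs. (3)-(9); the noise levels, the likelihood parameter sigma, the effective-sample-size resampling rule and the observe_hist callback (which would compute the color histogram around a candidate ellipse) are assumptions of the sketch.

import numpy as np

def bhattacharyya_distance(h1, h2):
    # Distance between two normalized histograms, used for the likelihood
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(h1 * h2))))

def particle_filter_step(particles, ref_hist, observe_hist, k=1.0,
                         noise_std=(2.0, 0.5, 2.0, 0.5, 1.0, 1.0, 0.05),
                         sigma=0.2, resample_ratio=0.5):
    """One prediction/update cycle for the state X = (x, dx, y, dy, hx, hy, theta).
    observe_hist(particle) returns the color histogram around a candidate particle.
    Noise levels, sigma and the resampling rule are illustrative assumptions."""
    n = len(particles)
    # Prediction: constant-velocity model of Eqs. (3)-(9) plus Gaussian noise
    pred = particles.copy()
    pred[:, 0] += k * particles[:, 1]   # x <- x + k * dx
    pred[:, 2] += k * particles[:, 3]   # y <- y + k * dy
    pred += np.random.normal(0.0, noise_std, size=pred.shape)

    # Update: weight each particle by its observation likelihood, derived from
    # the Bhattacharyya distance between reference and candidate histograms
    weights = np.empty(n)
    for i in range(n):
        d = bhattacharyya_distance(ref_hist, observe_hist(pred[i]))
        weights[i] = np.exp(-d * d / (2.0 * sigma ** 2))
    weights /= weights.sum()

    # Resample with equal weights when the effective sample size drops too low
    if 1.0 / np.sum(weights ** 2) < resample_ratio * n:
        idx = np.random.choice(n, size=n, p=weights)
        pred, weights = pred[idx], np.full(n, 1.0 / n)

    estimate = np.average(pred, axis=0, weights=weights)  # posterior mean state
    return pred, weights, estimate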
2.4 Human Body/Head Orientation Estimation Algorithm
Human Body/Head Orientation Estimation Module is composed of three main parts:
1. Motion Vector Detection
2. SIFT Optical Flow
3. Body Orientation Check
During "Body Orientation Assignment" part of initialization module and "Body
Orientation Check" part, the same orientation estimation algorithm is used. We
implement
a
simple
but
very
effective
method,
shape
context
matching¥cite{Poppe,Wojek,ShapeContext} to determine the orientation of the body by
comparing the head region of a person with previously defined templates. Figure 6
shows the groups that compose the category set for possible appearances of
head-shoulder triangle region in the scene. In our experiments, camera monitors the
scene from a very high place and camera's position corresponds to the bottom of the
captured scene. In Figure 6, two types of placement of a camera are shown, our system
is the one with the camera at the side. If the camera was placed in a central position on
the ceiling, the number of categories would be doubled to include the other half. There
are 13 groups in the set, which is used to categorize all possible cases of head region.
Groups with the names ending with r are constructed by getting the symmetry image
about the y-axis. Canny edge detection algorithm is used to detect the edges on the
boundary of the upper half region of the body, then shape context matching is conducted
to find the corresponding group.
Figure 6. Grouping of head-shoulder contours
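A compact sketch of shape context computation and matching is given below, assuming the common 5 radial x 12 angular log-polar bins and a chi-square matching cost with optimal one-to-one assignment; it is a simplified illustration rather than the exact implementation used here (tangent-orientation terms and dummy points of the full method are omitted).

import numpy as np
from scipy.optimize import linear_sum_assignment

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar shape context descriptor for each sampled contour point."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    mean_d = d[d > 0].mean()                        # scale normalization
    logr = np.log(d / mean_d + 1e-9)
    ang = np.arctan2(pts[None, :, 1] - pts[:, None, 1],
                     pts[None, :, 0] - pts[:, None, 0])
    r_edges = np.linspace(logr[d > 0].min(), logr[d > 0].max(), n_r + 1)
    descs = np.zeros((n, n_r * n_theta))
    for i in range(n):
        others = np.arange(n) != i
        rb = np.clip(np.searchsorted(r_edges, logr[i, others]) - 1, 0, n_r - 1)
        tb = ((ang[i, others] + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
        np.add.at(descs[i], rb * n_theta + tb, 1)   # histogram of relative positions
        descs[i] /= descs[i].sum()
    return descs

def shape_context_cost(points_a, points_b):
    """Matching cost between two contours: chi-square costs between descriptors,
    optimal one-to-one assignment, then the mean matched cost."""
    da, db = shape_context(points_a), shape_context(points_b)
    chi2 = 0.5 * np.sum((da[:, None, :] - db[None, :, :]) ** 2 /
                        (da[:, None, :] + db[None, :, :] + 1e-9), axis=2)
    rows, cols = linear_sum_assignment(chi2)
    return chi2[rows, cols].mean()

In this setting, the Canny edge points of a head-shoulder contour would be compared against the 13 template groups and assigned to the group with the lowest matching cost.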
Tracking starts with the initially assigned orientation; then, for each 10-frame interval, the displacement vector (DV) of the center of the tracked region is calculated using the result from the particle filter, and the resultant optical flow vector, called "SOFV", is computed from the SIFT features around the head region. The head region is chosen since it is the most stable part compared to other body parts under various body poses, and it is not affected by shape changes due to the clothing of the person. Here, a very basic logic is applied to group the motion patterns (a code sketch is given after this list).
1. If both DV and SOFV are close to zero, the person is standing in the same position with no orientation change.
2. If DV is close to zero but SOFV is larger than a threshold, the person is standing still, but the orientation of the body and head is changing (the person is turning right or left around himself, or is only moving his head).
3. If both DV and SOFV are larger than the defined thresholds, the person is stepping towards a direction.
4. Logically, the combination of DV larger than a threshold and SOFV close to zero is not possible.
The motion type is evaluated and the orientations of the head and body are estimated as explained below by using the DV and SOFV vectors. At certain intervals, the body orientation is checked with shape context matching to support the estimation process.
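The decision logic above can be summarized by the following sketch; the threshold values are illustrative assumptions.

import numpy as np

def classify_motion(dv, sofv, dv_thresh=3.0, sofv_thresh=2.0):
    """Motion pattern grouping from the displacement vector DV and the SIFT
    optical flow vector SOFV (both 2D, per 10-frame interval).
    Threshold values are illustrative assumptions."""
    dv_moving = np.linalg.norm(dv) > dv_thresh
    head_moving = np.linalg.norm(sofv) > sofv_thresh
    if not dv_moving and not head_moving:
        return "standing, no orientation change"          # case 1
    if not dv_moving and head_moving:
        return "standing, body/head orientation changing" # case 2
    if dv_moving and head_moving:
        return "stepping towards a direction"             # case 3
    return "inconsistent (DV without SOFV)"               # case 4, not expected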
How to calculate SOFV: During the tracking process, an interval length of 10 frames is chosen to evaluate the change in orientation. SIFT features are examined in sub-intervals of length 3, 3 and 4; in other words, SIFT features are detected and tracked at the 3rd, 6th and 10th frames of the interval to construct the optical flow vectors, as shown in Figure 7. The interval lengths of 10 and 3 are set empirically to keep the SIFT feature correspondence between frames. If the sub-interval length is smaller than 3, the vector length is too small to give information about the direction change; if it is larger, the number of matched SIFT features between two frames is too few.

Figure 7. SIFT motion flow vectors around head region.

In Figure 7, a 35x35-pixel image of the head region at frame x is displayed. The 10-frame interval starts with frame x and ends with frame x+10. Red marks show the SIFT features detected in frame x, and blue marks show the matched SIFT features in frame x+3 corresponding to the ones in frame x. Similarly, yellow marks show the features of frame x+6 matched with frame x+3. Finally, white marks show the SIFT features that match the ones in frame x+6. On the right of the image, the optical flow vectors are displayed. We divide the 2D space into 4 regions, as shown, and call them -R, -L, +L and +R. For each region, we calculate the average vector of the optical flow vectors and the average position of the vectors in that region, and call them $(V_{-R}, V_{-L}, V_{+L}, V_{+R})$ and $(c_{-R}, c_{-L}, c_{+L}, c_{+R})$.
SOFV Calculation:
1. Calculate the average vectors resulting from the optical flows of the SIFT features in each region.
2. Choose the two regions with the largest average vector lengths (regions -R and -L in the example).
3. Calculate the average positions of the optical flow vectors for each region.
4. By comparing the average positions, add the two dominant vectors calculated in step 2 to find SOFV:

$SOFV = V_1 + V_2$   (10)

Figure 8. Addition of SIFT motion flow vectors

There can be a global motion opposing the local rotational motion; such motion patterns cannot be handled at the moment. In the example in Figure 7, the red line shows the average vector in -R, the yellow line shows the average vector in -L and the black line shows the average vector in +R. Since there are no vectors in region +L, the corresponding average vector is zero. The resultant SOFV is shown with a bold blue line.
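A sketch of the SOFV computation following steps 1-4 and Eq. (10) is given below. The quadrant convention around the patch center is an assumption, and the comparison of average positions in step 4 is simplified to returning the positions alongside the resulting vector.

import numpy as np

def compute_sofv(flow_starts, flow_vectors):
    """SOFV from SIFT optical flow vectors around the head region (Eq. 10):
    split the 2D space into four regions around the patch center, average the
    vectors per region, keep the two regions with the longest average vectors
    and add them.  The region convention here is an assumption of this sketch."""
    starts = np.asarray(flow_starts, dtype=float)   # feature positions
    vecs = np.asarray(flow_vectors, dtype=float)    # per-feature displacements
    center = starts.mean(axis=0)
    rel = starts - center
    # Region index 0..3 from the sign pattern of the relative position
    region = (rel[:, 0] >= 0).astype(int) * 2 + (rel[:, 1] >= 0).astype(int)
    avg_vec, avg_pos = np.zeros((4, 2)), np.zeros((4, 2))
    for r in range(4):
        sel = region == r
        if np.any(sel):
            avg_vec[r] = vecs[sel].mean(axis=0)    # step 1: average vector
            avg_pos[r] = starts[sel].mean(axis=0)  # step 3: average position
    lengths = np.linalg.norm(avg_vec, axis=1)
    top_two = np.argsort(lengths)[-2:]             # step 2: two dominant regions
    sofv = avg_vec[top_two[0]] + avg_vec[top_two[1]]   # step 4: SOFV = V1 + V2
    return sofv, avg_pos[top_two]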
Figure 9. Tracking of a person and estimation of head/body orientation.
2.6 Experimental Results
In our experiments, we use real data captured from a market place in an airport. The data was captured in HD mode, with 1440x1080 resolution, and an average human appearance occupies around 70x90 pixels.
Figure 9 shows the tracking result for a person over 720 frames (24 seconds). In the figure, the tracked person is shown inside a yellow rectangle. The red line shows the center of the tracked region at each iteration and forms a trajectory line. The elliptic regions resulting from the particle filter tracking are shown in green around the trajectory of the motion; for the sake of clarity, not all iterations are displayed, and the results are shown once per 10-frame interval. Tracking starts at frame 131, with the person heading upwards at the beginning. He walks a few steps forward and then stops and gazes around by turning right around himself, which is a clockwise rotation seen from the top. Then he turns left around himself back to the point he started from, making a counter-clockwise rotation. This takes place between frames 175 and 470, which is depicted in the figure where the green ellipses accumulate. Then he continues in his original direction, walking upwards until he comes to another stopping point at frame 740. During this scenario, the person is tracked successfully; the initial body orientation starts with groupLeft, at an angle of π/8. This continues during the straight motion; at the two stopping points, the rotation angle is calculated by using the SOFV vectors and shape context.
Figure 10. Image patches of various head-shoulder regions.
Here, we first present a set of experiments to show the effectiveness of the shape context matching and 5-group classification approach. We chose 30 random appearances of various people in the video and applied shape context matching to categorize them. Then, we explain the experimental results of three different cases of tracking and estimating the orientation of the body and head.
Figure 10 shows the head regions of 30 different appearances of humans in the captured video. When shape context matching is applied, 24 out of 30 were detected correctly, giving an 80% matching rate. The six mismatched samples were from GroupRight, GroupRightDiagonal, GroupLeft and GroupLeftDiagonal. They had the common characteristic that the head circle was not clear enough to play a strong distinctive role in the outline of the shape, so they were mismatched with groupStraight. For these cases the orientation is mistaken, and such cases should be studied further to improve the matching algorithm.
In Figure 11, two examples are given. Figure 11a illustrates a person turning left over a 10-frame interval. Green lines show the edge contours for the 1st, 3rd, 6th and 10th frames of the interval to display the motion. On the right, the detected and tracked SIFT features are displayed with marks in the order red, blue, yellow and white, as explained in the previous section. The final SOFV vector is depicted in blue, with the resulting turning angle shown in black. The displacement vectors of the SIFT features accumulate in the regions +L and -L, indicating a motion towards the left. Figure 11b shows a difficult case of a head movement with a small angle: the person stops and slightly turns right around himself. The resultant SOFV vector correctly indicates a rotation consistent with his motion, which is the angle shown with the black arrow between the red and blue lines.
Figure 11. Experimental results.
In Figure 12, two cases are presented. The one at the top shows a rotation of a person towards the right, which is a clockwise rotation seen from the camera.

Figure 12. Experimental results of challenging cases.

The orientation of the head is successfully tracked. The one at the bottom presents a case where there is more than one motion to be analyzed. The SIFT features are affected both by the motion of the head and by the motion of the body; the center of the body also moves slightly upwards. Hence, the resultant SIFT feature displacement vectors are not a result of the orientation change alone but also of the global displacement. To solve this case, experiments are still being conducted on taking the effect of the global motion vector into account in the SIFT feature calculations.
2.7 Conclusions and Future Work
In this work, we have presented a human tracking and body/head orientation estimation algorithm for a top-view single camera. A cascaded particle filter was utilized successfully to track people under random walking directions and standing patterns. A new approach combining shape context matching and the optical flow of SIFT features was proposed to detect the orientation change of the body and head. The system presented in this chapter is the initial part of a project on tracking and detecting the body/head orientations of wandering people. In the future, it is planned to be further improved so that it can be applied to more complex situations in public places.
In our experiments we have shown the effectiveness of our method for tracking humans and their body and head orientation under random walking sequences. Tracking is successfully achieved with the two-step cascaded particle filter. The tracking algorithm works under various conditions, even when there are some partial occlusions. However, to detect the orientation of the body and head, the upper half of the body seen from the camera is very important, and any object blocking the head triangle region decreases the success rate. In some such cases the algorithm still works well, since most of it depends on a general evaluation of the displacement vectors of the features. The only part strongly affected by distracting objects is the shape context matching part, which highly depends on the edge contour of the boundary. There are still various cases to be studied, such as people walking with big bags, or occlusion problems of people walking very close to each other. These are left as future work.
In our work, SIFT features were chosen for the algorithm after several experiments with Kanade-Lucas-Tomasi (KLT) features and the recently popular SURF features \cite{Surf}. KLT features were not distinctive enough: once we detect KLT features in each image, they appear on parts of the clothes as well as on the head region, but the number of KLT features in the head region is not sufficient. On the other hand, the number of detected SURF features is not sufficient on the body region. Among all of them, SIFT features give very promising results.
One important issue to be discussed here is performance. Performance evaluation of the system is one of our near-future goals. At the moment our system works very well for one person; however, it is intended to work for multiple people in the environment at the same time. This raises important performance issues, which should be studied very carefully. The interval rates for the body/head orientation calculation, edge detection and shape context matching algorithms are expected to be tuned to obtain the best overall performance.
Chapter 3
Dominant Motion Flow Analysis In Crowds
3.1 Introduction
Dominant motion patterns in videos provide very significant information with a wide range of applications. Since motion patterns are formed by individual motions or the interacting motions of crowds, they help to analyze the social behavior in the environment shown in the video. Furthermore, they are useful for public place design and for activity analysis for security reasons.
Over the years, much research has tried to find motion patterns by using individual object tracking and trajectory classification methods. However, in real-world situations, high-density crowds form most of the cases, and it is not always possible to track individual objects. Crowd scenes can be divided into two groups, unstructured and structured scenes, as in Figure 13. Structured crowds are those where the main motion tracks are defined by environmental conditions, such as escalators, crosswalks, etc. Unstructured crowds are those where objects can move freely in any direction, following any path. So far, the few researchers who have attempted to handle the complexity of crowd scenes have mostly considered structured crowds; detecting dominant motion flows in unstructured crowds still remains a challenging task.

Figure 13. Structured/unstructured crowd scene examples: (a) structured crowd scenes, (b) unstructured crowd scenes.
To solve the problem of calculating the dominant motion flows in both unstructured and structured crowds, we propose a new approach with two distinctive contributions. First, our approach utilizes the motion flows of the SIFT features in a scene. Unlike the corner-based features that have been commonly used in other research, SIFT features can represent characteristic parts of the objects; therefore, their tracking consistency and accuracy are higher during complex motions. Second, we propose a hierarchical clustering framework to deal with the complexity of unstructured motion flows. The entire scene is divided into equally sized local regions. In each local region, flow vectors are classified into groups based on their orientation. Then, location-based classification is applied to find the spatial accumulation of the vectors. Finally, local dominant motion flows are connected to obtain global dominant motion flows.
Related Work:
Tracking individual objects and constructing their trajectories is a common approach for
finding global motion flows, as in [1, 6]. However, for crowd videos, continuous
tracking of individual objects is not possible because of occlusions or tracking failures.
Another approach is to employ instantaneous flow vectors of image features over the
entire image [3-5, 11]. These works use corner-based features, but such features are not
reliable under non-rigid motion, affine transformation or noise. Hence, these studies
consider only structured motions and do not work for unstructured crowds. In [4],
neighborhood information is used, but it fails when a region contains flows in multiple
directions that eliminate each other. In [7], floor fields are proposed, which are
applicable only to structured crowds.
In their work, Brostow et al. [8] track simple image features and cluster them
probabilistically to represent the motion of individual entities. The algorithm works for
detecting and tracking individuals under various motions. However, the crowds in their
examples fall into the category of structured crowds, so the complexity is not as high as
in unstructured crowds like our examples, and the occlusions are at acceptable levels.
In [11, 12], Lin et al. utilize a dynamic near-regular texture paradigm for tracking
groups of people. They try to handle occlusion and rapid movement changes. However,
the type of crowd considered in their paper is still structured, and the flow of the
marching motion is predictable. Another approach that tracks individuals in crowded
scenes using selective visual attention is proposed by Yang et al. [13]. Only the work in
[2] considers unstructured crowd scenes and deals with complex crowds, but they too
try to track individual targets.
Grimson et al. [12] give one of the early examples of activity analysis by tracking
moving objects and learning the motion patterns. They use the tracked motion data to
calibrate the distributed sensors, to construct rough site models, to classify detected
objects, and to learn common patterns of activity for different object classes. Johnson et
al. [5] used neural networks to model motion paths from trajectories. In [3], the
trajectories are accumulated to describe the most frequently followed paths, and this
information is then used to find unusual pedestrian behaviors. Similarly, Wang et al. [9]
use a trajectory-based scene model and classification; they propose similarity measures
for trajectory comparison to segment a scene into semantic regions. Vaswani et al. [10]
modeled the motion of all moving objects by analyzing the temporal deformation of the
"shape" constructed by joining the locations of the objects in each frame. Long-term
trajectory-based approaches are only applicable when continuous tracking of objects is
possible; therefore, they do not work for unstructured crowds, especially when there are
severe occlusions.
In trajectory analysis, sinks are defined as the endpoints of the trajectories [2, 4,
shah]. Stauffer [6] defined a transition likelihood matrix and iteratively optimized it to
estimate sources and sinks. Wang et al. [9] estimated the sinks using the local density
velocity map in a trajectory clustering. However, when continuous tracking is
interrupted by occlusions or noisy data, trajectory calculation results in false sinks. For
unstructured crowds, long-term tracking of objects is not possible, so sinks cannot be
defined reliably. Therefore, short-term motion flow vector based approaches are the
most promising solution for analyzing complex motions with severe occlusions.
Wright and Pless [6] determine persistent motion patterns by a global joint distribution
of independent local brightness gradient distributions and model it with a Gaussian
mixture model. This approach assumes all motion in a frame to be coherent;
independent motions, such as pedestrians moving independently, violate this
assumption. Ali and Shah [1] present an approach inspired by particle dynamics, where
they first determine spatial flow boundaries by advecting particles through the optical
flow field and subsequently perform graph-cut based image segmentation. Their image
sequences do not contain overlapping motions. Andrade et al. [2] use features based on
linear PCA of optical flow vectors as input for a temporal model.
Crowd motion analysis can also be applied to understand the behavior of biological
populations. For example, Betke et al. [2] proposed an algorithm to track a dense crowd
of bats in thermal imagery. Li et al. [3] have recently developed an algorithm for
tracking thousands of cells in phase-contrast time-lapse microscopy images.
The next section gives the overview of the system. In Section 3.3, the
definition and construction of a SIFT motion flow vector are given. In Section 3.4,
generation of dominant local motion flow vectors by hierarchical clustering of SIFT
motion flows is explained. After that, the next section describes how local motion flows
are combined to obtain global motion flows. Experimental results are presented giving
examples of both structured and unstructured crowd data sets. Finally, the chapter is
closed with conclusions and future directions of the work.
3.2 System Overview
In our study, our main goal is to analyze the behavior of crowds by finding the most
popular flow paths of the crowd motion. This may apply to cars on highways or at
crossings, or to pedestrians. Usually people move along predetermined paths; for
instance, they follow escalators, roads and sidewalks. However, when there is no
environmental setting, people choose their own way and walk along random paths. As
mentioned earlier, such crowds show unstructured motion characteristics: the flow of the
motion is not predictable and the occlusion level is usually very high. In dense crowds,
tracking of individuals is not easily achievable and the tracking results are erroneous.
Hence, continuous tracking of crowd motion is not a viable
Figure 14. System Overview
solution. Instead, short-term motion flows can be helpful to describe the motion flows in
a local region of the given scene for a short-period of time.
Short-term motion flows are represented by the displacement vectors of specific image
features between successive frames. Low-level image features are tracked between
successive frames of the video. The displacement vectors of these features describe the
flow of the motion in the corresponding local region; they represent instantaneous
motions in the image. Accumulating these vectors over a long period of time and
merging them yields useful information about the overall motion in the entire image.
However, in the case of unstructured crowds, short-term motion flows represent various
kinds of motion. For complex motions, there will be many flow vectors in many
directions after accumulating all the information over a long period of time. As a result,
there will be a huge and complex set of flow vectors to process in order to analyze the
motion. This forms the most challenging part of the problem. To deal with the
complexity of the entire scene, we first divide the scene into smaller image regions. In
each region, we classify the motion flow vectors to obtain local dominant motion flow
groups, and then we merge the motion flow vectors to obtain a representative local
dominant motion flow vector for each group. Figure 14 shows the system overview of
our approach. After detecting local motion flows, they are connected to obtain the
global dominant motion flows in the entire scene. We use SIFT image features to track
and generate instantaneous motion flow vectors; after experimenting with KLT, SURF
and SIFT, SIFT features gave the best results. In general, there are three main steps:
1. SIFT Motion Flow Vector Generation
2. Hierarchical Clustering of Local Motion Flow Vectors
3. Constructing Global Dominant Motion Flows from Local Motion Flows
These steps are explained in detail in the next sections.
Figure 15. SIFT motion flow vector
3.3 SIFT Motion Flow Vector Generation
In this work, SIFT features are used to calculate the motion flows. SIFT features are
known to be among the features most robust under various transformations. They can be
used to continuously track the foreground objects over many frames. Thus, instead of
calculating the motion flows at each frame, we track the features at certain intervals, as
shown in Figure 15.
This provides two advantages. First, it reduces the noise coming from the background
and from unstable points. Second, the computed motion flow vectors can be used
directly without any pre- or post-processing.
In our experiments we tried various well-known image features, such as KLT [ref],
SIFT [ref], SURF [ref] and Harris corners [ref]. There are important criteria when
deciding which image features to use to represent the motion flow in the image:
1. The features should be distinctive enough to represent a part of each object (person)
in the image, so that the motion vectors are created in accordance with the motion of
people rather than representing random flow (coming from noise, the background or
instantaneous data).
2. There should be a sufficient number of extracted features in the local region. The
density of the features (the number of extracted and correctly tracked features per area)
should be sufficient to represent the amount of motion in the area.
3. The features should be robust enough to be tracked over many frames.
One of the key contributions of this work is that instead of tracking image features in
every video frame, we track them after a certain interval. The reason for this is two-fold.
First, we want to obtain motion flow vectors with meaningful length and direction
information; if the length of a vector is too short, its orientation will be too weak or
noisy to carry correct orientation information for that region. Second, considering the
continuity of the motion over a certain period of time, the flow vectors generated in
each frame carry similar information to the flow vectors generated after skipping a few
frames. By skipping frames, we do not lose information, while gaining in terms of time
and computational complexity.
Figure 16. SIFT motion flow vectors in a given image region.
Each video is segmented into intervals of length "d". SIFT features extracted in a frame
are matched to the corresponding features in the frame after the interval d. After that,
the most representative feature displacement vectors are chosen by thresholding: the
displacement vectors of the features above a certain threshold are defined as flow
vectors. Figure 15 shows the definition of a flow vector, and Figure 16 depicts the flow
vectors.
A flow vector is represented as F(x, y, Θ, t, L), where (x, y) is the center of mass, Θ is
the orientation, L is the length, and t is the frame number.
Figure 17. SIFT motion flows for (a) 100 frames and (b) 400 frames.
In our approach, deciding the interval length "d" is very important. After experiments
with various video data and interval lengths, an interval length of three was, most of the
time, the most informative choice, ensuring tracking continuity and a sufficient number
of features.
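The construction of the flow vectors described above can be illustrated with a short sketch. The following Python/OpenCV code is a minimal illustration only, not the original implementation; the function name sift_flow_vectors and the minimum-length threshold are assumptions introduced for the example. The two frames are assumed to be separated by the interval d.

# Sketch of SIFT motion flow vector generation between frames t and t+d (illustrative only).
import cv2
import numpy as np

def sift_flow_vectors(frame_a, frame_b, t, min_length=2.0):
    # Return flow vectors F = (x, y, theta, t, L) built from matched SIFT features.
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(frame_a, None)
    kp_b, des_b = sift.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    flows = []
    for m in matcher.match(des_a, des_b):
        (xa, ya) = kp_a[m.queryIdx].pt
        (xb, yb) = kp_b[m.trainIdx].pt
        dx, dy = xb - xa, yb - ya
        length = np.hypot(dx, dy)
        if length < min_length:
            # Discard vectors too short to carry reliable orientation information.
            continue
        theta = np.degrees(np.arctan2(dy, dx)) % 360.0
        cx, cy = (xa + xb) / 2.0, (ya + yb) / 2.0   # center of mass of the vector
        flows.append((cx, cy, theta, t, length))
    return flows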
Figure 17(a) and (b) show a part of an unstructured crowd scene with motion flow maps
depicted for two different durations. Motion flows are calculated for 100 frames and
400 frames with interval length 3. Accumulation of flow vectors can be seen in certain
orientations. However, if the variety of orientations in the region increases, the flow
map becomes very complicated; the complexity increases in Figure 17(b). When the
entire scene is considered, the data amount and complexity are much higher. The overall
motion flow map is shown in Figure 18.
Figure 18. Motion flow map and local regions for the entire scene.
The data set containing all motion flow vectors is very large and exhibits a high variety
of orientations spread over the entire image. It is difficult to analyze this data and
cluster it into meaningful groups. In this case, an effective solution is to divide the
entire image into regions. Instead of dealing with the huge data set at once, dividing it
into smaller groups and clustering the vectors locally gives effective and meaningful
results. Figure 18 shows an example of a division of the entire scene into equally sized,
square-shaped local regions.
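Binning the accumulated flow vectors into equally sized square regions can be sketched as follows; the helper name is illustrative and the 60-pixel default mirrors the region size used in the experiments of Section 3.6.

# Sketch of grouping flow vectors F = (x, y, theta, t, L) by the square region containing (x, y).
from collections import defaultdict

def bin_flow_vectors(flows, region_size=60):
    regions = defaultdict(list)
    for (x, y, theta, t, length) in flows:
        key = (int(x) // region_size, int(y) // region_size)   # grid cell index (col, row)
        regions[key].append((x, y, theta, t, length))
    return regions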
In each local region, the motion flow vectors are clustered to obtain the dominant
motion flows of that region. Motion flow vectors have position, orientation and strength
as parameters. If a general clustering is applied using all of these parameters at the same
time, the results are neither informative nor do they serve our aim of calculating the
dominant motion flows; common clustering methods [3] in the literature will not work
effectively. Instead, we give priority to some parameters, such as orientation, and
introduce a hierarchical clustering method to detect the dominant motion flows in the
region, which is explained in the next section.
(a) System flow
(b) Orientation groups
Figure 19. Hierarchical clustering of motion flow vectors.
3.4 Hierarchical Clustering of Local Motion Flow Vectors
Detecting dominant motion flows is defined as finding the orientation and spatial
distribution of the most frequently followed paths in a scene during a given period. If
the motion of the objects in a video shows an organized behavior, then one type of
orientation can be assigned to each location. However, for crowd videos, especially
unstructured crowds, participants move in various directions at different times, and each
spatial location holds more than one orientation type depending on the time. It is not
possible to find the dominant flows with existing methods [3, 4, 11].
Figure 20. Dividing into local regions and creating an orientation histogram for each.
(a) Orientation-based
(b) Spatial
(c) Dominant motion flows
Figure 21. Hierarchical clustering steps.
In this work, the entire scene is divided into smaller regions, in which flow vectors are
easier to separate into meaningful groups. Then, the flow vectors in each region are
clustered with a two-step hierarchical approach to find the local dominant motion flows.
Figures 20 and 21 show the hierarchical clustering steps. Finally, local dominant motion
flows are connected to compute the global dominant motion flows.
Orientation is the most significant information when classifying the flow vectors. In
each local region, the flow vectors are first classified into one of the four main
orientation groups; Figure 19 shows the grouping of orientations. To achieve this, an
orientation histogram is calculated and the major groups are chosen to represent the
region. For example, in Figure 21(a), there are two groups depending on the orientation,
depicted in blue and green. The second step is spatial clustering: the flow vectors in
each orientation group are clustered based on their location, so that accumulations of
vectors in the region are detected, as in Figure 21(b). For this, the "Self-Tuning Spectral
Clustering" method has been applied, considering the evaluation results in [3]. Here,
deciding the number of clusters plays an important role in the results. In our algorithm,
the dimensions of the local regions and the number of clusters are very significant for
obtaining a correct representation of the flow in the area.
The dimensions of the local region can be decided by considering the number of motion
flows per area and the average strength of the motion flow vectors. Deciding the number
of clusters for each step is more straightforward. For orientation-based clustering, there
are four different orientation groups, so the possible choices are one, two, three and
four; by looking at the orientation histogram, the number of clusters is easily
determined. For spatial clustering, "one", "two", "three" and "four" are given as input
to the clustering algorithm, and the one giving the best grouping result is taken as the
number of clusters. To measure how well a given number of clusters provides
meaningful results, we investigate the ratio of the distance between the cluster centers to
the local region dimensions: if the resulting groups are well separated, there is a high
chance of a good clustering.
Figure 22. Local regions and motion flow maps.
Figure 23. Local dominant motion flows.
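A minimal sketch of this two-step clustering for one local region is given below. It assumes the four orientation groups are 90-degree sectors, uses scikit-learn's SpectralClustering as a stand-in for the self-tuning spectral clustering cited above, and the separation threshold is an assumed value, so it should be read as an illustration rather than the exact procedure.

# Sketch of the two-step (orientation, then spatial) clustering inside one local region.
import numpy as np
from sklearn.cluster import SpectralClustering

def orientation_group(theta_deg):
    # Map an angle in degrees to one of four 90-degree orientation groups (0..3).
    return int(theta_deg % 360.0) // 90

def choose_spatial_labels(pts, region_size, max_k=4, min_sep_ratio=0.3):
    # Keep increasing the cluster count as long as the cluster centers stay
    # well separated relative to the local region dimensions.
    labels = np.zeros(len(pts), dtype=int)
    for k in range(2, min(max_k, len(pts)) + 1):
        cand = SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                  n_neighbors=min(5, len(pts) - 1)).fit_predict(pts)
        centers = np.array([pts[cand == c].mean(axis=0) for c in range(k)])
        sep = min(np.linalg.norm(centers[i] - centers[j])
                  for i in range(k) for j in range(i + 1, k))
        if sep / region_size < min_sep_ratio:
            break                      # clusters no longer well separated; keep previous k
        labels = cand
    return labels

def cluster_region(flows, region_size=60):
    # flows: list of (x, y, theta_deg, t, L) inside one local region.
    by_group = {}
    for f in flows:
        by_group.setdefault(orientation_group(f[2]), []).append(f)
    out = {}
    for g, group_flows in by_group.items():
        pts = np.array([[f[0], f[1]] for f in group_flows])
        out[g] = (group_flows, choose_spatial_labels(pts, region_size))
    return out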
After clustering, the local dominant motion flows are calculated by computing the
average location, average orientation and the total number of flow vectors in each
group. Thus the local dominant motion flow of each group is described as L(x, y, w, Θ),
where "w" denotes the number of vectors and is depicted by the width of the flow
vector. Figure 21(c) shows three dominant motion flows calculated in the region. The
center of each dominant motion flow vector is calculated from the average position of
its motion flow vectors. Figures 22 and 23 show example local regions and the local
dominant motion flow vectors calculated in each region.
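Computing the representative L(x, y, w, Θ) for one cluster can be sketched as follows; the use of a circular mean for the average orientation is an assumption of the sketch.

# Sketch of computing a local dominant motion flow L(x, y, w, theta) from one cluster.
import numpy as np

def dominant_flow(cluster_flows):
    # cluster_flows: list of (x, y, theta_deg, t, L) belonging to one cluster.
    xs = np.array([f[0] for f in cluster_flows])
    ys = np.array([f[1] for f in cluster_flows])
    thetas = np.radians([f[2] for f in cluster_flows])
    # Circular mean so that angles near 0/360 degrees average correctly.
    mean_theta = np.degrees(np.arctan2(np.sin(thetas).mean(), np.cos(thetas).mean())) % 360.0
    return (xs.mean(), ys.mean(), len(cluster_flows), mean_theta)   # (x, y, w, theta)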
3.5 Constructing Global Dominant Motion Flows from Local Motion Flows
Once the main flows in the local regions are detected, the next question is how to
combine them to obtain the global motion flows. The basic logic is to start from one
side of the scene, follow the local flows and connect them to the most probable
neighboring flows until the end of the scene. In other words, the entire scene is first
scanned horizontally to connect the horizontal flows; after this, it is scanned vertically to
connect the vertical flows.
Figure 24. Neighborhood schema for local dominant flow vectors.
Orientation groups II and III are treated as horizontal flows, whereas groups I and IV
are treated as vertical flows. The algorithm is as follows:
While scanning, for each local motion flow,
1. Determine the neighbor cells, Ns.
2. In each N, search for the motion flows that are in the same orientation group.
3. Choose the closest one in the neighborhood and connect it with the current flow.
4. If there are no motion flows of the same orientation group in the neighbor cells or
the next neighbor cells, choose the closest motion flow.
Figure 25. Connecting local flows to obtain global flows.
Neighbor cells are defined as the two regions that lie in the direction of the current flow.
For example, in Figure 25, for the horizontal vector, the neighbor cells are c and e, and
the next neighbor cells are c'' and e'. In Figure 25, the vectors labeled A are in
orientation group II: A1 is connected to A3, and A2 and A3 are connected to A4, so they
form the global flow shown with the bold gray line. If there are no vectors in the
neighbor or next neighbor cells, the current flow is connected to the closest vector to
keep the continuity; in that case, there is a dominant abrupt change of motion
orientation in that region. For example, if A4 did not exist, A3 would be connected to B1.
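A simplified sketch of the horizontal scan is shown below. It reduces the neighborhood of Figure 24 to the three cells of the adjacent column, uses illustrative data structures, and is only an approximation of the procedure described above.

# Simplified sketch of connecting local dominant flows into global flows by a horizontal scan.
import math

def connect_horizontal(local_flows, n_cols, n_rows):
    # local_flows: {(col, row): [(x, y, w, theta, group), ...]} per grid cell.
    # Returns chains (lists) of connected local dominant flows.
    chains, used = [], set()
    for col in range(n_cols):
        for row in range(n_rows):
            for f in local_flows.get((col, row), []):
                if id(f) in used:
                    continue
                chain, cur, c, r = [f], f, col, row
                used.add(id(f))
                while c + 1 < n_cols:
                    cands = []
                    for rr in (r - 1, r, r + 1):           # neighbor cells in the next column
                        for g in local_flows.get((c + 1, rr), []):
                            if id(g) not in used:
                                cands.append((rr, g))
                    same = [cg for cg in cands if cg[1][4] == cur[4]]   # same orientation group
                    pool = same if same else cands          # fall back to the closest flow
                    if not pool:
                        break
                    r, cur = min(pool, key=lambda cg: math.hypot(cg[1][0] - cur[0],
                                                                 cg[1][1] - cur[1]))
                    used.add(id(cur))
                    chain.append(cur)
                    c += 1
                if len(chain) > 1:
                    chains.append(chain)
    return chains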
Figure 26. Input data
Figure 27. Global dominant motion flows.
3.6 Experimental Results
In our experiments, the crowd data sets are taken from the data sets of the University of
Central Florida [4] to provide a comparison with the related works. Figure 26 shows
two examples, one from a structured crowd and the other from an unstructured crowd
scene; the SIFT motion flow maps are also depicted on the scenes. The interval length
for both cases is three. Each video has a frame size of 360x480 pixels. Figure 27 shows
the local regions, the dominant motion flows detected in each region, and their
combination yielding the global dominant motion flows in the entire image. Local
regions have a size of 60x60 pixels. There are 48 (6x8) local regions in total. At the first
step, our method gives a detailed result for the global dominant motion flows in the
scene. We can go one step further and combine the resulting flows to generate one
group for each global flow; this yields a more basic representation of the flows in the
scene, shown in Figure 28.
Figure 28. Combining global flows one step more.
Figure 29. Ground truth.
Figure 29 shows the ground truth obtained from a user study in our experiments. Three
people were asked to roughly count the people moving from different sides in the video.
Each of them drew lines according to their perception of the motion flows in the image.
Later, those lines were combined by the same users into groups to obtain a rougher
view. The averaged results of the three users were taken as the ground truth shown in
the figure.
The set on the left in Figure 26 is from an escalator neighborhood, which is a structured
crowd example. The video is analyzed between frames 100 and 460 with an interval of
three. Most of the people move on the escalators, while the people at the far end of the
escalators walk freely. The proposed method successfully detects the global motion
flows in the free-motion regions as well as the flows along the escalators.
The scene on the right is from a street, which is an unstructured crowd example with
high complexity. The video is analyzed between frames 140 and 460 with interval
length three. For the street scene, our system captures the parallelism in the upper half
of the scene, and the crossing of motion flows is also detected in the lower part. In
addition, three main flows of vertical motion are detected, shown in purple in Figure 27.
With the proposed approach, dominant motion flows can be detected at various levels.
3.7 Discussion and Conclusions
In this work, we have presented a new approach to the problem of calculating dominant
motion flows in various crowd scenes. Instead of tracking the objects in the scene and
analyzing their trajectories, we accumulate the instantaneous motion flow vectors over a
long period of time and evaluate the overall motion characteristics of the scene at once.
We utilize the motion flow vectors of SIFT features in the scene. By using SIFT feature
flows and a hierarchical clustering approach, it becomes possible to analyze the motion
flows of both unstructured and structured crowds. The proposed approach detects global
motion flows and, at the same time, gives information about the local characteristics of
the motion flows.
In our approach we investigate the speed, position and orientation of the motion flows.
Time and direction information are not included in the analysis; however, if a more
detailed analysis is required, they can be added as additional parameters to the
clustering algorithm.
We divide the entire scene into local regions, which allows us to achieve a more
detailed and accurate analysis of the scene. Depending on the content of the scene and
on the local region, the speed and orientation of the motion flows can vary; in other
words, a given scene may contain local motion characteristics that differ from each
other. With our approach we can identify these characteristics more accurately and
combine them to obtain their contribution to the global view of the motion flow.
Our approach can be used for all kinds of scenes containing any kind of objects: cars,
pedestrians, or even the motion flows of birds, insects or bacteria populations. The
advantage of our method is that it works for both structured and unstructured crowd
motions. The method can be developed further to include more information depending
on the objects in the scene, and social behavior analysis can be achieved by using it.
Chapter 4
Future Footsteps: An intelligent interactive
system for public entertainment
4.1 Introduction
In the last two decades the collaboration between research in art, design and technology
has increased remarkably. This has led to the development of new systems such as
sophisticated human-computer interaction tools, virtual reality, augmented reality, and
interactive entertainment and education technologies \cite{RefVR,RefH,RefPOST,RefID}.
With the advances in video and sensor systems and computer vision technologies,
various intelligent systems have been developed to acquire information about objects,
humans in most cases, and to analyze their motion or intended behavior. Combined with
artistic concepts from art and design studies, sophisticated, smart and enjoyable
interactive tools and environments have been created.
In this work, we focus on building an interactive environment that can serve many
humans at the same time. Our aim is neither to build a sophisticated tool that people buy
and use as a game console, nor to build a virtual environment that people visit in their
leisure time. The aim of this work is to build an entertainment system that appears
naturally in people's daily life and becomes a part of its flow. The technology thus
meets people in their natural living environment, providing an interactive entertainment
space. It promotes the collaboration of technology and art and introduces technological
advances to the public, while letting them enjoy the results.
The proposed system is composed of a video camera, a projector and a computer. It
employs a multiple-human tracking algorithm using a single camera. People walking in
an indoor environment are tracked by the camera, which is mounted at a high position.
Virtual footsteps are then created continuously and displayed in front of people's feet
while they keep moving. The visualized footsteps are called \textit{future footsteps},
showing their destination. This creates the effect of creating one's own future through
one's current motion; or, seeing one's future in front of one's eyes might affect the
present motion.
Figure 30. Depicted future footstep of a girl.
Figure 31. General view of the area from the camera.
Figure 30 shows an example visualization of a future footstep. Figure 31 shows an
example input image of the scene captured by the camera, which captures the top view
of the area. In order to track multiple humans in real time, we use a blob tracking
method and associate the blobs along a sequence of frames to generate a position
history for each blob. For each blob, the speed and direction of the motion are
calculated and the next position is predicted. Foot-shaped images are displayed at the
predicted location in the direction of movement to visualize the future footsteps.
Additionally, by analyzing the tracking results, we examined the relationship between
the blob area and the number of people in the blob, and multiple footsteps were
displayed according to the number of people in the blob. Another analysis was carried
out to distinguish an adult from a child when a blob contains one person, and smaller
foot images were displayed for a child. This work contains another component, which
accumulates the footstep data and visualizes all the footsteps at the same time in one
image. This gives us a way to describe a scene in terms of its most frequently followed
paths.
The following section introduces related work on visual tracking and interactive
systems. Section 4.2 explains the overall system, and the technical details are described
in Sections 4.3 and 4.4. Experimental results for various situations, together with user
studies of the system, are presented in Section 4.5. Finally, discussions, conclusions and
future work close the chapter.
4.1.1 Related Work
Most interactive systems utilize video cameras and various sensors to track
humans, human body parts and evaluate the motion to provide input or feedback to the
computer or intelligent environment {RefH}-{RefMM}. Tracking the human hands,
head or the objects humans hold in their hands help to extract the gestures or locations
of humans and interact easily with various multimedia devices. Examples of these
applications are virtual games, interactive education systems, 3D visualization systems,
etc... The other group of systems detect facial expressions or eye movements {RefEYE}
to accomplish the user-system communication.
Relatively less number of
researchers¥cite{RefV,RefID,RefMM} build systems for simultaneous use of multiple
humans.
There are various systems utilizing tracking techniques to create interactive
entertainment systems. Most systems track human body parts in order to control tools
by gestures. For example, \cite{RefPOST} uses multiple cameras and markers to track
the hands and legs of a person for controlling virtual tools and games. In the work of
\cite{RefCOL}, a person holds colorful objects which are detected and tracked by the
system to control a game without keyboard or mouse. \cite{RefH} explains the history
and techniques of head tracking used in virtual reality and augmented reality. Without
going into the details of 3D research, which is a different area, a 3D motion model
example for home entertainment can be given: in \cite{RefMM}, multiple cameras are
used to construct a 3D motion model of a human body by tracking various body parts
(head, torso, shoulders, forearms, legs), and the system is proposed for automated home
entertainment.
Here we concentrate on the papers most relevant to our work, which track the entire
body of a human and analyze human motions for interactive applications. There are few
studies in this field. In \cite{RefV} and \cite{RefA}, a multi-human tracking system is
developed for multiple persons to interact with virtual agents in an augmented reality
system. In their application, most of the time only a few humans participate, and the
system focuses on understanding the actions of the interacting humans and agents.
Another group of researchers \cite{RefID} presents an entertainment system in which
multiple users are involved in an interactive dance environment; they use markers and
multiple cameras to capture the motion of various body parts.
Besides the scientific research, there are many commercial applications of human
motion tracking in interactive advertisements, arts and entertainment. Interactive floors
are very popular in shopping centers and public places to attract customers. Some
example products can be found at http://www.eyeclick.com/products_500.html and
http://www.reactrix.com.
Most of the systems introduced so far \cite{RefA,RefPOST,RefID,RefMM} utilize
multiple cameras or combine camera and sensor systems, or they use markers and
sometimes special objects for tracking. A comprehensive study of motion analysis of
multiple humans with multiple cameras is given in the book \cite{RefAVI}. Different
from those, we have developed a low-cost, compact system that uses only a single
camera to track multiple people simultaneously. Our system can be installed easily in
any public place. In the literature there are various tracking systems; many researchers
detect and track image features, such as color \cite{RefVSAI,RefCT}, KLT
\cite{RefKLT}, corners \cite{RefST} or textures \cite{RefPP}, to track humans, and
sophisticated tracking algorithms have been developed to deal effectively with various
situations. Among these, Kalman filter based \cite{RefO,RefKF}, particle filter based
\cite{RefPF} and mean-shift \cite{RefMS} algorithms are very popular. Optical flow
\cite{RefOF} is another common method, which is based on calculating the motion flow
of image features; the paper \cite{RefMOE} presents a good survey of these. Basically,
all these methods require two main steps: the first one is the detection of image
features, and the second is tracking these features in consecutive frames. Hence, they
require complex calculations and are time-consuming. The most important requirement
for an interactive system is high speed within a given range of accuracy. To achieve
real-time, robust tracking of multiple humans, we employ a blob tracking algorithm.
Figure 32. Placement of the system in the airport.
Figure 33. Inside of the box.
4.2 System Overview
Figure 32 shows the placement of the system. It is composed of a video camera, a
computer, a projector and a mirror. All the electronic equipment is placed inside a box,
and the mirror is mounted on the front shutter of the box, as shown in Figure 33. The
box is located at a high position to capture the top view of the target area. People
walking in the area are tracked by the system. Using the tracking results, the locations
and orientations of the future footsteps are predicted. The future footsteps are projected
onto the floor by means of the projector and the mirror in front of the projector lens.
Tracking objects is a widely used computer vision technique, yet it depends on the
object properties and can be very challenging, for example in crowded situations. Our
system is designed for tracking multiple people and visualizing their future footsteps in
real time. This means the tracking algorithm must be very fast and robust in passing its
results, so that the footsteps can be displayed in front of a person at the right time,
before he/she proceeds further. In this work, we apply a blob extraction and association
technique to track people; people are recognized as moving blobs in the video.
Once people are tracked as moving blobs, the area and position history information is
calculated for a short period of time. Using the history data, the orientation and position
of the next footsteps are predicted, as explained in detail in Section \ref{tech}.
Foot-shaped white images are displayed at the predicted positions for each existing
blob, in accordance with the predicted orientation, by using the projector. The following
sections explain the system architecture and the calibration of the camera and projector.
Figure 34. The placement of the camera and the box.
4.2.1 System Architecture
In our system, a CCD camera with 640x480-pixel resolution is used. The camera output
frame rate is set to 6.25 fps in the camera properties instead of the default 25 fps. This
rate is enough to capture the change in people's motion and helps to save time by
reducing the number of frames to be processed. In order to enable a clear projection of
the footsteps and increase their visibility on the floor, high-contrast images are required,
so one of the newest DLP projectors, the NP 4100J, is used. To provide a simple,
compact and good-looking appearance, everything is placed in a white wooden box.
During long hours of operation the projector heats up quickly, so fans are mounted to
ventilate the inside of the box.
The proposed system is composed of three main processes:
I. Real-time Tracking of Multiple Humans
II. Analysis of the Tracking Results
III. Visualization of the Future Footsteps
Figure 35. System architecture
Figure 35 shows the system architecture. First of all, calibration is required to establish
the correspondence between camera coordinates and projector coordinates. The camera
and the projector view the same area from different angles; to establish the
correspondence between the two views, calibration parameters are calculated to convert
coordinates from one view to the other, as explained in the following section. Once the
calibration is done, the processing environment is ready for the rest of the process.
The characteristic of this system is that it creates future footsteps for multiple people in
real time. To achieve such a system, processing time is crucial. The system must track
multiple people and predict their next steps; at the same time, it must visualize the
footsteps at the predicted positions before people move further and pass those positions.
This is done repeatedly until people leave the scene. To achieve this, a simple but
effective tracking algorithm has been developed. Furthermore, the tracking and
visualization parts are designed to work in parallel. Using the video input, humans are
tracked and the necessary information is stored in the designed data structure; using the
data points in this data structure, the visualization part generates the future footsteps.
The two parts work separately in parallel while accessing the same data structure. This
provides the necessary gain in speed for tracking a person and displaying his/her
footsteps at the right time, before he/she proceeds further. However, it requires very
careful synchronization to provide correct localization and timing of the footsteps. Each
predicted footstep is visualized by quickly displaying gradually disappearing
foot-shaped images to create the natural appearance of a stepping foot.
4.2.2 Calibration
The view angle and view space of the camera and the projector differ from each other.
In order to provide the correspondence of coordinates between these two spaces, camera
calibration is required. The calibration is carried out using the calibration functions
provided by Intel's open-source computer vision library (OpenCV) \cite{RefOCV}.
OpenCV's cvFindHomography() function is used to calculate the homography matrix
between the two spaces. Four reference points are chosen both on the projector screen
and in the camera view; the four outermost corners of a chessboard image are usually
chosen as reference points. The calibration step is displayed in Figure 36. A chessboard
image is projected from the projector such that it covers the entire target area. Then,
from the camera view, the reference points are selected by clicking on each of them. In
our system, calibration is done manually once before the start-up of the system, and the
homography matrix is stored. It is then used to convert coordinates from camera space
to projector space with the cvGEMM() function when necessary. In the following
function prototypes, srcpoints represents points on the floor plane in the camera space,
whereas dstpoints represents the corresponding points in the projector space.
void cvFindHomography(const CvMat* srcpoints, const CvMat* dstpoints, CvMat*
homography);
void cvGEMM(const CvArr* src1, const CvArr* src2, double alpha, const CvArr* src3,
double beta, CvArr* dst, int tABC=0);
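For illustration, the same calibration and coordinate conversion can also be written with the modern OpenCV Python API; the four reference points and the projector resolution below are placeholders standing in for the clicked chessboard corners, not actual values from the system.

# Illustrative sketch of the calibration step with the modern OpenCV Python API.
import cv2
import numpy as np

# Placeholder reference points (camera view) and projector-space corners.
camera_pts = np.float32([[102, 80], [530, 95], [515, 400], [120, 410]])
projector_pts = np.float32([[0, 0], [1023, 0], [1023, 767], [0, 767]])

H, _ = cv2.findHomography(camera_pts, projector_pts)   # computed once at start-up and stored

def camera_to_projector(x, y):
    # Map a floor point from camera coordinates to projector coordinates.
    src = np.float32([[[x, y]]])
    dst = cv2.perspectiveTransform(src, H)
    return float(dst[0, 0, 0]), float(dst[0, 0, 1])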
Figure 36. The calibration step is displayed.
4.3 Real-time Tracking of Multiple Humans
In our system, the camera captures the top view of the target area, and humans are
tracked as moving blobs in the scene. Moving blobs are extracted by applying
background subtraction, and the regions above a certain threshold are marked as
foreground regions. Then, blobs in successive frames are connected with a simple
method which evaluates the distance between the blob positions to decide the
associations.
Figure 37. Blob extraction: example input scenes and an enlarged view of a partial area
in the input scene.
4.3.1 Background Subtraction and Blob Extraction
Our system is designed to be used during daytime and/or night time, under various
illumination changes in the environment. The change of sunlight or the lights from the
surroundings can affect the scene. Hence, a dynamically updated background calculation
algorithm is necessary to keep the best possible background scene definition. For
background subtraction, we apply an adaptive algorithm introduced in \cite{RefBGS}:
an average background image is stored for the scene and is continuously updated over
time.
To extract the moving regions, the average background image is subtracted from the
current video frame. Then, the regions with an area above a certain threshold are
detected as blobs.
Here, morphological opening and closing operations might be helpful to define the
borderlines more clearly. However, processing time is very limited, and the blobs
extracted after background subtraction and thresholding are descriptive enough to define
the moving regions. Each blob is defined with the following elements: (c_x, c_y): center
of mass, A: area, P: preceder, flag: flag. So the jth blob in the ith frame is defined as
B[i,j] = {c_x, c_y, A, P, flag}
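A rough sketch of this stage is given below. Since the exact update rule of the adaptive background is not reproduced here, the sketch uses a simple exponential running average, and the update rate, difference threshold and minimum blob area are assumed values rather than the system's parameters.

# Sketch of adaptive background maintenance and blob extraction (illustrative values only).
import cv2
import numpy as np

def update_background(bg, frame, alpha=0.05):
    # Exponential running average: one common way to keep an adaptive background.
    return (1.0 - alpha) * bg + alpha * frame.astype(np.float64)

def extract_blobs(bg, frame, diff_thresh=30, min_area=400):
    # Return blobs as dicts {cx, cy, A, P, flag}; assumes a BGR color frame.
    diff = cv2.absdiff(frame, bg.astype(frame.dtype))
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, diff_thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blobs = []
    for c in contours:
        area = cv2.contourArea(c)
        if area < min_area:
            continue
        m = cv2.moments(c)
        blobs.append({"cx": m["m10"] / m["m00"], "cy": m["m01"] / m["m00"],
                      "A": area, "P": -1, "flag": -1})
    return blobs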
Figure 37 shows example input images from a scene in an airport and a partial region in
a larger view. Figure 37 also shows the result after background subtraction and
demonstrates the extracted blobs with bounding boxes. Moving people are extracted
very well with this algorithm. In our method, a shadow removal algorithm is not
necessary, although there are shadows of objects in the scene. This is for two reasons.
The first is that, because of the multiple lights in the environment, there are multiple
weak shadows spread in various directions around a person; these shadows are
eliminated by the background subtraction algorithm. The second reason is that,
interestingly, the remaining strong shadow regions, which can be larger under strong
sunlight, are actually helpful. We are trying to predict the positions of a person's feet. If
we extracted the blobs without shadows, the center of mass would correspond to
somewhere in the middle of a person's body. However, looking at the extracted regions,
the center of mass of each blob slides towards the bottom part of the person's body (the
feet region as seen from this view) with the contribution of the shadow region. Figure
38 demonstrates an example of this: the point in the square shows the center of mass
excluding the shadow, and the point in the circle shows the center of mass including the
shadow. The point in the circle is closer to the feet region, supporting the aim of our
system.
Figure 38. Blob extraction for a child
Figure 39. Example extracted regions of adults
Figure 40. Example extracted regions of children
In our system, currently, a person and his/her luggage are considered as one blob. Since
they are connected after the extraction process, they are assumed to be one region and
one footstep is visualized for the whole. Distinguishing and eliminating luggage is left
as future work. On the other hand, if the scene is very crowded and people walk very
close to each other, a group of people can be extracted as one connected blob for a long
time, and one footstep is visualized in this situation. However, under normal conditions
people do not walk that close during consecutive frames and most of the time they are
detected and tracked separately.
Figure 39 shows sample extracted blobs of adults and Figure 40 shows sample blobs of
children. There is a big difference in area: for adults, the average area is about 3500
pixels, while for children it is around 1200 pixels. This information is used during the
visualization of the footsteps, and smaller footsteps are displayed for children.
4.3.2 Association of Blobs
Considering the time requirement and the possibility of many people existing in the
scene at the same time, we have developed a fast and robust blob tracking algorithm
which works in real time. Blobs are extracted at each frame with the algorithm described
in the previous section. After blob extraction, the detected blobs should be matched with
their corresponding blobs in the previous frame. To achieve this, the center of mass of
each blob is compared with the centers of mass of the blobs in the previous frame, and
the one with the minimum distance, smaller than a defined connectivity threshold
(C_th), is chosen as the preceder of that blob. If no preceder is found in the previous
frame, the current blob is defined to be the head of a tracking chain and its flag is set to
-1. If a blob is not the head of its chain, its flag is set to 0, indicating that it has a
preceder, and the preceder element P of the blob is set to the index of the preceder blob
in the previous frame.
For each blob in the current frame:
- Find the blob in the previous frame with the minimum distance to the current blob,
i.e. find k which satisfies the following condition, where dist(B[i,j], B[i-1,k]) is the
Euclidean distance between the centers of mass:
min(dist(B[i,j], B[i-1,k])) ……………………………… (11)
- If dist(B[i,j], B[i-1,k]) <= C_th, then the preceder of B[i,j] is B[i-1,k]:
B[i,j]->P = k;
B[i,j]->flag = 0;
- If dist(B[i,j], B[i-1,k]) > C_th, then B[i,j] has no preceder:
B[i,j]->P = -1;
B[i,j]->flag = -1;
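The association step can be sketched in a few lines; the dictionary-based blob representation and the function name are illustrative assumptions, while the fields follow the definition B[i,j] = {c_x, c_y, A, P, flag} given above.

# Sketch of the blob association step between consecutive frames.
import math

def associate_blobs(curr_blobs, prev_blobs, c_th):
    # Set the preceder index P and the flag of every blob in the current frame.
    for b in curr_blobs:
        best_k, best_d = -1, float("inf")
        for k, p in enumerate(prev_blobs):
            d = math.hypot(b["cx"] - p["cx"], b["cy"] - p["cy"])
            if d < best_d:
                best_k, best_d = k, d
        if best_k >= 0 and best_d <= c_th:
            b["P"], b["flag"] = best_k, 0      # preceder found in the previous frame
        else:
            b["P"], b["flag"] = -1, -1         # head of a new tracking chain
    return curr_blobs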
Blob association is calculated for five frames at maximum; in other words, the tracking
data of a person is stored only for the last five frames. This is enough to predict the
speed, orientation and position of the next step. Figure 41 shows the tracked blobs for
three consecutive frames. Blobs A, B, C, D, E exist in the first frame. A, B, D move
upwards and leave the scene after the second frame; C, E move downwards; F appears
in the scene in the second frame and moves upwards. Figure 42 illustrates the data
structure for the three frames in Figure 41. Each blob is represented with a name which
starts with "B" and ends with the number of the blob. The flag and preceder fields are
shown below each blob; the flags indicating the existence of a preceder and the index of
the preceder blob in the previous frame are assigned accordingly.
Figure 41. Blob tracking during three consecutive frames.
4.4 Analysis of Tracking Results and Visualization of Footsteps
In this system, a person's data over the last five frames is stored, which corresponds to a
duration of almost one second (6.25 fps). This data is used in three different ways. First,
the position history information is used to mathematically model the motion of the
person and estimate his/her speed. Second, by analyzing the area information of each
blob in a frame, the number of people in each blob and the total number of people in the
scene are estimated; in high-density crowds, blobs can contain connected groups of
people, and the area of a blob helps us to estimate the number of people in it. Third, an
adult and a child can be distinguished, and footsteps can be displayed smaller for
children by using this information.
Figure 42. Association of blobs stored in the data structure.
Using these data, the speed and orientation of the motion of the person are calculated.
Then, using the position data and the calculated speed and orientation, the position and
orientation of the next step of the person are predicted by linear prediction.
4.4.1 Prediction of Future Footsteps
Numerically, the motion of each blob is modeled by a linear function. The x and y
coordinates of the position of each blob are defined by two linear functions with four
parameters, as in equations (12), where t represents the time. The parameters a_x, b_x,
a_y, b_y are calculated by solving the equations formed from the position data of the
last five frames; equation (13) shows the matrix form. After calculating these
parameters, the speed of each blob is estimated with equation (16). Then, Δt is assigned
for each blob according to the speed. Finally, the new position of the blob is estimated
with equations (14) using a_x, b_x, a_y, b_y and Δt, and the orientation angle of the
motion is computed with equation (15).
x(t) = a_x t + b_x,   y(t) = a_y t + b_y ……………………………………(12)

[x(t)  x(t-1)  x(t-2)  x(t-3)  x(t-4)]^T = [[t, 1], [t-1, 1], [t-2, 1], [t-3, 1], [t-4, 1]] [a_x  b_x]^T
[y(t)  y(t-1)  y(t-2)  y(t-3)  y(t-4)]^T = [[t, 1], [t-1, 1], [t-2, 1], [t-3, 1], [t-4, 1]] [a_y  b_y]^T ……….(13)

x_next = a_x (t + Δt) + b_x,   y_next = a_y (t + Δt) + b_y …………………………….(14)

Θ = arctan(a_y / a_x) ……………………………………….(15)

S = sqrt( (x(t) - x(t-4))^2 + (y(t) - y(t-4))^2 ) / 4 ……………(16)
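The prediction of equations (12)-(16) can be sketched as follows; numpy's least-squares line fit replaces the explicit matrix solution of equation (13), and arctan2 is used as the quadrant-aware form of equation (15). The helper name and the default step size are assumptions of the sketch.

# Sketch of the linear prediction of the next footstep from the last five blob positions.
import numpy as np

def predict_footstep(xs, ys, dt=1.0):
    # xs, ys: positions at times t-4 .. t (oldest first); dt plays the role of the
    # speed-dependent interval Delta-t assigned in the text.
    ts = np.arange(len(xs), dtype=float)          # relative frame times 0..4
    ax, bx = np.polyfit(ts, xs, 1)                # x(t) = ax*t + bx, eq. (12)/(13)
    ay, by = np.polyfit(ts, ys, 1)                # y(t) = ay*t + by
    t_next = ts[-1] + dt
    x_next, y_next = ax * t_next + bx, ay * t_next + by       # eq. (14)
    theta = np.arctan2(ay, ax)                                  # eq. (15), quadrant-aware
    speed = np.hypot(xs[-1] - xs[0], ys[-1] - ys[0]) / 4.0      # eq. (16)
    return x_next, y_next, theta, speed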
Figure 43. Gradually disappearing images of a foot.
Figure 44. Various foot shapes used in the system.
Estimation of the Number of People: As stated before, the area of a blob can indicate
whether a person is an adult or a child. Similarly, if the extracted blob region is
composed of multiple persons, we can estimate the number of people in the region from
its area, and the number of visualized footsteps for that blob can be set accordingly. For
adults, the average area is about 3500 pixels; if the area of a blob is larger than a
multiple of this amount, there are multiple people in the blob and their number is
estimated from the size. If the blob is far smaller than this amount, for example in the
range of 1200 pixels, it is evaluated as a child.
Figure 45. Experimental results: visualization of future footsteps for various people.
Figure 46. Experimental results: various future footsteps.
Luggage can be confusing here: there are many kinds of luggage, small and large, and
although sometimes it is connected with the person in the blob, most of the time it is
detected as a separate blob. At the moment it is assumed that no luggage exists in the
area, and distinguishing people from luggage for more general situations is left as future
work.
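Using the rough figures quoted above (about 3500 pixels for an adult and about 1200 pixels for a child), the interpretation of a blob's area can be sketched as follows; the exact decision thresholds are assumptions of the sketch, not the system's rules.

# Sketch of estimating the number of people and the adult/child distinction from a blob area.
ADULT_AREA = 3500.0    # rough average adult blob area in pixels
CHILD_AREA = 1200.0    # rough average child blob area in pixels

def interpret_blob_area(area):
    # Returns (number_of_people, is_child) estimated from the blob area.
    if area < (ADULT_AREA + CHILD_AREA) / 2.0:     # far smaller than one adult -> child
        return 1, True
    return max(1, int(round(area / ADULT_AREA))), False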
4.4.2 Visualization of Future Footsteps
Once the positions and orientations of the predicted future footsteps are calculated, an
image containing foot-shaped white regions on a black background is projected onto the
floor with the help of a mirror, as in Figure 34. For each footstep, gradually disappearing
images are displayed in succession to create the effect of a foot stepping on the floor. An
example sequence of footstep images is shown in Figure 43. There are three kinds of
foot shapes, shown in Figure 44, and a shape is chosen randomly for each tracked blob.
During the display of the footsteps, at each step an image containing white foot regions
on a black background is constructed and projected onto the floor; the white foot
regions correspond to the predicted future footsteps of the blobs existing in the scene.
Depending on its area, a blob is evaluated as either a child or an adult, and the size of
the foot image is set accordingly. Sometimes, when a person makes a turn, depending on
its sharpness, some delay might occur in the displayed direction. The system works in
real time for multiple people. To achieve fast processing, simple algorithms are used;
even so, it requires a 64-bit machine running at 2.67 GHz with 2 GB of RAM. The
illumination condition is important: less light in the environment helps to display
clearer footsteps.
Figure 47. Experimental results: various future footsteps
Figure 48. Visualization of mostly followed paths from top-view
Figure 49. User Reaction: a woman is jumping right and left to play with the displayed
footsteps.
Figure 50. User Reaction: a little girl is exploring and trying to step on the footsteps.
4.5 Experimental Results
In this section, experimental results for the visualization of the future footsteps are
presented for various situations. The reactions of people experiencing the system are
also introduced. Furthermore, a user study has been carried out to address questions
such as what kind of people pay attention to the footsteps, and how many people notice
them in crowded situations.
4.5.1 Results from Various Situations
Figure 45 shows the visualization of the future footsteps for different people. The
orientations and positions of the footsteps successfully depict the intended future step of
each person. In Figure 46, two kinds of foot shapes are illustrated; Figure 46 also
depicts the future footstep visualization for a person with luggage. In Figure 47, the
results of the future footsteps system are presented from a top view for four people
walking in the area. The image is captured from a height of 12 m, so the foot regions
are a little dim; nevertheless, one can notice the correct orientation and positioning of
the displayed future footsteps for multiple people.
Another use of the proposed system is the detection of the dominant motion paths in the
area. For entertainment purposes, the footsteps are displayed in real time according to
the motion of each person. When we accumulate the predicted footsteps and display all
of them at the same time, we can analyze the overall motion paths: the most frequently
followed paths in the scene can be recognized by looking at the resulting image. This
provides an alternative visualization of the dominant paths taken by the customers in an
indoor environment, which can be useful to architects, social analysts, market analysts,
etc. Figure 48 gives an example of a resulting image after observing 120 frames.
4.5.2 User Study
When people see their footsteps, they express great excitement. Some people make
interesting movements to see what will happen to the footsteps. Some speed up to catch
and step on a footstep, but they can never do it. Others try to find where the footsteps
are coming from by searching around. The woman in Figure 49 jumps from right to left
and left to right and tries to control the future footsteps. In Figure 50, a little girl plays
with the footsteps and tries to catch them while exploring the area. As a result, we can
say that this system serves our purposes: to awaken an interest in technology, to
entertain, to make people think, and to show the recent progress in technology.
During one hour of a study, approximately 900 people passed through the target area in
the airport. Around 50 people recognized the existence of the visualized footsteps. They
usually came in groups, such as couples, university students or tour groups. When
someone in a group noticed the footsteps, he/she showed them to the others, and more
people noticed them. Some groups came with flyers in their hands and tried to find the
footstep visualization area by using the map on the flyer. In total, five people who came
alone noticed the footsteps while looking around randomly.
4.6 Conclusions
This chapter presents an interactive entertainment system for simultaneous use by
multiple humans. The system tracks people walking freely in an indoor environment
and continuously visualizes their predicted future footsteps in front of them while they
keep moving. A real-time multiple-human tracking algorithm has been developed and
combined with a visualization process. A video camera and a projector are located high
above the target area. Humans walking through the area are captured and tracked by the
camera-computer system. Then, using the tracking results, their next locations are
predicted by analyzing the direction and speed of their motion, and foot-shaped images
are displayed at the predicted locations in front of them by using the projector. This
allows people to see a destination created by themselves and gives them the feeling that
they control their own future.
The system can be installed easily in any indoor place. It does not disturb the natural
flow of life, in the sense that it does not affect the movements of people until they
notice the displayed interactive foot shapes. When they notice them, people show
surprise, excitement and astonishment. They try to discover where and why the foot
shapes appear. They play with the system by making various movements; sometimes
they try to step on the visualized images, sometimes they observe the foot images very
carefully. As a result, this interactive entertainment system becomes a part of daily life
and brings technology into people's lives by presenting it with artistic concepts.
4.7 Future Work
We have developed a system which employs tracking of multiple people for an
interactive entertainment application. The most challenging task was to achieve
real-time tracking and to synchronize it with the visualization part; hence, we chose
blob tracking, which is fast and robust for the multiple-people case. However, our aim is
to build a system for indoor environments, an airport building in the current case, where
it is very likely that people will carry luggage, so it is important to distinguish people
from luggage. A line detection algorithm could be used to detect objects with straight
lines (luggage has straight lines) and eliminate them.
In our system, linear prediction works very well most of the time. As long as people
make soft direction changes in their movements, the predicted footsteps are displayed at
the correct place with a correct estimate of the motion direction. However, when people
take sharp turns, the footsteps are displayed with delays. To improve this, another
prediction method is planned, taking faster movements and the variety of human motion
into account.
Displaying all predicted footsteps of a given period of video at once is useful for
describing the movements in the scene. Social analysts or public area designers can
benefit from this kind of visualization to collect statistics. Further analysis of the overall
motion can be added to the system, such as finding the dominant motion flows in a
scene.
Chapter 5
Conclusions
Chapter 6
Discussions and Future Work