
FOREGROUND SEGMENTATION FOR STATIC VIDEO VIA MULTI-CORE AND MULTIMODAL GRAPH-CUT
Lun-Yu Chang and Winston H. Hsu
National Taiwan University, Taipei, Taiwan
felixc977@gmail.com, winston@csie.ntu.edu.tw
ABSTRACT
We propose a new foreground detection method for static
cameras. It merges multiple modalities into the graph-cut
energy function and produces much better results than
conventional methods. We consider not only the color
appearance of the frame but also a spatial constraint on the
foreground object, which yields a more precise foreground
shape. In addition, we divide the graph-cut problem into
several subproblems to reduce the computing time, and
solve them simultaneously on multiple cores. With these
techniques, real-time performance is easily achieved. The
experiments on real videos demonstrate our improvement
over other conventional foreground detection methods.
Index Terms— graph cut, foreground detection,
foreground segmentation, surveillance
1. INTRODUCTION
Foreground segmentation is the problem of segmenting
moving objects from their backgrounds in videos as shown
in Figure 6. It serves as an important cornerstone in many
video-related application systems such as video surveillance,
motion capture, and human-computer interaction. For
example, most surveillance systems (e.g., [7][10]) show that
the success of object tracking and semantic event detection
relies heavily on the quality of the shapes (or silhouettes) of
the foreground objects.
A surveillance video is mostly captured by a static
camera with fixed position and parameters. Based on this
observation, most foreground detection methods focus on
building background models. Any drastic change from the
background model is then considered to be an object in the
foreground. Given a sequence of frames, we can use
statistical information of changes in pixel values to model
the background with either a single Gaussian model [1] or a
Gaussian Mixture Model (GMM) [2]. Next, the foreground
objects can be extracted by thresholding the difference
between the current frame and the background model pixel by
pixel. Generally, a GMM [2] is more robust against a non-static
background (e.g., the moving shadow of a building or swaying
branches). Although the above method for foreground
detection is straightforward, it has several deficiencies.
Such methods work on a pixel-wise basis without
considering the continuity between each pixel and its
neighbors, leading to false foreground segments and broken
objects in the detection results. These errors are usually
caused by sudden lighting changes or other noise in the video.
The graph-cut algorithm [3] (cf. Section 2) is a
promising way to address the above issues in foreground
detection (or segmentation). For example, the authors of [4]
take into account the relationship between neighboring
pixels, but the energy function they use is quite ad hoc, so
the overall improvement is limited. They
use the difference between the current frame and the
background model as the data term in graph cut, and a
constant as the smoothness term. Although this is an
improvement over conventional methods, there are still
some unsolved problems. First, considering only the
background modality is too weak for actual surveillance
video, and their smoothness term cannot describe the
relationship between each pixel and its neighbors. Second,
graph cut is time consuming, while a surveillance system
should work in real time.
Besides, Sun et al. [5] propose a refined method based
on Howe’s work [4]. They use Stauffer’s adaptive
background GMM [2] as the first modality and also include
a shadow removal term that reduces detection errors due to
changes in lighting conditions. Nevertheless, if the moving
foreground object looks similar to the background, Sun’s
method still fails to generate precise detection results.
Moreover, even though they accelerated graph cut to about
12 fps for a frame size of 320×240 on a 2.66 GHz CPU, we
believe there is still room for improvement.
In this work, we propose a two-phase strategy to
improve the efficiency and accuracy for foreground
detection. In the first phase we obtain rough foreground
segmentation results and identify candidate areas that may
contain moving objects. In the second phase we apply
graph-cut to identified candidate areas from the first phase.
In contrast to applying graph-cut to the whole video frame,
our approach significantly shortens the computing time.
Besides, there are many cues informative for foreground
detection; for example, appearance, tracking, and shadow
removal. We also propose a framework to combine these
cues into the graph energy functions to obtain a multi-modal
graph-cut that is more powerful than its predecessors. We
list our main contributions in this work:
Figure 1. The system framework: we divide foreground detection
into several smaller tasks and apply graph cut (cf. Section 3.2) to them
simultaneously on multiple cores, instead of feeding the whole frame into
graph-cut processing.
– Improving foreground segmentation with multi-modal
graph-cut.
– Considering the spatial continuity of the object: we
construct prior constraint terms to model the probability
of the object’s spatial distribution.
– Improving efficiency by running graph-cut on object
areas instead of the whole video frame.
– Dividing and conquering the segmentation of foreground
objects: we apply graph-cut to the bounding box of each
moving object in parallel on a multi-core CPU, leading
to faster computation than before.
– Constructing the adaptive background GMM in parallel
on a multi-core CPU.
Different from and complementary to [5], we focus on
handling false negatives (e.g., a foreground object that is
similar to the background) rather than false positives (e.g.,
shadow, reflection, etc.). We aim to detect the complete
shape of foreground objects instead of broken parts of
objects. Because foreground detection is a preprocessing
step for much computer vision research and many
applications, obtaining the clear shape of foreground objects
in real time is essential.
2. GRAPH CUT FOR FOREGROUND DETECTION
Foreground detection is essentially a segmentation problem
(i.e., separating foreground and background pixels).
Advanced approaches rely on graph-cut methods [3][8],
which construct a specialized graph for the energy function
to be minimized such that the minimum cut on the graph
also minimizes the energy. This can be cast as a binary
labeling problem over an undirected graph G = (V, E),
composed of a vertex set V (representing the pixels), an
edge set E (connecting neighboring pixels with four or eight
connections), and a finite label set for the nodes (pixels).
Foreground segmentation needs only two labels,
x_v ∈ {‘fg’, ‘bg’} (or {1, 0}), denoting foreground and
background respectively.
Figure 2. MMCut utilizes multiple modalities (e.g., appearance (a),
foreground likelihood (b), prior size constraints (c), etc.) in the graph-cut
energy function, then applies the min-cut algorithm to segment the
foreground, yielding much better silhouette results than any single
modality alone.

The labeling problem is essentially finding a proper way
to cut the graph (pixels) into two sets – ‘fg’ or ‘bg’ – and
the quality of the solution relies heavily on the energy
function being optimized. An energy function E(x)
describes the energy of each vertex-label assignment; E(x) can be
written in terms of unary (data) and pairwise (smoothness)
energy terms in the simplest interesting case as

E(x) = \sum_{v \in V} H(x_v) + \sum_{(u,v) \in E} G(x_u, x_v)    (1)
The energy function directly influences which label is
applied to each vertex. The data term describes how likely a
vertex is to be labeled ‘fg’ or ‘bg’; the smoothness term
describes the relation between each pixel and its neighbors.
For example, if two pixels have similar appearance, they
have a high probability of receiving the same label.
A graph-cut algorithm then searches for the labeling with
minimum total energy.
Once the graph-cut model and the energy function are
constructed, we apply the min-cut algorithm [3][8], which
has been shown to be effective and efficient for this
problem. We can then obtain a binary label map as the
silhouette result (cf. Figure 6).
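To make the construction concrete, below is a minimal
sketch of binary min-cut labeling using the PyMaxflow
library; the per-pixel energies H_fg/H_bg and the constant
smoothness weight lam are illustrative placeholders, not the
exact terms defined later in this paper.

    import numpy as np
    import maxflow  # pip install PyMaxflow

    def mincut_segment(H_fg, H_bg, lam=1.0):
        """H_fg/H_bg: per-pixel energies (2D arrays) for labeling 'fg'/'bg'."""
        g = maxflow.Graph[float]()
        nodes = g.add_grid_nodes(H_fg.shape)   # one vertex per pixel
        g.add_grid_edges(nodes, lam)           # 4-connected smoothness links
        g.add_grid_tedges(nodes, H_fg, H_bg)   # t-links carry the data energies
        g.maxflow()                            # min-cut/max-flow solver [3]
        return g.get_grid_segments(nodes)      # True = sink side = 'fg'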
3. PROPOSED METHOD
We propose a framework for efficient foreground
segmentation, especially targeting the situation where the
object is similar to the background (cf. Figure 6). Unlike
prior work, our method leverages multiple modalities to
construct an effective energy function, and divides the
detection into atomic subtasks over multi-core platforms to
meet the efficiency required by surveillance systems. The
system diagram is shown in Figure 1.
3.1. Adaptive Background GMM
Based on Stauffer’s method [2], we first construct the
background model with a pixelwise adaptive GMM, where
each pixel in the frame is modeled independently of all
other pixels. The Gaussian models can be learned and
updated online efficiently, frame by frame [2].
Once we have the pixelwise GMM model for the static
video camera, a pixel v (or a vertex in the graph) within a
new frame can be measured for foreground (‘fg’) likelihood
by the score

score(v) = \min_k \frac{|I(v) - \mu_k(v)|}{\sigma_k(v)}    (2)

where I(v) denotes the intensity of vertex v, and \mu_k and
\sigma_k are the mean and standard deviation of the k-th
Gaussian model, i.e., the one with the least distance
(normalized by its standard deviation) among all the
Gaussian models. This score measures how far the vertex
deviates from the background model; a higher score means
the pixel is more likely foreground. A GMM likelihood
score map is illustrated in Figure 2-b. Thresholding the
score map yields a rough foreground segmentation result.
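For reference, a rough foreground map of this kind can be
obtained with OpenCV’s stock MOG2 implementation of the
adaptive GMM [2]; this sketch is our illustration rather than
the authors’ code, and the video file name and parameter
values are assumptions.

    import cv2

    cap = cv2.VideoCapture("surveillance.avi")   # hypothetical input video
    bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                            detectShadows=False)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)       # online GMM update + per-pixel test
        rough_fg = mask > 0          # thresholding gives the rough segmentation
    cap.release()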
3.2. Multimodality Graph Cut – MMCut
In traditional foreground detection models, each pixel is
treated independently of the other pixels in the frame. As
shown in Figure 2-b, once the background is very similar in
appearance to the moving objects, such pixelwise
algorithms produce apparent false negatives within the
object silhouette. As illustrated in Figure 2-c, we can
remedy the problem by enforcing the spatial continuity of
the moving object.

Besides the continuity property, there are more
promising modalities (e.g., color appearance, foreground
likelihood, object spatial continuity, tracking estimation, etc.)
to exploit for effective foreground detection. We argue for
leveraging multiple cues to boost foreground detection and
propose the multimodality graph cut (MMCut), which
composes the energy functions (data and smoothness terms)
from multiple modalities.
3.3. Energy Functions
In the MMCut energy function, the data and smoothness
terms are defined as follows:

H(x_v) = (1 - \tau) L_G + \tau L_A + \delta L_C    (3)

G(x_u, x_v) = (1 - \zeta) \varphi_G + \zeta \varphi_A    (4)

where \tau, \delta, and \zeta are parameters that linearly
combine energy functions derived from multiple modalities;
L_G, L_A, and L_C are derived from the adaptive GMM
likelihood, appearance similarity, and (spatial) continuity
constraint, respectively, and are explained in the following
subsections. Eq. (4) is the smoothness term, linearly
composed of the GMM term \varphi_G and the appearance
term \varphi_A. Note that the sensitivity of the parameters
and the impact of the different modalities are investigated
thoroughly in the experiments (Section 4).
3.3.1. Adaptive GMM Likelihood. Based on Stauffer’s
method [2], we use the GMM likelihood score as one of the
data terms in Eq. (3). As shown in Figure 6 and Figure 2-b,
the GMM likelihood provides an estimated cue for the
foreground, though troubled by certain errors. Modified
from Eq. (2), we define the GMM likelihood energy
function L_G as:

L_G(x_v) = \begin{cases} const - 1/score(v) & \text{if } x_v = \text{'bg'} \\ 1/score(v) & \text{if } x_v = \text{'fg'} \end{cases}    (5)

In this way, pixels with a higher foreground likelihood tend
to have lower energy when labeled ‘fg’. The constant
parameter corresponds to a uniform distribution over the
appearance of foreground objects, as in [5]; here we set
const = 1, and the experiments show that the segmentation
results are not sensitive to it.
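A minimal sketch of this data term, assuming score is the
per-pixel map from Eq. (2); the eps guard against division
by zero is our own addition.

    import numpy as np

    def gmm_data_term(score, const=1.0, eps=1e-6):
        """score: per-pixel map from Eq. (2)."""
        inv = 1.0 / np.maximum(score, eps)
        L_G_fg = inv            # Eq. (5): low 'fg' energy when score is high
        L_G_bg = const - inv    # Eq. (5): complementary 'bg' energy
        return L_G_fg, L_G_bg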
3.3.2. Appearance Likelihood. Appearance (color, texture,
etc.) information has been shown to be effective in
interactive segmentation such as Lazy Snapping [8], where
the foreground is roughly sketched by users and then
background and foreground models are constructed for the
graph-cut algorithm. Similarly, we are interested in whether
such appearance cues can help foreground segmentation.
After thresholding the GMM likelihood score map, we have
a rough bounding box for the object candidate (Figure 2-b).
To compute the appearance likelihood L_A, we use the
K-means method as in [8]. First, we regard the exterior of
this box as known background and use it to build a
background cluster set K^B; likewise, we build a foreground
cluster set K^F from the known foreground inside the box.
For each vertex v we compute the minimum distances from
its color C(v) to the foreground and background clusters:

d^F(v) = \min_n \|C(v) - K^F_n\|,  d^B(v) = \min_n \|C(v) - K^B_n\|

We favor pixels similar to the initial foreground model as
foreground candidates and define the appearance likelihood
term as:

L_A(x_v) = \begin{cases} d^B(v) / (d^F(v) + d^B(v)) & \text{if } x_v = \text{'bg'} \\ d^F(v) / (d^F(v) + d^B(v)) & \text{if } x_v = \text{'fg'} \end{cases}    (6)
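Below is a sketch under our reading of Eq. (6): colors
outside the candidate box form a background codebook and
colors inside form a foreground codebook via K-means. The
cluster count K = 8 and the use of scikit-learn are
assumptions, and for simplicity we take all pixels inside and
outside the box as samples.

    import numpy as np
    from sklearn.cluster import KMeans

    def appearance_term(frame, box, K=8):
        x0, y0, x1, y1 = box                      # candidate bounding box
        inside = np.zeros(frame.shape[:2], bool)
        inside[y0:y1, x0:x1] = True
        pix = frame.reshape(-1, 3).astype(np.float64)
        # codebooks: interior colors ~ foreground, exterior ~ background
        KF = KMeans(K, n_init=4).fit(pix[inside.ravel()]).cluster_centers_
        KB = KMeans(K, n_init=4).fit(pix[~inside.ravel()]).cluster_centers_
        dF = np.linalg.norm(pix[:, None, :] - KF[None], axis=2).min(axis=1)
        dB = np.linalg.norm(pix[:, None, :] - KB[None], axis=2).min(axis=1)
        den = dF + dB + 1e-6                      # guard against zero distances
        L_A_fg = (dF / den).reshape(frame.shape[:2])   # Eq. (6), x_v = 'fg'
        L_A_bg = (dB / den).reshape(frame.shape[:2])   # Eq. (6), x_v = 'bg'
        return L_A_fg, L_A_bg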
3.3.3. Spatial Continuity Likelihood. We have a rough
bounding box for the candidate object (Figure 2-b).
Naturally, for moving objects (e.g., cars, people, etc.) the
object is mostly complete within its bounding box. Hence,
we assume that the center part of the bounding box has a
higher probability of being detected as foreground than the
other parts, and formulate the spatial continuity constraint
by a Gaussian mask centered at the object center (Figure 2-c).
We can then define L_C for Eq. (3) as:

L_C((i, j) | x_v) = \begin{cases} 1 - \log P_C(i, j) / \log P_C(0, 0) & \text{if } x_v = \text{'bg'} \\ \log P_C(i, j) / \log P_C(0, 0) & \text{if } x_v = \text{'fg'} \end{cases}    (7)

Here we use a 2D Gaussian density function P_C to model
the likelihood distribution, with the center of the box as the
Gaussian mean; (i, j) is the position relative to the top-left
corner (i.e., (0, 0)) of the bounding box it belongs to. We
use 1/6 of the height and width of the bounding box as the
standard deviations for the vertical and horizontal axes of
the 2D Gaussian mask (Figure 2-c). This ratio is determined
in a cross-validation manner.
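A sketch of this prior, using an unnormalized 2D
log-Gaussian with the stated 1/6 standard deviations;
dropping the density’s normalization constant is our
simplification of Eq. (7).

    import numpy as np

    def continuity_term(h, w):
        """Prior energies for an h-by-w bounding box (h, w >= 2)."""
        i, j = np.mgrid[0:h, 0:w]                 # positions from top-left
        ci, cj = (h - 1) / 2.0, (w - 1) / 2.0     # Gaussian mean = box center
        si, sj = h / 6.0, w / 6.0                 # sigma = 1/6 height and width
        logP = -0.5 * (((i - ci) / si) ** 2 + ((j - cj) / sj) ** 2)
        logP0 = -0.5 * ((ci / si) ** 2 + (cj / sj) ** 2)   # value at (0, 0)
        L_C_fg = logP / logP0                     # Eq. (7) 'fg': low at center
        L_C_bg = 1.0 - L_C_fg                     # Eq. (7) 'bg'
        return L_C_fg, L_C_bg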
3.3.4. Smoothness Term. The smoothness term constrains
the labels of neighboring pixels – pixels with similar
appearance should have similar labels (0 or 1). In [4][5],
only a single modality was considered for the smoothness
term. We are interested in whether multiple modalities can
build a better smoothness term for foreground segmentation.
We define our smoothness terms as:

\varphi_G(x_u, x_v) = |x_u - x_v| \cdot \frac{1}{\Delta_{uv}}    (8)

\varphi_A(x_u, x_v) = |x_u - x_v| \cdot \frac{1}{\Omega_{uv}}    (9)

where \Omega_{uv} = \|C(u) - C(v)\| is the distance in color
appearance and \Delta_{uv} = |score(u) - score(v)| is the
distance in GMM likelihood score.
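A sketch of these pairwise weights for horizontally
neighboring pixel pairs (vertical pairs are analogous); the
eps guard against division by zero is our own addition.

    import numpy as np

    def smoothness_weights(color, score, zeta=0.5, eps=1e-6):
        """Pairwise weights between each pixel and its right neighbor."""
        c = color.astype(np.float64)
        omega = np.abs(c[:, 1:] - c[:, :-1]).sum(axis=-1)   # color distance
        delta = np.abs(score[:, 1:] - score[:, :-1])        # score distance
        phi_A = 1.0 / (omega + eps)   # similar colors -> strong tie, Eq. (9)
        phi_G = 1.0 / (delta + eps)   # similar scores -> strong tie, Eq. (8)
        return (1 - zeta) * phi_G + zeta * phi_A            # Eq. (4)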
3.4. Divide and Conquer for Detection Efficiency
For static video cameras, only moving objects are
semantically meaningful. We devise three strategies to
speed up the refinement of segmentation results. First,
MMCut is applied only to the small regions of the candidate
moving objects (e.g., t1, t2, and t3 in Figure 1) instead of
the entire video frame as in prior works [5][6][8], since
calculating the energy functions generally requires time
quadratic in the region size. Such candidate objects can be
produced efficiently by the GMM foreground detection
algorithm.

Second, we regard each candidate moving object as an
independent subtask and apply MMCut to each one, as
illustrated in Figure 1. We then leverage the emerging
multi-core framework to refine the candidate objects in
parallel. For example, with a quad-core CPU, we can
process at most four moving objects at the same time.
Furthermore, in GMM background modeling (Section
3.1), each pixel is independent of all others, which can
further benefit from multiple cores when constructing and
testing the pixelwise GMM models. Note that the second
and third strategies are realized with OpenMP (an API that
supports multi-platform shared-memory multiprocessing;
see http://openmp.org/). We will demonstrate the significant
efficiency improvement in the experiments (Section 4).
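The paper realizes the parallel strategies with OpenMP in
native code; the sketch below only illustrates the idea of
treating each candidate box as an independent subtask,
using a Python process pool, where mmcut_refine stands for
a hypothetical per-box MMCut routine.

    from concurrent.futures import ProcessPoolExecutor

    def refine_all(frame, boxes, mmcut_refine):
        """Run the per-box refinement on all candidate boxes in parallel."""
        with ProcessPoolExecutor() as pool:       # one worker per CPU core
            futures = [pool.submit(mmcut_refine, frame, box) for box in boxes]
            return [f.result() for f in futures]  # per-box binary masks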
4. EXPERIMENTS
We evaluate the foreground segmentation performance
using the foreground detection benchmark from IPPR2006
(three 24-bit color videos at 320×240 resolution) [9].
Here we name the three video sequences as “Indoor1,”
“Indoor2,” and “Outdoor1.” The scene of “Indoor1” is a
lobby under poor light, “Indoor2” is a hallway with strong
sunlight, and “Outdoor1” is a road that has moving vehicles
and people. The performance metric for the benchmark is
the “error per frame,” defined as (# stands for “number of
pixels”):

error per frame = # false positives + # false negatives
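As we read it, the metric simply counts misclassified pixels
against a ground-truth mask:

    import numpy as np

    def error_per_frame(pred, gt):
        """pred, gt: boolean foreground masks of the same shape."""
        fp = np.logical_and(pred, ~gt).sum()      # false-positive pixels
        fn = np.logical_and(~pred, gt).sum()      # false-negative pixels
        return fp + fn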
To test whether the prior constraint benefits the
detection results, we selected two challenging clips from
dataset1 that contain objects hard to detect precisely (clip1
contains frames 53–87; clip2 contains frames 263–299),
named here “Broken Object1” and “Broken Object2.” The
results after MMCut processing are shown in Figure 3,
where δ is the weight of the prior constraint in the
likelihood term.
Figure 3. δ is the weight of L_C. The error rate is reduced significantly
when the prior constraint is taken into account, with the lowest error
rate at δ = 1.5.
We can see that the prior constraint benefits the detection
results significantly, and we obtain the lowest error rate at
δ = 1.5. Figure 6 shows some silhouette results of applying
MMCut and two other methods on these two clips.
The GMM likelihood is closely related to color
appearance, but there are still some differences between
them, so we test which ratio between them yields the best
result. As Figure 4 shows, we obtain the best result when
considering only the GMM likelihood, with no notable
changes on dataset1 and dataset2. Since using the GMM
likelihood alone is even better than combining the two
modalities, and computing the appearance likelihood costs
much more time than the GMM (because of the K-means
clustering), we could remove this modality from the data
term and incorporate other modalities (e.g., tracking,
shadow elimination) instead. Our future work will focus on
this goal.
Figure 4. τ is the ratio between the GMM likelihood and the appearance
term in the data term; τ = 0 means only the GMM likelihood is
considered. The best result appears at τ = 0.
Besides, we tested varying the ratio between φ_G and
φ_A in the smoothness term. The result does improve, even
though the improvement is slight (about 0.1%), so there is
some benefit in combining these two modalities in the
smoothness term.
We further compare the proposed method with three
rivals: the GMM approach [2], the GMM with
post-processing by morphological operators (opening and
closing), and the best detection result in the benchmark [9].
Figure 6. Example results of three methods. Each subfigure shows, from
left to right, the original frame, Stauffer’s method, Stauffer’s method after
morphological operations, and our method. The false-negative parts of the
foreground objects (where the foreground looks like the background) have
been successfully reduced. Besides, our algorithm is more efficient than
other graph-cut-based foreground detection methods [4][5][6].

From Figure 5, we can see that the proposed method
outperforms the rival methods in terms of error per frame
(the lower, the better). The segmentation cues from multiple
modalities (via
energy functions) and the rigorous consideration of pairwise
pixels (via graph cut) significantly boost the detection
performance. We observe that the largest contribution comes
from the consideration of spatial continuity, which can be
verified in Figure 6. Meanwhile, our method is superior to
the best benchmark method on dataset1 and dataset3. Note
that our proposed method is also much more efficient,
processing 30 frames per second (fps) compared with 0.04
fps for the best benchmark method [9] and 15 fps for [5].
The lighting changes in dataset2, together with the fact that
we do not yet consider shadow removal, cause our method
to perform worse than the best benchmark method there.
We believe the performance can be further boosted by
considering shadow removal cues [5].
Figure 5. MMCut performs significantly better than the rival methods and
even the best result in the benchmark. Besides, our method is much more
efficient than the best benchmark method (30 fps vs. 0.04 fps).
CPU                          Threads    Frames per sec.
Intel Core2 Duo 2.16 GHz        1             31
Intel Core2 Duo 2.16 GHz        2             43
Intel Core2 Quad 2.4 GHz        1             36
Intel Core2 Quad 2.4 GHz        4             60

Table 1. The proposed method improves efficiency by 28% on the
dual-core system and by 67% on the quad-core system.
The efficiency benefits of our “divide and conquer”
strategies on multi-core platforms are demonstrated in
Table 1. The proposed method improves the detection frame
rate by up to 67% on the quad-core platform.

5. CONCLUSIONS

In this work we address the detection problem that arises
when a foreground object is similar to the background. To
obtain more precise segmentation results, we apply MMCut
to the output of Stauffer’s method to refine the
segmentation. For a more efficient detection system, we
feed the moving-object bounding boxes into MMCut instead
of the whole frame, and update the adaptive GMM and run
MMCut on multiple cores. We aim to provide a more
flexible, efficient, and automatic foreground segmentation
method; the experiments show that it indeed reduces the
false-negative problem and speeds the whole system up to
real time (60 fps), as shown in Section 4. As future work,
we would like to incorporate more useful modalities and
refine the bounding-box extraction method (e.g., with
tracking).
6. REFERENCES
[1] S. Jabri, Z. Duric, H. Wechsler, and A. Rosenfeld, “Detection
and location of people in video images using adaptive fusion of
color and edge information,” Proc. ICPR, 2000.
[2] C. Stauffer and W. E. L. Grimson, “Adaptive background
mixture models for real-time tracking,” Proc. CVPR, 1999.
[3] Y. Boykov and V. Kolmogorov, “An experimental comparison
of min-cut/max-flow algorithms for energy minimization in
vision,” TPAMI, Vol. 26, No. 9, pp. 1124–1137, 2004.
[4] N. R. Howe, “Better foreground segmentation through graph
cuts,” Technical report, Smith College, 2004.
[5] Y. Sun, B. Yuan, Z. Miao, and C. Wan, “Better foreground
segmentation for static cameras via new energy form and dynamic
graph-cut,” Proc. ICPR, 2006.
[6] K. Takahashi and T. Mori, “Foreground segmentation with
single reference frame using iterative likelihood estimation and
graph-cut,” Proc. ICME, 2008.
[7] Y. Wang, K. Huang, and T. Tan, “Human activity recognition
based on R transform,” Proc. CVPR, 2007.
[8] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, “Lazy Snapping,”
Proc. SIGGRAPH, 2004.
[9] The Chinese Image Processing and Pattern Recognition Society
(IPPR), http://www.ippr.org.tw/.
[10] I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: Real-time
surveillance of people and their activities,” IEEE Trans. Pattern
Anal. Machine Intell., 2000.