ICME09_a01

FOREGROUND SEGMENTATION FOR STATIC VIDEO VIA MULTI-CORE AND MULTIMODAL GRAPH-CUT
Lun-Yu Chang and Winston H. Hsu
National Taiwan University, Taipei, Taiwan
felixc977@gmail.com, winston@csie.ntu.edu.tw
ABSTRACT
We propose a new foreground detection method for static cameras. It merges multiple modalities into a single MRF energy function and achieves much better results than conventional methods. We consider not only the color appearance of the frame but also a spatial constraint on the foreground object, and therefore obtain a more precise foreground shape. In addition, to reduce computing time we divide the MRF problem into several sub-problems and solve them simultaneously on multiple cores, which makes real-time performance achievable. Experiments on real videos demonstrate the improvement over conventional foreground detection methods.
Index Terms: graph cut, foreground detection, surveillance
1. INTRODUCTION
A good foreground segmentation method is an essential pre-processing step for further analysis of images and videos in applications such as surveillance systems, motion capture, and human-computer interaction. High-quality segmentation greatly benefits later processing and influences the overall performance of the whole system. Long-standing research has shown that static-camera videos, that is, videos taken by cameras with fixed position and parameters, have useful properties that can be exploited to separate foreground from background.
Simple background subtraction is the most straightforward method for segmenting moving objects in this kind of video. Given a sequence of frames, foreground objects are extracted by thresholding, pixel by pixel, the difference between the current frame and the previous frame. To reflect temporal changes of the background, a statistical background model can be used instead, as proposed in methods such as the single Gaussian [1] and the Gaussian Mixture Model (GMM) [2], which differ in the statistical features used to represent the background. Rather than by simple thresholding, foreground is then usually determined in a probabilistic way. Although these improvements perform much better than simple background subtraction, especially under non-static backgrounds (e.g. the shadow of a building or swaying branches), they still work pixel-wise and do not consider the continuity between each pixel and its neighbors. As a result, camera noise or sudden lighting changes produce false foreground blobs and holes in detected objects. We refer to all of these works as conventional background subtraction methods.
In [4], Howe et al. take the relationship between neighboring pixels into account by means of graph cut [3]. They construct a graph that incorporates all the differences measured between the current frame and the background model and that encodes the connectivity of the pixels in the image, allowing each pixel to affect those in its local neighborhood. A min-cut algorithm then segments the graph and thereby separates the moving objects from the background. Although this method appears to be good at foreground segmentation, some problems remain unsolved. First, how should the energy functions be defined? Second, graph cut is time-consuming, whereas a surveillance system must work in real time. Static-camera video, however, offers some advantages. Because the scene is static, we only need to attend to the moving objects. Given a sequence of frames, we can also easily generate a background model. Furthermore, many modalities can help to extract the moving objects (e.g. appearance, tracking, shadow removal). If we unite those elements in the MRF energy functions, we should obtain a more powerful foreground detector.
In this work, we follow the formulation in which foreground segmentation corresponds to the global optimum of an energy function, and the energy function is derived from a Markov Random Field (MRF) defined on a graph. Compared with [4][5], our main contributions are:
• Improving foreground segmentation with energy functions that fuse multiple modalities.
• Considering the spatial continuity of an object by constructing a prior constraint term that models the probability of the object's distribution.
• Putting only the moving objects, instead of the whole frame, into the MRF, which reduces the amount of data the MRF must process.
• Dividing and conquering the MRF problem: MMCut is applied to each moving-object bounding box in parallel on multiple cores, leading to much more accurate results than before.
With the above novelties, our method segments the foreground efficiently. Unlike [5], we focus on false negative problems (e.g. a foreground object that is similar to the background) rather than false positive problems (e.g. shadows, reflections). We aim to obtain a more precise object shape rather than an object with broken parts. Moreover, even though MMCut uses an MRF to solve the problem, the work can still be done in real time. Because foreground detection is only a preliminary step for much computer vision research, obtaining a clear shape of foreground objects in real time is essential. For example, a more precise foreground object yields better performance in human activity recognition via the R-Transform [7].
2. MARKOV RANDOM FIELDS
In this section we provide a general overview of the Markov
Random Fields (MRF), and give the notation used in the
paper.
An MRF consists of an undirected graph $G = (V, E)$, where $V$ is a finite set of vertices and $E \subset V \times V$ is a set of edges; a finite set $L = \{l_1, \ldots, l_n\}$ of labels; and a probability distribution $P$ on the space $X = L^V$ of label assignments. An element $x$ of $X$, called a configuration of the MRF, is a map that assigns a label $x_v \in L$ to each vertex $v$. $P(x)$ is a Gibbs distribution relative to $G$,
$$P(x) \propto e^{-E(x)}$$
where, in the simplest interesting case, $E(x)$ can be written in terms of unary and pairwise energy terms as
$$E(x) = \sum_{v \in V} h(x_v) + \sum_{(u,v) \in E} g(x_u, x_v)$$
In the context of the foreground segmentation problem, $V$ corresponds to the set of all pixels in the current frame, $E$ contains all the links between neighboring pixels in the 4-neighborhood sense, and the set $L$ comprises two labels ('fg', 'bg') indicating whether a pixel belongs to the foreground or the background. As each pixel $v$ has a label $x_v \in L$, every configuration $x$ of such an MRF defines a segmentation. Let $D$ denote the set of observed color values in the current frame. Taking a Bayesian perspective, we wish to find the best configuration $x^*$ (i.e. the optimal labels for the pixels in the current frame), which maximizes the posterior probability $P(x|D)$; in other words, we wish to solve the MAP-MRF problem. This can be done by finding the configuration with the minimum energy:
$$x^* = \arg\min_x E(x)$$
Inference of the optimal labels (i.e. the best segmentation) corresponding to the MRF is thus an energy minimization problem.
3. PROPOSED METHOD
In this paper, we propose a new method for real-time foreground detection, aimed especially at the situation in which an object is similar to the background. As mentioned in Section 1, applying graph cut to this problem has already been reported in [5], which combines a background likelihood term and a shadow elimination term in its energy function. That method is good at resolving false positive problems (e.g. shadows, reflections) but not false negative problems (e.g. a foreground object that is similar to the background). Moreover, MRF inference is inefficient over a large number of frames, so reducing computing time and exploiting the properties of a static camera are important problems. Our method therefore uses multiple modalities in the energy function, together with an efficient algorithm that accelerates min cut on the MRF.
We first introduce the main idea of MMCut in Section 3.1 and then describe the MRF energy functions in Section 3.2. Finally, we explain how to divide and conquer the MRF problem in Section 3.3 and how to accelerate it with multiple cores in Section 3.4. Note that the order of processing differs from the order of presentation above.
3.1. MMCut
Traditional foreground detection methods often use a pixel-wise background model, in which each pixel is independent of the other pixels in the frame. Ignoring the spatial continuity of a moving object, however, does not reflect reality.
MMCut composes the energy function from multiple modalities (e.g. color appearance, GMM likelihood, prior constraints) and applies a min-cut algorithm to solve it efficiently. We use an MRF to model the relationship between nodes, so that what remains is to define the energy function. There is a large body of research on foreground detection, and some methods perform well with their own single modality. By using multiple modalities at the same time, we may obtain better detection results than those methods.
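To make the energy formulation of Section 2 concrete, the following minimal sketch evaluates the unary-plus-pairwise energy on a tiny 2x2 grid and finds its minimum by exhaustive search. It is an illustration only: the unary costs in `UNARY`, the constant pairwise penalty `pairwise_weight`, and all function names are assumptions made for this example, and exhaustive search stands in for the min-cut algorithm [3], which is what makes the minimization practical on real frames.

```python
from itertools import product


def grid_edges(height, width):
    """Return the 4-neighborhood edges of a height x width pixel grid."""
    edges = []
    for r in range(height):
        for c in range(width):
            if c + 1 < width:
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < height:
                edges.append(((r, c), (r + 1, c)))
    return edges


def energy(labels, unary, edges, pairwise_weight):
    """E(x): unary costs h(x_v) plus a constant penalty g per edge whose ends differ."""
    data = sum(unary[v][labels[v]] for v in labels)
    smooth = sum(pairwise_weight for u, v in edges if labels[u] != labels[v])
    return data + smooth


def brute_force_minimum(unary, height, width, pairwise_weight):
    """Try every labeling of a tiny grid and return (minimum energy, labeling)."""
    vertices = [(r, c) for r in range(height) for c in range(width)]
    edges = grid_edges(height, width)
    best_energy, best_labels = None, None
    for assignment in product(("bg", "fg"), repeat=len(vertices)):
        labels = dict(zip(vertices, assignment))
        value = energy(labels, unary, edges, pairwise_weight)
        if best_energy is None or value < best_energy:
            best_energy, best_labels = value, labels
    return best_energy, best_labels


# Toy 2x2 frame (hypothetical costs): three pixels clearly prefer 'fg',
# one pixel weakly prefers 'bg'.
UNARY = {
    (0, 0): {"bg": 2.0, "fg": 0.0},
    (0, 1): {"bg": 2.0, "fg": 0.0},
    (1, 0): {"bg": 2.0, "fg": 0.0},
    (1, 1): {"bg": 0.0, "fg": 0.5},
}

no_smoothing = brute_force_minimum(UNARY, 2, 2, pairwise_weight=0.0)
with_smoothing = brute_force_minimum(UNARY, 2, 2, pairwise_weight=1.0)
```

Without the pairwise term the weakly background pixel stays 'bg' and leaves a hole in the object; with the pairwise term the minimum-energy labeling assigns 'fg' to all four pixels, which is the kind of hole filling the MRF formulation is meant to provide.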
[Figure 1: panels (a), (b), (c).]
Figure 1. MMCut uses multiple modalities (e.g. appearance, GMM likelihood, prior constraints) as the MRF energy function and then applies the min-cut algorithm to solve the problem. The resulting silhouette is better than that obtained with any single modality.
3.2. Energy in MMCut
Here we first define the data term and the smoothness term. We assume an energy of the form:
$$H(x_v) = (1 - \tau) L_G + \tau L_A + \delta L_C \tag{2}$$
$$G(x_u, x_v) = (1 - \zeta) \varphi_G + \zeta \varphi_A \tag{3}$$
where $\tau$, $\delta$, and $\zeta$ are fixed parameters. $L_G$, $L_A$, and $L_C$ denote the adaptive GMM likelihood, the appearance likelihood, and the prior constraint, respectively. Eq. (3) is the smoothness term, composed of a GMM term $\varphi_G$ and an appearance term $\varphi_A$. Below we define each of the terms in Eqs. (2) and (3).
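As a small illustration of how the weights in Eqs. (2) and (3) act, the sketch below combines already-computed term values for one pixel and one edge. The function names and the numeric inputs are hypothetical; only the weighting scheme is taken from the equations.

```python
def data_term(l_g, l_a, l_c, tau, delta):
    """H(x_v) of Eq. (2): blend of GMM and appearance terms plus the weighted prior."""
    return (1.0 - tau) * l_g + tau * l_a + delta * l_c


def smoothness_term(phi_g, phi_a, zeta):
    """G(x_u, x_v) of Eq. (3): blend of the GMM and appearance smoothness terms."""
    return (1.0 - zeta) * phi_g + zeta * phi_a


# Hypothetical per-pixel term values, chosen only to show the effect of the weights.
gmm_only = data_term(l_g=2.0, l_a=4.0, l_c=1.0, tau=0.0, delta=1.5)
balanced = data_term(l_g=2.0, l_a=4.0, l_c=1.0, tau=0.5, delta=1.5)
blended_smoothness = smoothness_term(phi_g=1.0, phi_a=3.0, zeta=0.25)
```

With tau = 0 the appearance term drops out entirely, and with zeta = 0 the smoothness term depends on the GMM score alone.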
Adaptive GMM Likelihood
Following the method of Stauffer [2], we use Gaussian mixture models as our background model. A pixel-wise GMM is an effective model for estimating the variations of any pixel. Assuming that we have K mixture components, we define the GMM likelihood term as:
$$L_G(I_v, x_v) = \begin{cases} 1/\mathrm{score}(I(v)), & \text{if } x_v = \text{'bg'} \\ \mathrm{const} - 1/\mathrm{score}(I(v)), & \text{if } x_v = \text{'fg'} \end{cases} \tag{4}$$
where score(I(v)) denotes the GMM likelihood score, which we assume to be of the form:
$$\mathrm{score}(I(v)) = 1 \Big/ \min_k \left[ \frac{I(v) - \mu_k(v)}{\sigma_k(v)} \right]$$
The mean $\mu_k$, variance $\sigma_k$, and weight $w_k$ of the k-th Gaussian component are learnt and updated on-line whenever a new frame is received. In this way, a low background energy is maintained even when the background changes in a regular manner. The constant term corresponds to a uniform distribution over the appearance of foreground objects, as in [5].
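The GMM likelihood term of Eq. (4) can be sketched for a single pixel as follows. This is a minimal sketch under several assumptions that are not stated in the text: the pixel is a scalar (grayscale) intensity, the deviation is taken in absolute value, `EPS` guards against division by zero, and `CONST` is an arbitrary illustrative value for the constant. The component parameters in `MEANS` and `SIGMAS` are made up.

```python
EPS = 1e-6   # hypothetical guard against division by zero
CONST = 3.0  # hypothetical value of the constant in Eq. (4)


def normalized_distance(intensity, means, sigmas):
    """min_k |I(v) - mu_k(v)| / sigma_k(v) over the K components of one pixel."""
    return min(abs(intensity - mu) / sigma for mu, sigma in zip(means, sigmas))


def gmm_score(intensity, means, sigmas):
    """score(I(v)): large when the pixel is close to some background component."""
    return 1.0 / max(normalized_distance(intensity, means, sigmas), EPS)


def gmm_likelihood_term(intensity, means, sigmas, label, const=CONST):
    """L_G of Eq. (4): 1/score for 'bg' and const - 1/score for 'fg'."""
    inverse_score = 1.0 / gmm_score(intensity, means, sigmas)
    return inverse_score if label == "bg" else const - inverse_score


# Hypothetical two-component background model for one pixel.
MEANS = [100.0, 200.0]
SIGMAS = [10.0, 20.0]

near_background = {lab: gmm_likelihood_term(105.0, MEANS, SIGMAS, lab) for lab in ("bg", "fg")}
far_from_background = {lab: gmm_likelihood_term(160.0, MEANS, SIGMAS, lab) for lab in ("bg", "fg")}
```

A pixel half a standard deviation from a background component is cheaper to label 'bg', while a pixel two standard deviations from its nearest component is cheaper to label 'fg' under this choice of `CONST`.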
Appearance Likelihood
After thresholding the GMM likelihood score map and extracting connected components from it, we obtain a rough bounding box of each moving object, as in Figure 1-b. To compute the appearance likelihood, we use the K-means method of [8]. First, we regard the exterior of this box as known background and use it to form the background clusters $K^B$. Second, we form the foreground clusters $K^F$ from the known foreground inside the box. Then, for each node v, we compute the minimum distance from its intensity I(v) to the foreground clusters and to the background clusters:
$$d_i^F(v) = \min_n \lVert I(v) - K_n^F \rVert, \qquad d_i^B(v) = \min_n \lVert I(v) - K_n^B \rVert$$
We define the appearance likelihood term as:
$$L_A(I(v), x_v) = \begin{cases} \dfrac{d_i^B(v)}{d_i^F(v) + d_i^B(v)}, & \text{if } x_v = \text{'bg'} \\ \dfrac{d_i^F(v)}{d_i^F(v) + d_i^B(v)}, & \text{if } x_v = \text{'fg'} \end{cases} \tag{5}$$
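Eq. (5) can be sketched as below, assuming the cluster centers have already been obtained by K-means. The clustering step itself is omitted, the RGB centers in `FG_CENTERS` and `BG_CENTERS` are invented for illustration, and the 0.5 fallback for a pixel at zero distance from both cluster sets is an assumption of this sketch, not something specified in the text.

```python
import math


def min_cluster_distance(color, centers):
    """min_n ||I(v) - K_n||: distance from a color to its nearest cluster center."""
    return min(math.dist(color, center) for center in centers)


def appearance_term(color, fg_centers, bg_centers, label):
    """L_A of Eq. (5) for one pixel; label is 'fg' or 'bg'."""
    d_f = min_cluster_distance(color, fg_centers)
    d_b = min_cluster_distance(color, bg_centers)
    total = d_f + d_b
    if total == 0.0:
        return 0.5  # assumption: no preference when both distances vanish
    return d_b / total if label == "bg" else d_f / total


# Hypothetical RGB cluster centers for one bounding box.
FG_CENTERS = [(200, 30, 30), (180, 60, 40)]
BG_CENTERS = [(90, 90, 90), (120, 120, 120)]

on_background = appearance_term((90, 90, 90), FG_CENTERS, BG_CENTERS, "bg")
on_foreground = appearance_term((200, 30, 30), FG_CENTERS, BG_CENTERS, "fg")
```

The two cases of Eq. (5) always sum to one for a given pixel, so the term expresses a relative preference between the two labels.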
Prior Constraint Likelihood
Since we have already extracted the rough bounding box of the moving object, we assume that the central part of the box has a higher probability of being detected as foreground than the other parts. For example, a human body should not appear broken, as in Figure 1-b.
We define our prior constraint term as:
$$L_C(I(u,v) | x_{uv}) = \begin{cases} \mathrm{const} - \log P_c(u,v), & \text{if } x_{uv} = \text{'bg'} \\ -\log P_c(u,v), & \text{if } x_{uv} = \text{'fg'} \end{cases} \tag{6}$$
Here we use a 2D Gaussian density function $P_c$ to model the likelihood distribution. The center of the box is its mean point, and (u, v) is the coordinate relative to the mean point. We take height/6 as the first standard deviation $\sigma_1$ and width/6 as the second standard deviation $\sigma_2$. The choice of the factor 1/6 is discussed in the experiment section.
Smoothness Term
In [4][5], only one modality is used in the smoothness term. Here we consider two modalities, to see whether they can force the best segmentation to follow salient edges in the current frame. We define our smoothness terms as:
$$\varphi_A(D_u | x_u, x_v) = |x_u - x_v| \cdot \frac{1}{\Omega_{uv}} \tag{7}$$
$$\varphi_G(D_u | x_u, x_v) = |x_u - x_v| \cdot \frac{1}{\Delta_{uv}} \tag{8}$$
where $\Omega_{uv}$ is the distance in color appearance and $\Delta_{uv}$ is the distance in GMM likelihood score:
$$\Omega_{uv} = \lVert I(u) - I(v) \rVert^2, \qquad \Delta_{uv} = \lVert \mathrm{score}(u) - \mathrm{score}(v) \rVert^2$$
3.3. Divide and Conquer
Surveillance video cameras are often fixed, so the sequences show a static background together with moving objects that are semantically meaningful. We therefore only need to attend to the moving objects in a frame, not to the whole frame.
We use the pixel-wise adaptive GMM as our baseline background model, since it is robust to slight changes such as swaying leaves and flickering monitors. Thresholding turns the GMM likelihood score map into a binary map, and extracting the connected components of the binary map gives a rough bounding box for each moving object. Finally, we regard each box as an independent sub-frame and apply MMCut to each of them. MMCut is thus applied to the moving-object boxes instead of the whole frame.
3.4. Multi-Core
As described in Section 3.3, we separate the original foreground detection problem of a frame into several sub-problems; because multi-core CPUs are now much cheaper than before, these sub-problems can be solved in parallel. On a quad-core CPU, for example, up to four moving-object boxes can be processed by MMCut at the same time, as in Figure 2. We therefore not only reduce the workload of graph cut but also solve multiple sub-problems simultaneously.
Moreover, in the adaptive GMM background model, each pixel in the frame is independent of every other pixel, so the background model can also be built in parallel on multiple cores.
With the above methods, we succeed in reducing the computing time of MMCut. The experimental results are given in Table 1.
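The divide-and-conquer and multi-core steps of Sections 3.3 and 3.4 can be sketched as follows: extract one bounding box per connected component of the thresholded binary map, then hand each box to a worker. This is a sketch under stated assumptions. The use of 4-connectivity for components is an assumption, `segment_box` is a hypothetical stand-in that only counts set pixels where MMCut would run, and a thread pool is used purely for brevity. In CPython, threads do not speed up CPU-bound pure-Python work, so a process pool or a native graph-cut routine that releases the interpreter lock would be needed for a real gain.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def bounding_boxes(binary_map):
    """Return inclusive (top, left, bottom, right) boxes, one per 4-connected component."""
    height, width = len(binary_map), len(binary_map[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not binary_map[r][c] or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary_map[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def crop(frame, box):
    """Cut the sub-frame covered by an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


def segment_box(sub_frame):
    """Hypothetical stand-in for MMCut on one sub-frame: counts the set pixels."""
    return sum(sum(row) for row in sub_frame)


def process_frame(binary_map, workers=4):
    """Run the per-box routine on every moving-object box concurrently."""
    boxes = bounding_boxes(binary_map)
    sub_frames = [crop(binary_map, box) for box in boxes]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(segment_box, sub_frames))
    return boxes, results


# Toy thresholded score map with two separate moving objects.
BINARY_MAP = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

boxes, results = process_frame(BINARY_MAP)
```

Because each box is processed independently, the per-box routine needs no shared state, which is what makes the sub-problems straightforward to distribute over cores.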
Figure 2. Instead of putting the whole frame into graph-cut processing, we divide the job into several smaller tasks and apply MMCut to them simultaneously on multiple cores. Our method saves a great deal of time compared with the traditional method.
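For reference, the prior constraint of Eq. (6) and the smoothness weights of Eqs. (7) and (8) from Section 3.2 can be sketched as follows. Several points are assumptions of this sketch rather than statements from the text: the Gaussian is taken to be axis-aligned, u is treated as the vertical offset paired with height/6, `CONST` is an arbitrary illustrative constant, the distances are squared Euclidean distances, and `EPS` is added to the denominator so that identical neighbors do not cause a division by zero. The 'bg' case follows Eq. (6) as printed.

```python
import math

CONST = 10.0  # hypothetical value of the constant in Eq. (6)
EPS = 1e-6    # hypothetical guard for neighbors with identical values


def prior_density(u, v, height, width):
    """P_c: axis-aligned 2D Gaussian centered on the box, sigma1 = height/6, sigma2 = width/6."""
    sigma1, sigma2 = height / 6.0, width / 6.0
    norm = 1.0 / (2.0 * math.pi * sigma1 * sigma2)
    return norm * math.exp(-0.5 * ((u / sigma1) ** 2 + (v / sigma2) ** 2))


def prior_term(u, v, height, width, label, const=CONST):
    """L_C of Eq. (6) as printed: -log P_c for 'fg', const - log P_c for 'bg'."""
    neg_log = -math.log(prior_density(u, v, height, width))
    return neg_log if label == "fg" else const + neg_log


def contrast_weight(label_u, label_v, value_u, value_v, eps=EPS):
    """|x_u - x_v| / distance, shared form of Eqs. (7) and (8) for vector-valued inputs."""
    if label_u == label_v:
        return 0.0
    squared = sum((a - b) ** 2 for a, b in zip(value_u, value_v))
    return 1.0 / (squared + eps)


# A 60x30 box: the foreground prior energy is lowest at the center.
center_fg = prior_term(0, 0, height=60, width=30, label="fg")
offset_fg = prior_term(10, 0, height=60, width=30, label="fg")

# Neighbors with different labels: a strong color edge gives a small weight.
strong_edge = contrast_weight("fg", "bg", (0, 0, 0), (3, 4, 0))
weak_edge = contrast_weight("fg", "bg", (0, 0, 0), (1, 0, 0))
```

The weight in `contrast_weight` is zero for equal labels and shrinks as the two pixels become more different, so cutting between dissimilar neighbors is cheap and the segmentation boundary is drawn toward salient edges.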
4. EXPERIMENTS
We evaluate foreground detection performance on the IPPR2006 video dataset (3 videos of 320×240 pixels in 24-bit RGB), whose three videos we name "Indoor1", "Indoor2", and "Outdoor1". "Indoor1" shows a lobby under poor light, "Indoor2" a hallway with strong sunlight, and "Outdoor1" a road with moving vehicles and people. Three methods are compared with ours: Stauffer's adaptive GMM method [2], adaptive GMM followed by morphological operations (opening and closing), and the best detection result in the IPPR06 competition [9]. We define the error rate as (# stands for 'number of pixels'):
Error per Frame = # in false positives + # in false negatives
To test whether the prior constraint benefits the detection result, we selected two important clips from Dataset1 that contain objects that are hard to detect precisely, here named Broken Object1 and Broken Object2. The results after MMCut processing are shown in Figure 3, where δ is the weight of the prior constraint in the likelihood term. The prior constraint improves the detection result significantly, and the lowest error rate is obtained at δ=1.5.
[Figure 3 plot: error per frame versus δ for Broken Object1 and Broken Object2.]
Figure 3. δ is the weight of L_C. The error rate is reduced significantly when the prior constraint is taken into account, and the lowest error rate is obtained at δ=1.5.
The GMM likelihood is close to the color appearance, but some differences between them remain, so we test which ratio between them gives the best result. As Figure 4 shows, the best result is obtained when only the GMM likelihood is considered, and there are no notable changes on Dataset1 and Dataset2.
[Figure 4 plot: error per frame versus τ for Dataset1, Dataset2, and Dataset3.]
Figure 4. τ is the ratio between the GMM likelihood and the appearance likelihood; τ=0 means that only the GMM likelihood is considered. The chart shows that the best result appears at τ=0.
We also varied the ratio between φ_G and φ_A in the smoothness term. The improvement is slight (about 0.1%), but combining the two modalities in the smoothness term does bring a benefit.
Finally, we evaluate the three datasets and compare our method with the other three methods in Figure 5. Our method outperforms "Adaptive GMM" and "Adaptive GMM + Post Processing", and it also achieves a better result than the best of the IPPR06 competition on Dataset1 and Dataset3. Dataset2 contains many lighting changes, and because we do not consider shadow removal, our performance there is poorer than the best of IPPR2006. Since the shadow removal algorithm proposed in [5] performs excellently, we may incorporate its shadow term into our energy function in future work.
[Figure 5 chart: error per frame on Dataset1, Dataset2, and Dataset3 for Adaptive GMM, Adaptive GMM + Post Processing, MMCut, and IPPR 2006 Best.]
Figure 5. MMCut outperforms "Adaptive GMM + Post Processing" and, on Dataset1 and Dataset3, even the best result of the IPPR 2006 competition. In addition, our method is much more efficient than the IPPR 2006 best, which runs at 0.04 fps, whereas our method runs at 30 fps.
To demonstrate the usefulness of our divide-and-conquer method on the MRF problem, we test our method on Dataset1 and list the frame rate (FPS) for different numbers of cores. The results are given in Table 1.
CPU                        Threads   Frames per second
Intel Core2 Duo 2.16GHz    1         31
Intel Core2 Duo 2.16GHz    2         43
Intel Core2 Quad 2.4GHz    1         36
Intel Core2 Quad 2.4GHz    4         60
Table 1. Frame rate of our method with different numbers of threads. Using multiple threads reduces the time per frame by about 28% on the dual-core system (31 to 43 fps) and by about 40% on the quad-core system (36 to 60 fps, a 67% increase in frame rate). Our method is also faster than Y. Sun's method [5].
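The frame rates reported in Table 1 can be converted into per-frame time savings and speedup factors with a few lines of arithmetic. The sketch below uses only the numbers from the table; the dictionary layout and function names are chosen for this example.

```python
# Frames per second from Table 1, keyed by CPU and thread count.
TABLE_1 = {
    "Intel Core2 Duo 2.16GHz": {1: 31, 2: 43},
    "Intel Core2 Quad 2.4GHz": {1: 36, 4: 60},
}


def time_saved(fps_single, fps_multi):
    """Fraction of per-frame time saved: 1 - (1/fps_multi) / (1/fps_single)."""
    return 1.0 - fps_single / fps_multi


def speedup(fps_single, fps_multi):
    """Ratio of multi-threaded to single-threaded frame rate."""
    return fps_multi / fps_single


summary = {}
for cpu, rates in TABLE_1.items():
    threads = max(rates)  # the multi-threaded configuration for this CPU
    summary[cpu] = {
        "threads": threads,
        "time_saved": round(time_saved(rates[1], rates[threads]), 2),
        "speedup": round(speedup(rates[1], rates[threads]), 2),
    }
```

Time saved and frame-rate gain are different quantities: going from 36 to 60 fps saves 40% of the time per frame, while the frame rate itself rises by about 67%.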
[Figure 6: panels (a), (b), (c), (d).]
Figure 6. Example results of the three methods. Each panel shows, from left to right, the original frame, the result of Stauffer's method, the result of Stauffer's method after morphological operations, and the result of our method. The false negative parts of the foreground objects (where the foreground looks like the background) have been successfully reduced. In addition, our algorithm is more efficient than other graph-cut-based foreground detection methods [4][5][6].
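The error measure used in Section 4, the number of false positive pixels plus the number of false negative pixels in a frame, can be sketched as below. The representation of masks as nested lists of 0/1 values and the toy masks themselves are assumptions of this example.

```python
def error_per_frame(predicted, ground_truth):
    """Count false positive and false negative pixels between two binary masks."""
    false_positives = 0
    false_negatives = 0
    for predicted_row, truth_row in zip(predicted, ground_truth):
        for p, t in zip(predicted_row, truth_row):
            if p and not t:
                false_positives += 1  # detected as foreground, actually background
            elif t and not p:
                false_negatives += 1  # missed foreground pixel (a hole in the object)
    return false_positives + false_negatives


# Toy 3x4 masks: the prediction has one spurious pixel and one hole.
GROUND_TRUTH = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
PREDICTED = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]

frame_error = error_per_frame(PREDICTED, GROUND_TRUTH)
```

The measure weights both kinds of error equally, so a method that fills holes in objects lowers the false negative count without being penalized unless it also adds spurious foreground.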
5. CONCLUSIONS
This work addresses the detection problem that arises when a foreground object is similar to the background. We extend current efforts toward accurate foreground segmentation by using a multi-modal MRF instead of conventional background subtraction. In particular, we focus on what the spatial constraint of an object should be. In addition, because MRF inference is time-consuming, we separate the frame into smaller blocks and process them in the MRF in parallel on multiple cores.
We introduce a better MRF energy function for handling moving objects in static-camera video. The energy function is based on the GMM likelihood, color appearance information, and spatial constraints.
6. REFERENCES
[1] S. Jabri, Z. Duric, H. Wechsler, and A. Rosenfeld, "Detection and location of people in video images using adaptive fusion of color and edge information," Proc. ICPR, 2000.
[2] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," Proc. CVPR, 1999.
[3] Y. Boykov and V. Kolmogorov, "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision," TPAMI, Vol. 26, No. 9, pp. 1124–1137, 2004.
[4] N. R. Howe, "Better foreground segmentation through graph cuts," Technical report, Smith College, 2004.
[5] Y. Sun, B. Yuan, Z. Miao, and C. Wan, "Better foreground segmentation for static cameras via new energy form and dynamic graph-cut," Proc. ICPR, 2006.
[6] K. Takahashi and T. Mori, "Foreground Segmentation with Single Reference Frame Using Iterative Likelihood Estimation and Graph-Cut," Proc. ICME, 2008.
[7] Y. Wang, K. Huang, and T. Tan, "Human Activity Recognition Based on R Transform," Proc. CVPR, 2007.
[8] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, "Lazy Snapping," Proc. SIGGRAPH, 2004.
[9] The Chinese Image Processing and Pattern Recognition Society (IPPR), http://www.ippr.org.tw/.