2011 18th IEEE International Conference on Image Processing

SELECTIVE EIGENBACKGROUNDS METHOD FOR BACKGROUND SUBTRACTION IN CROWDED SCENES*
Zhipeng Hu 1,3, Yaowei Wang 2,3,†, Yonghong Tian 3,†, Tiejun Huang 3
1 Key Lab of Intel. Inf. Proc., Institute of Computing Technology, Chinese Academy of Sciences
2 Department of Electronic Engineering, Beijing Institute of Technology
3 National Engineering Laboratory for Video Technology, Peking University
{zphu, ywwang, yhtian, tjhuang}@jdl.ac.cn
ABSTRACT
In this paper, a selective eigenbackgrounds method is proposed for background subtraction in crowded scenes. In order to train and update the eigenbackground model with frames containing few objects (i.e., clean frames), virtual frames are constructed based on a frame selection map. Then, the eigenbackground that best depicts the background is selected for each pixel based on an eigenbackground selection map. Experimental results show that the proposed method outperforms several state-of-the-art methods in crowded scenes.
Index Terms — background subtraction, video surveillance,
eigenbackground, crowded scenes
1. INTRODUCTION
In surveillance video, many events of interest occur in crowded scenes, where the density of people is high and most of them move very slowly. For instance, a terrorist planting a bomb in a lounge may keep motionless or merely roam in the crowd. For surveillance applications, such a person should be detected as a foreground object. Unfortunately, most foreground detection algorithms based on background subtraction [1-4] classify motionless objects as background, because the most recent video frames are used to update their background models.
The classic eigenbackgrounds method [5] can be used to solve this problem. It employs principal component analysis to obtain eigenvectors (i.e., eigenbackgrounds). Since the principal components of a scene generally correspond to its background, the eigenbackgrounds used to reconstruct the background capture the characteristics of the background. Therefore, this method can identify motionless objects as foreground.
However, in crowded scenes the foregrounds may become principal components and be absorbed into the eigenbackgrounds, since foregrounds cover a large proportion of the scene most of the time. This leads to missed detections and false alarms. To our knowledge, no previous work has addressed this problem. We believe there are two keys to solving it: (1) use clean frames to train and update the background model; (2) use eigenbackgrounds without absorbed foregrounds to reconstruct the background.
In this paper, a selective eigenbackgrounds method is proposed to resolve these two issues. Our contributions are twofold. First, in order to train and update the eigenbackgrounds with frames containing few objects (i.e., clean frames), virtual frames are constructed based on a frame selection map. The frame selection map is a data structure of the same size as the video frame, each element of which corresponds to a pixel location and indicates the frame in which that pixel is clean. "Clean" means that the pixel is an "actual" background pixel. Because virtual frames are constructed from clean pixels, fewer foregrounds are absorbed into the eigenbackgrounds. Second, the eigenbackground that best describes the background is selected for each pixel based on an eigenbackground selection map. Just like the frame selection map, each element of this map corresponds to a pixel location. The difference is that an element of the eigenbackground selection map indicates the best eigenbackground for reconstructing the background value of the corresponding pixel. As the selection is performed per pixel, a better background reconstruction can be obtained for every pixel than with the classic eigenbackgrounds method.
The remainder of this paper is organized as follows.
Section 2 gives the detailed description of our method.
* This work was done in the National Engineering Laboratory for Video Technology and supported by grants from the Chinese National Natural Science Foundation under contracts No. 61035001 and No. 61072095, the National Basic Research Program of China under contract No. 2009CB320906, and the Fok Ying Dong Education Foundation under contract No. 122008.
† Contact author.
Experimental results are shown in Section 3. Section 4
concludes this paper.
2. SELECTIVE EIGENBACKGROUNDS METHOD
The eigenbackgrounds method performs background subtraction by thresholding the difference image between the reconstructed background and the video frame. In the proposed method, the background is reconstructed from selected eigenbackgrounds, which are trained and updated with constructed virtual frames. The details are discussed in this section.
2.1. Virtual Frame Construction
If video frames without foreground objects are used as training and updating samples, a better background model can be obtained. In crowded scenes, it is difficult to find such clean frames, as shown in the upper row of Figure 1. However, it is fairly easy to find clean pixels in the video, so one solution is to construct clean frames by selecting clean pixels from the video. As the constructed frames are not actual frames from the video, they are called "virtual frames".
Fig. 1 Upper row: randomly selected frames from the video; middle row: selected frames with fewer foregrounds from the video; bottom row: virtual frame examples. [images omitted]
Then the problem becomes how to select clean pixels from the video. Inspired by [1], a Gaussian mixture model (GMM) is established for each pixel using the EM algorithm, formulated as

$P(x) = \sum_{i=1}^{N} \omega_i \, \eta(x, \bar{x}_i, \sigma_i)$  (2.1)

$\eta(x, \bar{x}_i, \sigma_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(x - \bar{x}_i)^2}{2\sigma_i^2} \right\}$  (2.2)

where $N$ is the number of Gaussians (5 is used in this paper), and $\omega_i$, $\bar{x}_i$ and $\sigma_i$ are the weight, mean and variance of the $i$th Gaussian, respectively. The Gaussians are then sorted in descending order of $\omega/\sigma$, and the first $B$ of them are used as "clean" Gaussians:

$B = \arg\max_b \left\{ \sum_{i=1}^{b} \omega_i \le Th \right\}$  (2.3)

where $Th$ is a predefined parameter (0.6 is used). Once a pixel matches at least one of these clean Gaussians, it is determined to be a clean pixel. The matching rule is defined as

$|x - \bar{x}_i| \le 2.5\,\sigma_i$  (2.4)

Then virtual frames can be constructed from clean pixels based on a frame selection map, a data structure of the same size as the video frame. Each element of the map corresponds to a pixel location and indicates the frame in which that pixel is clean. Specifically, the selection map is initialized with -1, and when the $i$th training frame is loaded, the value at location $(x, y)$ of the selection map is calculated as

$s(x, y) = \arg\min_f \{\, i - f \;|\; f \le i,\ I_f(x, y)\ \text{is a clean pixel} \,\}$  (2.5)

where $f$ is the index of a training frame. If no frame satisfies Eq. (2.5), $s(x, y)$ is assigned -1. The minimum is used in Eq. (2.5) because a virtual frame should be constructed from the temporally nearest clean pixels.

The frame selection map is updated as the training frames are loaded incrementally; once no element is -1, a virtual frame is constructed from the clean pixels indicated in the map. The map is then reset to -1 and the same process is repeated. Figure 2 describes the process of constructing a virtual frame. On average, hundreds of frames are needed to construct a virtual frame in crowded scenes.
Fig. 2 Virtual frame construction process. [diagram omitted]
Some constructed virtual frames are shown in the bottom row of Figure 1. The middle row shows some manually selected frames with low crowd density. Comparing the two rows, it can be observed that the virtual frames indeed contain fewer foregrounds, which verifies the effectiveness of the proposed selection mechanism.
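To make the construction concrete, the following is a minimal Python sketch of the clean-pixel test (Eq. (2.4)) and the frame-selection-map loop. The array layout and function names are our own illustration, not the authors' implementation; the per-pixel GMM statistics are assumed to be given.

```python
import numpy as np

def clean_pixel_mask(frame, means, sigmas, clean_gauss):
    """Eq. (2.4): a pixel is clean if it matches one of its 'clean' Gaussians."""
    match = np.abs(frame[..., None] - means) <= 2.5 * sigmas   # H x W x N
    return np.any(match & clean_gauss, axis=-1)                # H x W

def construct_virtual_frames(frames, means, sigmas, clean_gauss):
    """Assemble virtual frames from the temporally nearest clean pixels."""
    H, W = frames[0].shape
    sel = np.full((H, W), -1)        # frame selection map, initialized to -1
    virtual = np.zeros((H, W))
    out = []
    for i, frame in enumerate(frames):
        clean = clean_pixel_mask(frame, means, sigmas, clean_gauss)
        sel[clean] = i               # Eq. (2.5): overwrite with the nearest clean frame
        virtual[clean] = frame[clean]
        if (sel >= 0).all():         # map complete: emit one virtual frame
            out.append(virtual.copy())
            sel.fill(-1)             # reset the map and repeat
    return out
```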
2.2. Background Model Update
This step includes the GMM model update for each pixel
and the eigenbackgrounds update.
The clean pixels in an input frame are selected to update the GMM models. Specifically, each pixel is checked against its corresponding "clean" Gaussians. If it matches one of them, it is used to update the first matched Gaussian in a running-average manner:
$\bar{x}(n) = (1 - \alpha)\,\bar{x}(n-1) + \alpha\, x(n)$  (2.6)

$\sigma^2(n) = (1 - \alpha)\,\sigma^2(n-1) + \alpha\,(x(n) - \bar{x}(n))^2$  (2.7)

where $\alpha$ is the update rate (0.01 is used in this paper) and $x(n)$ is the selected background pixel value.
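As a small illustration (variable names are ours), the update of a matched Gaussian in Eqs. (2.6)-(2.7) can be written as:

```python
ALPHA = 0.01  # update rate used in the paper

def update_gaussian(mean, var, x):
    """Running-average update of a matched Gaussian, Eqs. (2.6)-(2.7)."""
    mean = (1 - ALPHA) * mean + ALPHA * x              # Eq. (2.6)
    var = (1 - ALPHA) * var + ALPHA * (x - mean) ** 2  # Eq. (2.7), with the updated mean
    return mean, var
```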
Meanwhile, as input frames are loaded incrementally, the frame selection map is updated and used to construct virtual frames, as described in Section 2.1. The constructed virtual frames are used to update the eigenbackgrounds with Candid Covariance-free Incremental PCA (CCIPCA) [6], which can be formulated as an iterative procedure:
$u_1(n) = x(n) - \bar{x}(n)$  (2.8)

$v_i(n) = \frac{n-1-l}{n}\, v_i(n-1) + \frac{1+l}{n}\, u_i(n)\, u_i^T(n)\, \frac{v_i(n-1)}{\| v_i(n-1) \|}$  (2.9)

$u_{i+1}(n) = u_i(n) - \left( u_i^T(n)\, \frac{v_i(n)}{\| v_i(n) \|} \right) \frac{v_i(n)}{\| v_i(n) \|}$  (2.10)

where $l$ is the updating rate parameter, $u_i(n)$ is the input vector for the computation of the $i$th eigenvector, and the $i$th eigenbackground is given by the normalized vector $v_i(n)/\| v_i(n) \|$.
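A compact sketch of one CCIPCA update step on a vectorized virtual frame, following our reading of Eqs. (2.8)-(2.10); the function signature and the value of the amnesic parameter `l` are our assumptions, not taken from the paper:

```python
import numpy as np

def ccipca_update(x, mean, V, n, l=2.0):
    """One CCIPCA step: update the scaled eigenvectors V (one per row)
    with a new vectorized virtual frame x. Eqs. (2.8)-(2.10)."""
    u = x - mean                                   # Eq. (2.8)
    for i in range(V.shape[0]):
        v = V[i]
        if np.linalg.norm(v) == 0:
            V[i] = u.copy()                        # initialize with the current residual
        else:
            V[i] = ((n - 1 - l) / n) * v \
                 + ((1 + l) / n) * u * (u @ (v / np.linalg.norm(v)))  # Eq. (2.9)
        e = V[i] / np.linalg.norm(V[i])            # normalized eigenbackground
        u = u - (u @ e) * e                        # Eq. (2.10): residual for the next eigenvector
    return V
```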
2.3. Background Reconstruction
In the classic eigenbackgrounds method, the background of the current frame is reconstructed at the frame level:

$B(x) = U U^T (x - \bar{x}) + \bar{x}$  (2.11)

where all the eigenbackgrounds (the columns of $U$) are used to reconstruct the background. In crowded scenes, the reconstructed background may contain foregrounds. This can be understood by examining the insufficiency of the virtual frames and the meaning of the eigenbackgrounds.
In Section 2.1, virtual frames are constructed to train and update the eigenbackgrounds. However, there may be noise and residual foreground fragments in the virtual frames, because a pixel is determined to be clean only in a probabilistic sense. As a result, some trained eigenbackgrounds may contain foreground information. Specifically, the $i$th element $u_j(i)$ of the $j$th eigenbackground corresponds to the $i$th image pixel. This element depicts background in some eigenbackgrounds, while in others it may depict foreground. If all the eigenbackgrounds are used to reconstruct the background, the reconstructed background may contain foregrounds.
Therefore, it is necessary to select the best eigenbackground for each pixel to reconstruct its background value. The absolute value of each element of an eigenbackground can be taken as a measure of scatter for the corresponding pixel. Generally, the scatter of the background is smaller than that of the foreground, because the background is stable while the foreground contains different objects. So the smaller the absolute value of an element is, the more likely it describes background.
Background reconstruction can then be formulated with an eigenbackground selection map, each element of which indicates the index of the eigenbackground that should be used to reconstruct the background value of the corresponding pixel. The value $m(i)$ of the $i$th element of the map is determined as

$m(i) = \arg\min_j |u_j(i)|$  (2.12)

Then the background value $B(i)$ of the $i$th pixel is reconstructed as

$B(i) = \tilde{B}_{m(i)}(i)$  (2.13)

$\tilde{B}_j = u_j u_j^T (x - \bar{x}) + \bar{x}$  (2.14)

where $\tilde{B}_j$ is the background frame reconstructed with the $j$th eigenbackground $u_j$ alone, and $x$ is the input frame vector.
Finally, background subtraction is performed by thresholding the absolute difference image between the input frame and the reconstructed background.
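Putting Eqs. (2.12)-(2.14) and the thresholding step together, a per-pixel selective reconstruction might be sketched as follows; `U` holds one eigenbackground per row, and the threshold value is an arbitrary placeholder of ours, not taken from the paper:

```python
import numpy as np

def selective_subtract(x, mean, U, thresh=30.0):
    """Per-pixel selective reconstruction, Eqs. (2.12)-(2.14), followed by
    background subtraction. x, mean: flattened frames; U: K x D eigenbackgrounds."""
    m = np.argmin(np.abs(U), axis=0)        # Eq. (2.12): selection map, one index per pixel
    proj = U @ (x - mean)                   # projection coefficient per eigenbackground
    B_all = U * proj[:, None] + mean        # Eq. (2.14): K single-eigenbackground reconstructions
    B = B_all[m, np.arange(x.size)]         # Eq. (2.13): pick the selected value per pixel
    return np.abs(x - B) > thresh           # foreground mask
```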
3. EXPERIMENTAL RESULTS
The dataset in the experiments is from TRECVID-SED [7], which is used for event detection. It is one of the most challenging datasets due to its dense crowds, cluttered scenes, severe occlusions and illumination variations. The dataset is captured from 5 cameras, and we use 4 of them in our experiments, as there are few people in camera 4. In the following experiments, no post-processing is performed.
Fig. 3 Subjective comparison results. (a) original frame; (b)(d) reconstructed backgrounds with C-EigenBg and PS-EigenBg, respectively; (c)(e) subtraction results with C-EigenBg and PS-EigenBg, respectively. Green circles: missed detections recovered by PS-EigenBg; red circles: false alarms removed by PS-EigenBg. [images omitted]
The first experiment is a subjective comparison between the classic eigenbackgrounds method (C-EigenBg for short), where no selection operation is employed, and the proposed method (PS-EigenBg for short). Figure 3 shows the comparison results. It can be observed that there are many false alarms and missed detections in the
results of C-EigenBg, because frames full of foregrounds are used as training and updating samples. Thus some foregrounds are absorbed into the eigenbackgrounds, and the reconstructed background contains many foregrounds, as shown in Figure 3(b). However, by constructing virtual frames and selecting the best eigenbackground for each pixel, the missed detections are recovered and many false alarms are removed (the regions marked in Figure 3(c)).
One way to perform an objective analysis is to compute the overlap between the detected regions and the regions inside labeled object contours. But this brings two problems. First, it is labor-intensive. Second, it is more meaningful to evaluate from the object detection perspective, which is the end use of background subtraction. Therefore, we generate the ground truth by labeling bounding boxes on the foreground objects in randomly selected frames [2]. An object is considered detected if more than 30% of the pixels in its bounding box are classified as foreground. The ratio of the number of detected objects to the number of objects in the ground truth serves as the true positive rate, and the percentage of background pixels misclassified as foreground is used as the false positive rate. The ROC curves are shown in Figure 4. It can be observed that the performance of PS-EigenBg exceeds that of C-EigenBg by a large margin, which verifies the effectiveness of our algorithm from the objective evaluation.
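For illustration, the object-level criterion just described could be computed as in the following sketch; the input formats and names are our assumptions:

```python
import numpy as np

def evaluate(fg_mask, gt_boxes, gt_bg_mask):
    """Object-level TP/FP as described above: an object counts as detected
    if more than 30% of the pixels in its bounding box are foreground."""
    detected = 0
    for (x0, y0, x1, y1) in gt_boxes:
        if fg_mask[y0:y1, x0:x1].mean() > 0.30:
            detected += 1
    tp = detected / max(len(gt_boxes), 1)
    # false positive: fraction of true background pixels labeled foreground
    fp = (fg_mask & gt_bg_mask).sum() / max(gt_bg_mask.sum(), 1)
    return tp, fp
```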
Fig. 4 ROC curves of C-EigenBg and PS-EigenBg: (a) camera 1; (b) camera 2; (c) camera 3; (d) camera 5. [plots omitted]
Finally, we compare PS-EigenBg with some state-of-the-art methods in the literature, including the MoG [1], codebook [3] and Bayes [4] methods. Figure 5 gives the objective comparison on camera 1. It can be observed that MoG has the worst performance, as it can only detect moving foreground objects. The Bayes method performs better than our method when the recall is lower than 43%. However, in video surveillance, low recall is intolerable, as too many missed foreground detections could cause disastrous accidents. Therefore, in practical applications where the recall needs to be high, the performance of our method exceeds that of the Bayes method. For the codebook method, the false positive rate is much higher than that of our method at the same true positive rate.

Fig. 5 Objective comparison with state-of-the-art methods on camera 1. [plot omitted]

4. CONCLUSION

A selective eigenbackgrounds method for background subtraction in crowded scenes is proposed, where virtual frames are constructed as training and updating samples for the eigenbackground model, and the eigenbackground that best depicts the background is selected for each pixel to reconstruct the background. Both subjective and objective experiments show the effectiveness of our method.

5. REFERENCES
[1] C. Stauffer and W. E. L. Grimson: Adaptive background mixture models for real-time tracking. CVPR 1999, vol. 2, pp. 246-252.
[2] B. Klare and S. Sarkar: Background subtraction in varying illuminations using an ensemble based on an enlarged feature set. CVPR Workshops, pp. 66-73, June 2009.
[3] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis: Real-time foreground-background segmentation using codebook model. Real-Time Imaging, Special Issue on Video Object Processing, vol. 11, no. 3, pp. 172-185, June 2005.
[4] L. Li, W. Huang, I. Y. H. Gu, and Q. Tian: Foreground object detection from videos containing complex background. ACM Multimedia 2003, pp. 2-10.
[5] N. M. Oliver, B. Rosario, and A. P. Pentland: A Bayesian computer vision system for modeling human interactions. IEEE TPAMI, vol. 22, no. 8, pp. 831-843, 2000.
[6] J. Weng, Y. Zhang, and W. Hwang: Candid covariance-free incremental principal component analysis. IEEE TPAMI, vol. 25, no. 8, pp. 1034-1040, 2003.
[7] TRECVid 2010 Evaluation for Surveillance Event Detection. http://www.itl.nist.gov/iad/mig//tests/trecvid/2010/