2011 18th IEEE International Conference on Image Processing

SELECTIVE EIGENBACKGROUNDS METHOD FOR BACKGROUND SUBTRACTION IN CROWDED SCENES*

Zhipeng Hu1, 3, Yaowei Wang2, 3†, Yonghong Tian3†, Tiejun Huang3
1 Key Lab of Intel. Inf. Proc., Institute of Computing Technology, Chinese Academy of Sciences
2 Department of Electronic Engineering, Beijing Institute of Technology
3 National Engineering Laboratory for Video Technology, Peking University
{zphu, ywwang, yhtian, tjhuang}@jdl.ac.cn

ABSTRACT

In this paper, a selective eigenbackgrounds method is proposed for background subtraction in crowded scenes. In order to train and update the eigenbackground model with frames containing few objects (i.e. clean frames), virtual frames are constructed based on a frame selection map. Then, the eigenbackground that best depicts the background is selected for each pixel based on an eigenbackground selection map. Experimental results show that the performance of the proposed method is better than that of several state-of-the-art methods in crowded scenes.

Index Terms — background subtraction, video surveillance, eigenbackground, crowded scenes

1. INTRODUCTION

In surveillance video, many events of interest occur in crowded scenes, where the crowd density is high and most people move very slowly. For instance, when a terrorist is planting bombs in a lounge, he may keep motionless or just roam in the crowd. It is expected that the terrorist can be detected as a foreground object in such surveillance applications. Unfortunately, most foreground detection algorithms based on background subtraction [1-4] classify motionless objects as background, because the most recent video frames are used to update the background models.

The classic eigenbackgrounds method [5] can be used to solve this problem. It employs principal component analysis to obtain eigenvectors (i.e. eigenbackgrounds). Since the principal components of a scene generally describe its background, the eigenbackgrounds used to reconstruct the background capture the characteristics of the background, and the method can therefore identify motionless objects as foreground. In crowded scenes, however, the foreground covers a large proportion of the scene most of the time, so it may become part of the principal components and be absorbed into the eigenbackgrounds. This leads to missed detections and false alarms. To our knowledge, this problem has not been studied in the literature. We believe there are two key issues in solving it: (1) use clean frames to train and update the background model; (2) use eigenbackgrounds without absorbed foreground to reconstruct the background.

In this paper, a selective eigenbackgrounds method is proposed to address these two issues. Our contributions are twofold. First, in order to train and update the eigenbackgrounds with frames containing few objects (i.e. clean frames), virtual frames are constructed based on a frame selection map. The frame selection map is a data structure of the same size as the video frame; each of its elements corresponds to a pixel location and indicates the frame in which that pixel is clean. "Clean" means that the pixel is an actual background pixel. Because virtual frames are constructed from clean pixels, less foreground is absorbed into the eigenbackgrounds. Second, the eigenbackground that best describes the background is selected for each pixel based on an eigenbackground selection map.
Just like the frame selection map, each element of this map corresponds to a pixel location. The difference is that an element of the eigenbackground selection map indicates the eigenbackground that can best be used to reconstruct the background value of the corresponding pixel. As the selection is performed for each pixel, better background reconstruction results are obtained for all pixels than with the classic eigenbackgrounds method.

The remainder of this paper is organized as follows. Section 2 gives a detailed description of our method. Experimental results are shown in Section 3. Section 4 concludes this paper.

* This work was done in the National Engineering Laboratory for Video Technology and supported by grants from the Chinese National Natural Science Foundation under contracts No. 61035001 and No. 61072095, the National Basic Research Program of China under contract No. 2009CB320906, and the Fok Ying Dong Education Foundation under contract No. 122008.
† contact author

2. SELECTIVE EIGENBACKGROUNDS METHOD

The eigenbackgrounds method performs background subtraction by thresholding the difference image between the reconstructed background and the video frame. In the proposed method, the reconstructed background is obtained from the selected eigenbackgrounds, which are trained and updated with constructed virtual frames. The details are described in this section.

2.1. Virtual Frame Construction

If video frames without foreground objects are used as training and updating samples, a better background model can be obtained. In crowded scenes, it is difficult to find such clean frames, as shown in the upper row of Figure 1. However, it is fairly easy to find clean pixels in the video, so one solution is to construct clean frames by selecting clean pixels from the video. As the constructed frames are not actual frames of the video, they are named "virtual frames".

Fig. 1 Upper row: randomly selected frames from the video; middle row: some selected frames with fewer foregrounds from the video; bottom row: virtual frame examples.

The problem then becomes how to select clean pixels from the video. Inspired by [1], a GMM is established for each pixel using the EM algorithm, which is formulated as:

P(x) = \sum_{i=1}^{N} \omega_i \, \eta(x, \bar{x}_i, \sigma_i)    (2.1)

\eta(x, \bar{x}_i, \sigma_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(x - \bar{x}_i)^2}{2\sigma_i^2} \right\}    (2.2)

where N is the number of Gaussians (5 is used in this paper), and \omega_i, \bar{x}_i and \sigma_i^2 are the weight, mean and variance of the ith Gaussian respectively. The Gaussians are then sorted in descending order of \omega/\sigma, and the first B of them are used as "clean" Gaussians:

B = \arg\min_b \left\{ \sum_{i=1}^{b} \omega_i > Th \right\}    (2.3)

where Th is a predefined parameter (0.6 is used). Once a pixel matches at least one of these clean Gaussians, it is determined to be a clean pixel. The matching rule is defined as

|x - \bar{x}_i| < 2.5\,\sigma_i    (2.4)
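The clean-pixel test of Eqs. (2.1)-(2.4) can be sketched in Python as follows. This is only an illustrative sketch under our own naming (weights, means, sigmas, Th, clean_idx), not the authors' implementation; the per-pixel GMM fitting itself is assumed to be available, e.g. from the online update of [1].

```python
import numpy as np

# Illustrative sketch of the clean-Gaussian selection (Eq. 2.3) and the
# matching rule (Eq. 2.4) for a single pixel; all names are our own choices.
def select_clean_gaussians(weights, sigmas, Th=0.6):
    """Indices of the first B Gaussians, ranked by omega/sigma in descending
    order, whose cumulative weight exceeds Th."""
    order = np.argsort(-(weights / sigmas))                      # descending omega/sigma
    B = np.searchsorted(np.cumsum(weights[order]), Th, side="right") + 1
    return order[:B]

def is_clean_pixel(value, means, sigmas, clean_idx):
    """A pixel value is clean if it lies within 2.5 sigma of any clean Gaussian."""
    return bool(np.any(np.abs(value - means[clean_idx]) < 2.5 * sigmas[clean_idx]))
```

The 2.5-sigma band follows the matching rule of Eq. (2.4); any per-pixel GMM fitting or online update (see Section 2.2) can supply the weights, means and variances.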
Virtual frames can then be constructed from clean pixels based on a frame selection map, which is a data structure of the same size as the video frame. Each element of the map corresponds to a pixel location and indicates the frame in which that pixel is clean. Specifically, the selection map is initialized with -1, and when the ith training frame is loaded, the value at location (x, y) of the selection map is calculated as

s(x, y) = \arg\min_f \{\, i - f \mid f \le i,\ I_f(x, y)\ \text{is a clean pixel} \,\}    (2.5)

where f is the index of a training frame. If no frame satisfies Eq. 2.5, s(x, y) is assigned -1. The minimum is used in Eq. 2.5 because a virtual frame is expected to be constructed from the temporally nearest clean pixels. The frame selection map is updated with the incremental loading of the training frames, and once it contains no -1, a virtual frame is constructed from the clean pixels indicated by the map. The map is then reset to -1 and the same process is repeated. Figure 2 illustrates the process of constructing a virtual frame. On average, hundreds of frames are needed to construct one virtual frame in crowded scenes.

Fig. 2 Virtual frame construction process.

Some constructed virtual frames are shown in the bottom row of Figure 1. The second row shows some manually selected frames with low crowd density. Comparing the two rows, it can be observed that the virtual frames indeed contain fewer foregrounds, which verifies the effectiveness of the proposed selection mechanism.
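The following Python sketch illustrates how a frame selection map could drive virtual frame construction (Eq. 2.5). The function and variable names are our own; `is_clean_pixel` refers to the sketch above, and for simplicity the map here stores the clean pixel value directly (updated to the most recent clean frame) instead of the frame index, which yields the same virtual frame.

```python
import numpy as np

# A minimal sketch (our own naming, not the authors' code) of the frame
# selection map and virtual frame construction of Eq. (2.5).  `frames` is an
# iterable of grayscale frames of shape (H, W); `gmm[y][x]` holds the
# (means, sigmas, clean_idx) of the per-pixel model used by is_clean_pixel().
def construct_virtual_frames(frames, gmm, is_clean_pixel):
    sel, virtual = None, None
    for i, frame in enumerate(frames):
        if sel is None:
            H, W = frame.shape
            sel = -np.ones((H, W), dtype=int)              # -1: no clean pixel seen yet
            virtual = np.zeros((H, W), dtype=frame.dtype)
        for y in range(frame.shape[0]):
            for x in range(frame.shape[1]):
                if is_clean_pixel(frame[y, x], *gmm[y][x]):
                    sel[y, x] = i                          # temporally nearest clean frame
                    virtual[y, x] = frame[y, x]            # keep its pixel value directly
        if np.all(sel >= 0):                               # map complete: emit a virtual frame
            yield virtual.copy()
            sel.fill(-1)                                   # reset the map and repeat
```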
2.2. Background Model Update

This step includes the GMM update for each pixel and the eigenbackgrounds update. The clean pixels in an input frame are selected to update the GMM models. In particular, each pixel is checked against the corresponding "clean" Gaussians; if it matches one of them, it is used to update the first matched Gaussian in a running-average style:

\bar{x}_n = (1 - \alpha)\,\bar{x}_{n-1} + \alpha\, x_n    (2.6)

\sigma_n^2 = (1 - \alpha)\,\sigma_{n-1}^2 + \alpha\,(x_n - \bar{x}_n)^2    (2.7)

where \alpha is the update rate (0.01 is used in this paper) and x_n is the selected background pixel value. Meanwhile, with the incremental loading of the input frames, the frame selection map is updated and used to construct virtual frames, as described in Section 2.1. The constructed virtual frames are used to update the eigenbackgrounds with Candid Covariance-free Incremental PCA [6], which can be formulated as an iterative procedure:

\bar{x}(n) = \frac{n-1}{n}\,\bar{x}(n-1) + \frac{1}{n}\, x(n)    (2.8)

v_i(n) = \frac{n-1-\lambda}{n}\, v_i(n-1) + \frac{1+\lambda}{n}\, u_i(n)\, u_i^T(n)\, \frac{v_i(n-1)}{\| v_i(n-1) \|}    (2.9)

u_{i+1}(n) = u_i(n) - u_i^T(n)\, \frac{v_i(n)}{\| v_i(n) \|}\, \frac{v_i(n)}{\| v_i(n) \|}    (2.10)

where \lambda is the updating rate parameter, u_i(n) is the input vector for the computation of the ith eigenvector, and v_i(n) is the estimate of the ith eigenvector scaled by its eigenvalue.

2.3. Background Reconstruction

In the classic eigenbackgrounds method, the background of the current frame is reconstructed at the frame level:

B(x) = U\,U^T (x - \bar{x}) + \bar{x}    (2.11)

where all the eigenbackgrounds are used to reconstruct the background. In crowded scenes, the reconstructed background may contain foregrounds. This can be understood by examining the imperfection of the virtual frames and the meaning of the eigenbackgrounds. In Section 2.1, virtual frames are constructed to train and update the eigenbackgrounds. However, because a pixel is determined to be clean only in a probabilistic sense, the virtual frames may contain noise and residual foreground fragments, so some trained eigenbackgrounds may contain foreground information. Specifically, the ith element u_j(i) of the jth eigenbackground corresponds to the ith image pixel. This element depicts background in some eigenbackgrounds, while in others it may depict foreground. If all the eigenbackgrounds are used to reconstruct the background, the result may therefore contain foregrounds, and it is necessary to select the best eigenbackground for each pixel when reconstructing its background value.

The absolute value of each element in an eigenbackground can be taken as a measure of scatter for the corresponding pixel. Generally, the scatter of the background is smaller than that of the foreground, because the background is stable while the foreground contains different objects. Hence, the smaller the absolute value of the element, the more likely it describes background. Background reconstruction can then be formulated with an eigenbackground selection map, each element of which indicates the index of the eigenbackground that should be used to reconstruct the background value of the corresponding pixel. The value of the ith element of the map is determined as

\phi(i) = \arg\min_j \{\, | u_j(i) | \,\}    (2.12)

The background value B(i) of the ith pixel can then be reconstructed as

B(i) = \beta_i(i)    (2.13)

\beta_i = u_{\phi(i)}\, u_{\phi(i)}^T (x - \bar{x}) + \bar{x}    (2.14)

where \beta_i is the background frame reconstructed for the ith pixel, u_{\phi(i)} is the eigenbackground selected for the ith pixel, and x is the input frame vector. Finally, background subtraction is performed by thresholding the absolute difference image between the input frame and the reconstructed background.
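As a concrete illustration of Eqs. (2.12)-(2.14), the sketch below selects, for every pixel, the eigenbackground whose element has the smallest absolute value and reconstructs the background value from that eigenbackground alone, followed by the thresholding step. The variable names (U, mean, thresh) and the vectorised indexing are our own choices, not the authors' implementation.

```python
import numpy as np

# U: (K, D) matrix with one eigenbackground per row; x, mean: flattened frames of length D.
def reconstruct_background(x, mean, U):
    phi = np.argmin(np.abs(U), axis=0)      # Eq. (2.12): per-pixel eigenbackground index
    coeffs = U @ (x - mean)                 # projection coefficient for each eigenbackground
    pix = np.arange(U.shape[1])
    # Eqs. (2.13)-(2.14): each pixel uses only its selected eigenbackground
    return U[phi, pix] * coeffs[phi] + mean

def foreground_mask(x, background, thresh=30.0):
    """Background subtraction by thresholding the absolute difference image."""
    return np.abs(x - background) > thresh
```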
3. EXPERIMENTAL RESULTS

The dataset used in the experiments is from TRECVID-SED [7], which is designed for event detection. It is one of the most challenging datasets due to its dense crowds, cluttered scenes, severe occlusions and illumination variations. The dataset is captured from 5 cameras, and we use 4 of them in our experiments, as there are few people in camera 4. No post-processing is performed in the following experiments.

Fig. 3 Subjective comparison results: (a) original frame; (b)(d) reconstructed background with C-EigenBg and PS-EigenBg respectively; (c)(e) subtraction results with C-EigenBg and PS-EigenBg respectively. Green circle: missed detections recovered by PS-EigenBg; red circles: false alarms removed by PS-EigenBg.

The first experiment is a subjective comparison between the classic eigenbackgrounds method (C-EigenBg for short), in which no selection operation is employed, and the proposed method (PS-EigenBg for short). Figure 3 shows the comparison results. There are many false alarms and missed detections in the results of C-EigenBg, because frames full of foregrounds are used as training and updating frames. Some foregrounds are thus absorbed into the eigenbackgrounds and the reconstructed background contains many foregrounds, as shown in Figure 3(b). By constructing virtual frames and selecting the best eigenbackground for each pixel, the missed detections are recovered and many false alarms are removed (the regions pointed out in Figure 3(c)).

One way to give an objective analysis is to compute the overlap between the detected regions and the regions inside labeled object contours. However, this brings two problems. First, it is labor-intensive. Second, it is more meaningful to evaluate from the object detection perspective, which is the end use of background subtraction. Therefore, we generate the ground truth by labeling bounding boxes on the foreground objects in randomly selected frames [2]. An object is detected if more than 30% of the pixels in its bounding box are classified as foreground. The ratio of the number of detected objects to the number of objects in the ground truth serves as the true positive rate, and the percentage of background pixels misclassified as foreground is used to calculate the false positive rate. The ROC curves are shown in Figure 4. The performance of PS-EigenBg clearly exceeds that of C-EigenBg, which verifies the effectiveness of our algorithm from the objective evaluation.

Fig. 4 ROC curves (true positive vs. false positive) of C-EigenBg and PS-EigenBg: (a) camera 1; (b) camera 2; (c) camera 3; (d) camera 5.

Finally, we compare PS-EigenBg with some state-of-the-art methods in the literature, including the MoG [1], codebook [3] and Bayes [4] methods. Figure 5 gives the objective comparison on camera 1. MoG has the worst performance, as it can only detect moving foreground objects. The Bayes method performs better than our method when the recall is lower than 43%. However, in video surveillance a low recall is intolerable, as too many missed foreground detections could cause disastrous accidents. Therefore, in practical applications where the recall needs to be high, the performance of our method exceeds that of the Bayes method. For the codebook method, the false positive rate is much higher than that of our method at the same true positive rate.

Fig. 5 Objective comparison (true positive vs. false positive) of MoG, Bayes, Codebook and PS-EigenBg on camera 1.

4. CONCLUSION

A selective eigenbackgrounds method for background subtraction in crowded scenes is proposed, in which virtual frames are constructed as training and updating samples for the eigenbackground model, and the eigenbackground that best depicts the background is selected for each pixel to reconstruct the background. Both subjective and objective experiments show the effectiveness of our method.

5. REFERENCES

[1] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," CVPR 1999, vol. 2, pp. 246-252.
[2] B. Klare and S. Sarkar, "Background subtraction in varying illuminations using an ensemble based on an enlarged feature set," CVPR Workshops, pp. 66-73, June 2009.
[3] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, "Real-time foreground-background segmentation using codebook model," Real-Time Imaging, Special Issue on Video Object Processing, vol. 11, no. 3, pp. 172-185, June 2005.
[4] L. Li, W. Huang, I. Y. H. Gu, and Q. Tian, "Foreground object detection from videos containing complex background," ACM Multimedia 2003, pp. 2-10.
[5] N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. PAMI, vol. 22, no. 8, pp. 831-843, 2000.
[6] J. Weng, Y. Zhang, and W. Hwang, "Candid covariance-free incremental principal component analysis," IEEE Trans. PAMI, vol. 25, no. 8, pp. 1034-1040, 2003.
[7] TRECVid 2010 Evaluation for Surveillance Event Detection. http://www.itl.nist.gov/iad/mig//tests/trecvid/2010/