FOREGROUND SEGMENTATION FOR STATIC VIDEO VIA MULTI-CORE AND MULTIMODAL GRAPH-CUT

Lun-Yu Chang and Winston H. Hsu
National Taiwan University, Taipei, Taiwan
felixc977@gmail.com, winston@csie.ntu.edu.tw

ABSTRACT

We propose a new foreground detection method for static cameras. It merges multiple modalities into the graph-cut energy function and yields much better results than conventional methods. We consider not only the color appearance of the frame but also a spatial constraint on the foreground object, and therefore obtain a more precise foreground shape. In addition, we divide the graph-cut problem into several subproblems to reduce computing time, and solve them simultaneously on multiple cores. With these techniques, real-time performance is easily achieved. Experiments on real videos demonstrate our improvement over conventional foreground detection methods.

Index Terms— graph cut, foreground detection, foreground segmentation, surveillance

1. INTRODUCTION

Foreground segmentation is the problem of segmenting moving objects from their backgrounds in videos, as shown in Figure 6. It serves as an important cornerstone in many video-related application systems such as video surveillance, motion capture, and human-computer interaction. For example, in most surveillance systems (e.g., [7][10]), the success of object tracking and semantic event detection relies heavily on the quality of the shapes (or silhouettes) of the foreground objects.

A surveillance video is mostly captured by a static camera with fixed position and parameters. Based on this observation, most foreground detection methods focus on building background models; any drastic change from the background model is then considered to be foreground. Given a sequence of frames, we can use statistical information about changes in pixel values to model the background with either a single Gaussian model [1] or a Gaussian Mixture Model (GMM) [2]. The foreground objects can then be extracted by thresholding the difference between the current frame and the background model pixel by pixel. Generally, GMM [2] is more robust against non-static backgrounds (e.g., the moving shadow of a building or swaying branches).

Although this approach to foreground detection is straightforward, it has several deficiencies. Such methods work on a pixel-wise basis without considering the continuity between each pixel and its neighbors, leading to false foreground segments and broken objects in the detection results. These errors are usually caused by sudden lighting changes or other noise in the video.
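To make this conventional pixel-wise baseline concrete, the following is a minimal sketch (our illustration, not the authors' implementation) using OpenCV's built-in adaptive-GMM background subtractor (MOG2), which follows the same idea as [2]; the video path and threshold value are illustrative assumptions.

import cv2

# Minimal sketch of the conventional pixel-wise GMM baseline (cf. [2]).
# "surveillance.avi" is only an example path.
cap = cv2.VideoCapture("surveillance.avi")
# MOG2 maintains an adaptive per-pixel Gaussian mixture background model.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # apply() updates the model and returns a label map:
    # 255 = foreground, 127 = shadow, 0 = background.
    mask = subtractor.apply(frame)
    # Keep confident foreground only (drop shadow pixels at 127).
    _, fg = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    cv2.imshow("foreground", fg)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break
cap.release()

Because each pixel is thresholded independently, this baseline exhibits exactly the broken silhouettes described above, which motivates the graph-cut formulation that follows.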
Nowadays, the graph-cut algorithm [3] (cf. Section 2) is a promising way to address the above issues in foreground detection (or segmentation). For example, the authors of [4] take the relationship between neighboring pixels into account, but the energy function they use is quite ad hoc, so the overall improvement is limited. They use the difference between the current frame and the background model as the data term of the graph cut, and a constant as the smoothness term. Although this improves over conventional methods, some problems remain unsolved. First, considering the background modality alone is too weak for actual surveillance video, and their smoothness term cannot describe the relationship between each pixel and its neighbors. Second, graph cut is time-consuming, whereas a surveillance system should work in real time. Sun et al. [5] propose a refined method based on Howe's work [4]. They use Stauffer's adaptive background GMM [2] as the first modality and also consider a shadow-removal term that tries to reduce detection errors due to changes in lighting conditions. Nevertheless, if a moving foreground object looks similar to the background, Sun's method still fails to generate precise detection results. Moreover, even though they accelerate graph cut to about 12 fps for a frame size of 320x240 on a 2.66 GHz CPU, we believe there is still room for improvement.

In this work, we propose a two-phase strategy to improve the efficiency and accuracy of foreground detection. In the first phase, we obtain rough foreground segmentation results and identify candidate areas that may contain moving objects. In the second phase, we apply graph cut to the candidate areas identified in the first phase. In contrast to applying graph cut to the whole video frame, our approach significantly shortens the computing time. Moreover, many cues are informative for foreground detection, for example appearance, tracking, and shadow removal. We also propose a framework that combines these cues into the graph energy functions, yielding a multimodal graph cut that is more powerful than its predecessors.

Figure 1. The system framework: we divide foreground detection into several smaller tasks and apply graph cut (cf. Section 3.2) to them simultaneously on multiple cores, instead of feeding the whole frame into graph-cut processing.

We list our main contributions in this work:
– Improving foreground segmentation with multimodal graph cut.
– Considering the spatial continuity of the object: we construct prior constraint terms to model the probability of the object distribution.
– Improving efficiency by running graph cut on object areas instead of the whole video frame.
– Divide and conquer for the segmentation of foreground objects: we apply graph cut to the bounding box of each moving object in parallel on a multi-core CPU, leading to faster computation than before.
– Constructing the adaptive background GMM in parallel on a multi-core CPU.

Different from and complementary to [5], we focus on handling false negatives (e.g., a foreground object that is similar to the background) rather than false positives (e.g., shadow, reflection, etc.). We aim to detect the complete shape of foreground objects instead of broken parts. Because foreground detection is a preprocessing step for many computer vision applications, obtaining the clear shape of foreground objects in real time is essential.

2. GRAPH CUT FOR FOREGROUND DETECTION

Foreground detection is essentially a segmentation problem (i.e., separating foreground and background pixels). Advanced approaches use graph-cut methods [3][8], where a specialized graph is constructed for the energy function to be minimized such that the minimum cut on the graph also minimizes the energy. The task can be cast as a binary labeling problem over an undirected graph G = (V, E), composed of a vertex set V (representing the pixels), an edge set E (between neighboring pixels, with four or eight connections), and a finite label set for the nodes (pixels). Foreground segmentation needs only two labels, x_v ∈ {'fg', 'bg'} (or {1, 0}), denoting foreground and background respectively.

The labeling problem is essentially finding a proper way to cut the graph (pixels) into two sets – 'fg' and 'bg' – and the quality of the solution relies heavily on the energy function being optimized. The energy function E(x) describes the energy between each vertex and label; in the simplest interesting case it can be written in terms of unary (data) and pairwise (smoothness) energy terms as

E(x) = \sum_{v \in V} H(x_v) + \sum_{(u,v) \in E} G(x_u, x_v)    (1)

The energy function directly influences which label should be applied to each vertex. The data term describes how likely a vertex is to be 'fg' or 'bg'; the smoothness term describes the relation between each pixel and its neighbors: for example, if two pixels have similar appearance, they have a high probability of receiving the same label. A graph-cut algorithm then finds the labeling that minimizes this energy. Once the graph model and the energy function are constructed, we apply the min-cut algorithm [3][8], which has been shown to be effective and efficient for this problem. We then obtain a binary label map as the silhouette result (cf. Figure 6).
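To make Eq. (1) concrete, here is a minimal sketch (our illustration, not the authors' code) of the binary labeling step using the PyMaxflow wrapper of the min-cut solver from [3]; the unary cost maps and the constant smoothness weight (as in the baseline [4]) are assumed inputs.

import maxflow  # PyMaxflow, a Python wrapper of the min-cut solver in [3]

def segment(cost_fg, cost_bg, pairwise_weight=1.0):
    """Minimize Eq. (1): sum of data terms H plus pairwise terms G.

    cost_fg[i, j]: energy of labeling pixel (i, j) as 'fg', i.e. H(x_v = fg).
    cost_bg[i, j]: energy of labeling pixel (i, j) as 'bg', i.e. H(x_v = bg).
    pairwise_weight: constant smoothness term, as in the baseline [4].
    """
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(cost_fg.shape)   # one vertex per pixel
    g.add_grid_edges(nodes, pairwise_weight)  # 4-connected smoothness edges
    # Terminal edges encode the data term: a node ending in the sink
    # segment pays the source capacity (cost_fg), and vice versa.
    g.add_grid_tedges(nodes, cost_fg, cost_bg)
    g.maxflow()                               # min-cut = minimum energy
    return g.get_grid_segments(nodes)         # True where labeled 'fg'

The returned boolean map is exactly the binary label map (silhouette) discussed above; MMCut (Section 3) changes what goes into the cost maps and the pairwise weights, not this minimization step.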
3. PROPOSED METHOD

We propose a framework for efficient foreground segmentation, especially targeting the situation where the object is similar to the background (cf. Figure 6). Unlike prior work, our method leverages multiple modalities to construct an effective energy function, and divides the detection into atomic subtasks over multi-core platforms for the efficiency required by surveillance systems. The system diagram is shown in Figure 1.

3.1. Adaptive Background GMM

Based on Stauffer's method [2], we first construct the background model with a pixelwise adaptive GMM, where each pixel in the frame is independent of all other pixels. The Gaussian models can be learned and updated online efficiently, frame by frame [2]. Once we have the pixelwise GMM for the static video camera, a pixel v (a vertex in the graph) of a new frame can be measured for foreground ('fg') likelihood by the score

score(v) = \min_k \frac{|I(v) - \mu_k(v)|}{\sigma_k(v)}    (2)

where I(v) denotes the intensity of vertex v, and μ_k and σ_k are the mean and standard deviation of the k-th Gaussian component; i.e., we take the component with the least distance (normalized by its standard deviation) among all components. This score measures how strongly the vertex deviates from the background model: the larger the score, the less likely the vertex belongs to the background. A GMM likelihood score map is illustrated in Figure 2-b. By thresholding the score map, we obtain a rough foreground segmentation result.

Figure 2. MMCut utilizes multiple modalities – (a) appearance, (b) foreground likelihood, (c) prior size constraint – in the graph-cut energy function; applying the min-cut algorithm to segment the foreground yields much better silhouette results than using any single modality alone.
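As a sketch of Eq. (2) (our illustration; the mixture bookkeeping of [2] is assumed to be maintained elsewhere), the score map can be computed densely with numpy:

import numpy as np

def gmm_score(frame, mu, sigma):
    """Eq. (2): per-pixel normalized distance to the nearest Gaussian.

    frame: (H, W) grayscale intensities.
    mu, sigma: (K, H, W) per-pixel means / standard deviations of the
    K-component adaptive background GMM (maintained as in [2]).
    """
    # Normalized distance of each pixel to each of its K components.
    dist = np.abs(frame[None, :, :] - mu) / sigma
    return dist.min(axis=0)  # score(v); large = unlikely background

# Thresholding the score map gives the rough segmentation of Section 3.1;
# the threshold value below is an illustrative assumption.
# rough_fg = gmm_score(frame, mu, sigma) > 2.5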
3.2. Multimodality Graph Cut – MMCut

In traditional foreground detection models, each pixel is treated independently of the other pixels in the frame. As shown in Figure 2-b, once the background is very similar in appearance to the moving objects, a pixelwise algorithm produces apparent false negatives within the object silhouette. As illustrated in Figure 2-c, we can remedy the problem by enforcing the spatial continuity of the moving object. Beyond the continuity property, there are more promising modalities (e.g., color appearance, foreground likelihood, object spatial continuity, tracking estimation, etc.) to exploit for effective foreground detection. We therefore propose to leverage multiple cues to boost foreground detection, and introduce the multimodality graph cut (MMCut), which composes the energy functions (in the data and smoothness terms) from multiple modalities.

3.3. Energy Functions

In the MMCut energy function, the data and smoothness terms are defined as

H(x_v) = (1 - \tau) L_G + \tau L_A + \delta L_C    (3)

G(x_u, x_v) = (1 - \zeta) \varphi_G + \zeta \varphi_A    (4)

where τ, δ, and ζ are parameters that linearly combine the energy functions derived from multiple modalities; L_G, L_A, and L_C are derived from the adaptive GMM likelihood, the appearance similarity, and the (spatial) continuity constraint, respectively, and are explained in the following subsections. Eq. (4) is the smoothness term, linearly composed of the GMM term φ_G and the appearance term φ_A. The sensitivity of the parameters and the impact of the different modalities are investigated thoroughly in the experiments (Section 4).

3.3.1. Adaptive GMM Likelihood

Based on Stauffer's method [2], we use the GMM likelihood score as one of the data terms in Eq. (3). As shown in Figure 6 and Figure 2-b, the GMM likelihood provides an estimated cue for the foreground, though troubled by certain errors. Modified from Eq. (2), we define the GMM likelihood energy function L_G as

L_G(x_v) = \begin{cases} const - 1/\mathrm{score}(v), & \text{if } x_v = \text{'bg'} \\ 1/\mathrm{score}(v), & \text{if } x_v = \text{'fg'} \end{cases}    (5)

In this way, pixels with higher foreground likelihood tend to have lower energy when labeled 'fg'. The constant parameter corresponds to a uniform distribution over the appearance of foreground objects, as in [5]; here we set const = 1. Our experiments show the segmentation results are not sensitive to this constant.

3.3.2. Appearance Likelihood

Appearance information (color, texture, etc.) has been shown to be effective in interactive segmentation such as Lazy Snapping [8], where the foreground is roughly sketched by users and background and foreground models are then constructed for the graph-cut algorithm. Similarly, we are interested in whether such appearance cues can help foreground segmentation. After thresholding the GMM likelihood score map, we have a rough bounding box for each object candidate, as in Figure 2-b. To compute the appearance likelihood L_A, we use K-means clustering as in [8]. We regard the exterior of the box as known background and cluster it into background clusters K^B; likewise, we cluster the known foreground inside the box into foreground clusters K^F. For each vertex v we compute the minimum distances from its color C(v) to the foreground and background clusters:

d^F(v) = \min_n \| C(v) - K_n^F \|,   d^B(v) = \min_n \| C(v) - K_n^B \|

Favoring pixels similar to the initial foreground model as foreground candidates, we define the appearance likelihood term as

L_A(x_v) = \begin{cases} d^B(v) / (d^F(v) + d^B(v)), & \text{if } x_v = \text{'bg'} \\ d^F(v) / (d^F(v) + d^B(v)), & \text{if } x_v = \text{'fg'} \end{cases}    (6)

3.3.3. Spatial Continuity Likelihood

We have a rough bounding box for the candidate object (Figure 2-b). Naturally, for moving objects (e.g., cars, people, etc.), the object is complete within the box. Hence we assume that the center part of the bounding box has a higher probability of being detected as foreground than the other parts, and formulate the spatial continuity constraint by a Gaussian mask centered at the object center (Figure 2-c). We then define L_C for Eq. (3) as

L_C((i,j) \mid x_v) = \begin{cases} 1 - \log P_C(i,j) / \log P_C(0,0), & \text{if } x_v = \text{'bg'} \\ \log P_C(i,j) / \log P_C(0,0), & \text{if } x_v = \text{'fg'} \end{cases}    (7)

Here we use a 2D Gaussian density function P_C to model the likelihood distribution, with the center of the box as the Gaussian mean; (i, j) is the position relative to the top-left corner (i.e., (0, 0)) of the bounding box the vertex belongs to. We use 1/6 of the height and width of the bounding box as the standard deviations of the vertical and horizontal axes of the 2D Gaussian mask (Figure 2-c). This ratio is determined in a cross-validation manner.
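To show how these pieces fit together, the sketch below (our own illustration; the weights are assumptions whose sensitivity is studied in Section 4, and the appearance term L_A is omitted for brevity) builds the per-label data-term maps of Eq. (3) for one candidate bounding box from the GMM likelihood (Eq. (5)) and the spatial continuity prior (Eq. (7)):

import numpy as np

def data_terms(score, tau=0.0, delta=1.5, const=1.0):
    """Data term H of Eq. (3) for one candidate box (L_A omitted here).

    score: (h, w) GMM likelihood scores (Eq. (2)) inside the bounding box.
    tau, delta: illustrative weights (cf. Section 4).
    Returns (H_fg, H_bg) energy maps for the two labels.
    """
    h, w = score.shape
    eps = 1e-6

    # Eq. (5): GMM likelihood energies.
    Lg_fg = 1.0 / (score + eps)
    Lg_bg = const - Lg_fg

    # Eq. (7): 2D Gaussian prior centered at the box center, with
    # sigma = 1/6 of the box height/width; log-density up to a constant.
    i = np.arange(h)[:, None] - (h - 1) / 2.0
    j = np.arange(w)[None, :] - (w - 1) / 2.0
    logP = -0.5 * ((i / (h / 6.0)) ** 2 + (j / (w / 6.0)) ** 2)
    logP0 = -0.5 * ((((h - 1) / 2.0) / (h / 6.0)) ** 2
                    + (((w - 1) / 2.0) / (w / 6.0)) ** 2)
    Lc_fg = logP / logP0        # 0 at the center, 1 at the corner
    Lc_bg = 1.0 - Lc_fg

    # Eq. (3), with the (1 - tau) weight applied to L_G only here.
    H_fg = (1 - tau) * Lg_fg + delta * Lc_fg
    H_bg = (1 - tau) * Lg_bg + delta * Lc_bg
    return H_fg, H_bg

These maps can be fed directly to the segment() sketch of Section 2 as (cost_fg, cost_bg).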
3.3.4. Smoothness Term

The smoothness term constrains the similarity of neighboring pixels: pixels with similar appearance should receive similar labels (0 or 1). In [4][5], only a single modality was considered for the smoothness term. We are interested in whether multiple modalities can help construct a better smoothness term for foreground segmentation. We define our smoothness terms as

\varphi_G(x_u, x_v) = \frac{1}{\Delta_{uv}} |x_u - x_v|    (8)

\varphi_A(x_u, x_v) = \frac{1}{\Omega_{uv}} |x_u - x_v|    (9)

where Ω_uv is the distance between color appearances and Δ_uv is the distance between GMM likelihood scores:

\Omega_{uv} = \| C(u) - C(v) \|,   \Delta_{uv} = (\mathrm{score}(u) - \mathrm{score}(v))^2

3.4. Divide and Conquer for Detection Efficiency

For static video cameras, only moving objects are semantically meaningful. We devise three strategies to speed up the refinement of segmentation results. First, MMCut is applied only to the small regions of the candidate moving objects (e.g., t1, t2, and t3 in Figure 1) instead of the entire video frame, as prior works do [5][6][8]; calculating the energy functions generally requires time quadratic in the region size. Such candidate objects can be produced efficiently by the GMM foreground detection algorithm. Second, we regard each candidate moving object as an independent subtask and apply MMCut to each one, as illustrated in Figure 1; we leverage the emerging multi-core framework to refine the candidate objects in parallel. For example, with a quad-core CPU we can process at most four moving objects at the same time. Third, in the GMM background modeling (Section 3.1), each pixel is independent of the others, so constructing and testing the pixelwise GMM can further benefit from multiple cores. The second and third strategies are realized with OpenMP (Open Multi-Processing), an API that supports multi-platform multiprocessing programming; see http://openmp.org/ . We demonstrate the significant efficiency improvement in the experiments (Section 4).
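The authors realize these strategies with OpenMP in their system; purely to illustrate the divide-and-conquer idea, an analogous per-object parallelization can be sketched in Python (the run_mmcut callable and the pool size are hypothetical):

from concurrent.futures import ThreadPoolExecutor

def refine_objects(frame, boxes, run_mmcut, workers=4):
    """Divide and conquer (Section 3.4): refine each candidate box in parallel.

    boxes: (x, y, w, h) bounding boxes from the rough GMM segmentation.
    run_mmcut: callable applying MMCut to one cropped region (hypothetical
    here; e.g., data_terms() from Section 3.3 plus a min-cut solver).
    """
    crops = [frame[y:y + h, x:x + w] for (x, y, w, h) in boxes]
    # One worker per core mirrors the "at most four objects at a time"
    # behavior of a quad-core CPU described above.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        masks = list(pool.map(run_mmcut, crops))
    return list(zip(boxes, masks))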
4. EXPERIMENTS

We evaluate foreground segmentation performance on the foreground detection benchmark from IPPR2006 (three 24-bit color videos at 320x240 resolution) [9]. We name the three video sequences "Indoor1," "Indoor2," and "Outdoor1." The scene of "Indoor1" is a lobby under poor lighting, "Indoor2" is a hallway with strong sunlight, and "Outdoor1" is a road with moving vehicles and people. The performance metric of the benchmark is "error per frame," defined as (# stands for 'number of pixels'):

Error per frame = # of false positives + # of false negatives
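For reference, this metric is straightforward to compute from a predicted binary mask and the ground-truth mask; a minimal sketch:

import numpy as np

def error_per_frame(pred, truth):
    """Benchmark metric: false-positive pixels plus false-negative pixels.

    pred, truth: boolean (H, W) masks, True = foreground.
    """
    false_pos = np.logical_and(pred, ~truth).sum()
    false_neg = np.logical_and(~pred, truth).sum()
    return int(false_pos + false_neg)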
Since we want to test whether the prior constraint benefits the detection results, we selected two important clips from dataset1 that contain objects that are hard to detect precisely (clip1 contains frames no. 53~87; clip2 contains frames no. 263~299); we name them "Broken Object1" and "Broken Object2." The results after MMCut processing are shown in Figure 3, where δ is the weight of the prior constraint in the likelihood term. We can see that the prior constraint benefits the detection results significantly, and we obtain the lowest error rate at δ = 1.5. Figure 6 shows some silhouette results of applying MMCut and two other methods to these two clips.

Figure 3. δ is the weight of L_C. The error rate is reduced significantly when we take the prior constraint into account; the lowest error rate occurs at δ = 1.5.

The GMM likelihood is close to the color appearance, but some difference still exists between them, so we test which ratio between them yields the best result. As shown in Figure 4, the best result occurs when considering only the GMM likelihood, with no notable changes on dataset1 and dataset2. Since using the GMM likelihood alone is even better than combining these two modalities, and computing the appearance likelihood costs much more time than the GMM (because of the K-means clustering), we can remove this modality from the data term and instead involve other modalities (e.g., tracking, shadow elimination, etc.). Our future work will focus on this goal.

Figure 4. τ is the ratio between the GMM likelihood and the appearance term in the data term; τ = 0 means considering only the GMM likelihood. The best result appears at τ = 0.

We have also tested changing the ratio between φ_G and φ_A in the smoothness term. The result does improve, even though the improvement is slight (about 0.1%), so there is a real benefit to combining these two modalities in the smoothness term.

We further compare the proposed method with three rivals: the GMM approach [2], GMM with post-processing by morphological operators (open and close), and the best detection result in the benchmark [9]. From Figure 5, we can see that the proposed method outperforms the rival methods in terms of error per frame (the lower, the better). The segmentation cues from multiple modalities (via the energy functions) and the rigorous treatment of pairwise pixels (via graph cut) significantly boost the detection performance. We observe that most of the contribution comes from the consideration of spatial continuity, which can be verified in Figure 6. Meanwhile, our method is superior to the best benchmark method on dataset1 and dataset3. Note that our proposed method is much more efficient: it processes 30 frames per second (fps), compared with 0.04 fps for the best benchmark method [9] and 15 fps in [5]. The lighting changes in dataset2, together with our not yet considering shadow removal, cause our method to perform worse than the best benchmark method there; we believe the performance can be further boosted by considering shadow-removal cues [5].

Figure 5. MMCut performs significantly better than the rival methods and even the best result in the benchmark. Our method is also much more efficient than the best benchmark method (30 fps vs. 0.04 fps).

Figure 6. Example results of three methods. Each subfigure shows, from left to right, the original frame, Stauffer's method, Stauffer's method after morphological operations, and our method. The false-negative parts of the foreground objects (where the foreground looks like the background) have been successfully reduced. Our algorithm is also more efficient than other graph-cut-based foreground detection methods [4][5][6].

The benefits in detection efficiency of our "divide and conquer" strategies on multi-core platforms are demonstrated in Table 1. The proposed method improves the detection frame rate by up to 67% on the quad-core platform.

CPU                        Threads    Frames per sec.
Intel Core2 Duo 2.16GHz    1          31
Intel Core2 Duo 2.16GHz    2          43
Intel Core2 Quad 2.4GHz    1          36
Intel Core2 Quad 2.4GHz    4          60

Table 1. The proposed method improves efficiency by 28% on the dual-core system and 67% on the quad-core system.

5. CONCLUSIONS

In this work, we address the detection problem where the foreground object is similar to the background. To obtain more precise segmentation results, we apply MMCut to the output of Stauffer's method to refine the segmentation. For a more efficient detection system, we feed the moving-object boxes into MMCut instead of the whole frame, and use multiple cores for updating the adaptive GMM and running MMCut. We aim to provide a more flexible, efficient, and automatic foreground segmentation method; the experiments show that it reduces the false-negative problem and speeds the whole system up to real time (60 fps), as shown in Section 4. In future work, we would like to involve more useful modalities and refine the bounding-box extraction method (e.g., with tracking).

6. REFERENCES

[1] S. Jabri, Z. Duric, H. Wechsler, and A. Rosenfeld, "Detection and location of people in video images using adaptive fusion of color and edge information," Proc. ICPR, 2000.
[2] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," Proc. CVPR, 1999.
[3] Y. Boykov and V. Kolmogorov, "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision," IEEE TPAMI, vol. 26, no. 9, pp. 1124–1137, 2004.
[4] N. R. Howe, "Better foreground segmentation through graph cuts," Technical report, Smith College, 2004.
[5] Y. Sun, B. Yuan, Z. Miao, and C. Wan, "Better foreground segmentation for static cameras via new energy form and dynamic graph-cut," Proc. ICPR, 2006.
[6] K. Takahashi and T. Mori, "Foreground segmentation with single reference frame using iterative likelihood estimation and graph-cut," Proc. ICME, 2008.
[7] Y. Wang, K. Huang, and T. Tan, "Human activity recognition based on R transform," Proc. CVPR, 2007.
[8] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, "Lazy Snapping," Proc. SIGGRAPH, 2004.
[9] The Chinese Image Processing and Pattern Recognition Society (IPPR), http://www.ippr.org.tw/.
[10] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-time surveillance of people and their activities," IEEE TPAMI, 2000.