FOREGROUND SEGMENTATION FOR STATIC VIDEO VIA MULTI-CORE AND MULTIMODAL GRAPH-CUT

Lun-Yu Chang and Winston H. Hsu
National Taiwan University, Taipei, Taiwan
felixc977@gmail.com, winston@csie.ntu.edu.tw

ABSTRACT

We propose a new foreground detection method for static cameras. It merges multiple modalities into an MRF energy function and yields considerably better results than conventional methods. We consider not only the color appearance of the frame but also a spatial constraint on the foreground object, which produces a more precise foreground shape. In addition, we divide the MRF problem into several subproblems to reduce computing time, and solve the subproblems simultaneously on multiple cores. With these techniques, real-time performance is easily achieved. Experiments on real videos demonstrate our improvement over conventional foreground detection methods.

Index Terms— graph cut, foreground detection, surveillance

1. INTRODUCTION

Good foreground segmentation is an essential pre-processing step for further analysis of images and videos, such as surveillance systems, motion capture, and human-computer interaction. High-quality segmentation greatly benefits subsequent processing and influences the overall performance of the whole system. Long-standing research has shown that useful properties of static cameras — videos taken from cameras with fixed position and parameters — can be exploited to separate foreground from background. Simple background subtraction is the most straightforward method for segmenting moving objects in such videos: given a sequence of frames, foreground objects are extracted by thresholding, pixel by pixel, the difference between the current frame and the previous frame.
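The pixelwise thresholding just described can be sketched as follows (a minimal NumPy sketch; the function name and the threshold value are illustrative, not from the paper):

```python
import numpy as np

def frame_difference_mask(curr, prev, thresh=25):
    """Label a pixel as foreground when the absolute difference
    between consecutive frames exceeds a fixed threshold."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > thresh

# A 4x4 toy frame pair: a bright 2x2 "object" appears in the second frame.
prev = np.zeros((4, 4), dtype=np.uint8)
curr = prev.copy()
curr[1:3, 1:3] = 200
mask = frame_difference_mask(curr, prev)   # True exactly on the 2x2 patch
```

As the next paragraph notes, such a fixed threshold handles neither gradual background change nor noise, which is what statistical background models address.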
To reflect temporal change of the background, one can use a statistical background model, as proposed successively in methods such as the single Gaussian [1] and the Gaussian Mixture Model (GMM) [2], which differ in the statistical features used to represent the background. Rather than a simple thresholding operation, the foreground is usually determined probabilistically. Although these improvements perform much better than simple background subtraction, especially under non-static backgrounds (e.g., the shadow of a building or swaying branches), they still operate pixelwise without considering the continuity between each pixel and its neighbors, producing false foreground blobs and holes in detected objects caused by camera noise or sudden light changes. We refer to all these works as conventional background subtraction methods.

In [4], Howe et al. take the relationship between neighboring pixels into account via graph cut [3]. They construct a graph incorporating the differences measured between the current frame and the background model, and model the connectivity of the pixels in the image, allowing each pixel to affect those in its local neighborhood. Finally, the min-cut algorithm segments the graph and actually separates the moving objects from the background. Although this method performs well on foreground segmentation, some problems remain unsolved. First, how should the energy functions be defined? Second, graph cut is time-consuming, yet a surveillance system should run in real time. Static-camera video, however, offers several advantages: the scene is static, so we only need to attend to the moving objects; a background model is easily generated from a sequence of frames; and many modalities can help extract the moving objects (e.g., appearance, tracking, shadow removal).
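A per-pixel statistical background model of the kind in [1] can be sketched as follows (a minimal single-Gaussian sketch in NumPy; the class name, learning rate, and deviation threshold are illustrative, and a full GMM as in [2] would maintain K such components per pixel with weights):

```python
import numpy as np

class RunningGaussianBackground:
    """Per-pixel single-Gaussian background model: each pixel keeps a
    running mean and variance, and a pixel is foreground when it lies
    more than `k` standard deviations from its mean."""
    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = first_frame.astype(np.float64)
        self.var = np.full(first_frame.shape, 100.0)  # initial variance guess
        self.alpha, self.k = alpha, k

    def apply(self, frame):
        frame = frame.astype(np.float64)
        d = frame - self.mean
        fg = np.abs(d) > self.k * np.sqrt(self.var)
        # update the model only where the pixel looks like background
        bg = ~fg
        self.mean[bg] += self.alpha * d[bg]
        self.var[bg] += self.alpha * (d[bg] ** 2 - self.var[bg])
        return fg
```

Because each pixel is updated independently, this kind of model exhibits exactly the blob-and-hole artifacts discussed above when the foreground resembles the background.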
If we unite those elements in the MRF energy functions, we obtain a more powerful foreground detector. In this work, we follow the formulation in which foreground segmentation corresponds to the global optimum of an energy function derived from a Markov Random Field (MRF) defined on a graph. Compared with [4][5], our main contributions are:
• Improving foreground segmentation with multi-modality fusion energy functions.
• Considering the spatial continuity of an object by constructing a prior constraint term that models the probability distribution of the object.
• Putting only the moving objects, instead of the whole frame, into the MRF, reducing the problem size.
• Dividing and conquering the MRF problem: we apply MMCut to each moving-object bounding box in parallel on multiple cores, leading to much more accurate results than before.

With these novelties, our method segments the foreground efficiently. Unlike [5], we focus on handling false negative problems (e.g., a foreground object similar to the background) rather than false positive problems (e.g., shadow, reflection): we aim for a precise object shape rather than an object with broken parts in it. Moreover, even though MMCut uses an MRF to solve the problem, the work can still be done in real time. Because foreground detection is a preliminary step of much computer vision research, obtaining the clear shape of foreground objects in real time is essential. For example, a more precise foreground object yields better performance in human activity recognition via the R-Transform [7].

2. MARKOV RANDOM FIELDS

In this section we provide a general overview of Markov Random Fields (MRFs) and introduce the notation used in the paper.
An MRF consists of an undirected graph G = (V, E), where V is a finite set of vertices and E ⊂ V × V is a set of edges; a finite set L = {l1, ..., ln} of labels; and a probability distribution P on the space X = L^V of label assignments, such that an element x of X, called a configuration of the MRF, is a map assigning a label xv in L to each vertex v. P(x) is a Gibbs distribution relative to G,

P(x) ∝ e^(−E(x)),

where, in the simplest interesting case, E(x) can be written in terms of unary and pairwise energy terms as

E(x) = Σ_{v∈V} h(xv) + Σ_{(u,v)∈E} g(xu, xv)    (1)

Inference on the optimal labels (i.e., the best segmentation) of the MRF is then an energy minimization problem.

In the context of the foreground segmentation problem, V corresponds to the set of all pixels in the current frame, E contains all the links between neighboring pixels in the 4-neighborhood sense, and the set L comprises two labels ('fg', 'bg') indicating whether a pixel belongs to the foreground or the background. As each pixel v has a labelling xv ∈ L, every configuration x of such an MRF defines a segmentation. Let D denote the set of observed color values in the current frame. Taking a Bayesian perspective, we wish to find the best configuration x* (i.e., the optimal labels for the pixels in the current frame) that maximizes the posterior probability P(x|D); in other words, we solve the MAP-MRF problem. This can be done by finding the configuration with the minimum energy:

x* = arg min_x E(x)

3. PROPOSED METHOD

We propose a new method for real-time foreground detection, aimed especially at the situation in which an object is similar to the background. As mentioned in Section 2, applying graph cut has already been reported in [5], which combines a background likelihood term and a shadow elimination term in its energy function. That method is good at resolving false positive problems (e.g., shadow, reflection) but not false negative problems (e.g., a foreground object similar to the background). Besides, MRF inference is inefficient over numerous frames; reducing computing time and exploiting the properties of static cameras are important problems. Our method therefore uses multiple modalities in the energy function and an efficient algorithm to accelerate the min cut on the MRF. We first introduce the main idea of MMCut in Section 3.1, then describe the MRF energy functions in Section 3.2, and finally show how to divide and conquer the MRF problems and accelerate them with multiple cores in Section 3.3. Note that the actual processing flow differs from this order of presentation.

3.1. MMCut

Traditional foreground detection methods often use a pixelwise background model in which each pixel is independent of the other pixels in the frame. Ignoring the spatial continuity of a moving object, however, does not match reality. MMCut composes the energy function from multiple modalities (e.g., color appearance, GMM likelihood, prior constraints) and applies the min-cut algorithm to solve it with our efficient methods. We use an MRF to model the relationship between nodes, so we only need to fill in the energy function. There is much research on foreground detection, and some methods perform well with a single modality; by using multiple modalities at the same time, we can obtain a better detection result than other methods.

Figure 1. MMCut uses multiple modalities (e.g., appearance, GMM likelihood, prior constraints) as the MRF energy function, then applies the min-cut algorithm to solve the problem. The resulting silhouette is better than that obtained with any single modality. (Panels (a)–(c).)

3.2. Energy in MMCut

We first define the data term and the smoothness term.
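For concreteness, the unary-plus-pairwise MRF energy E(x) defined above can be evaluated for a binary labelling on a pixel grid as follows (a minimal NumPy sketch with a Potts-style pairwise term; the function name and the β weight are illustrative, and minimizing this energy is what the min-cut step does):

```python
import numpy as np

def mrf_energy(labels, unary, beta=1.0):
    """Total MRF energy for a binary labelling on an HxW pixel grid:
    a unary (data) term per pixel plus a Potts-style pairwise term
    that penalises label disagreement between 4-neighbours."""
    # unary has shape (H, W, 2); unary[i, j, l] = h(x_v = l) at pixel (i, j)
    h, w = labels.shape
    data = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # pairwise term over horizontal and vertical neighbour pairs
    smooth = beta * ((labels[:, 1:] != labels[:, :-1]).sum()
                     + (labels[1:, :] != labels[:-1, :]).sum())
    return data + smooth
```

A graph-cut solver searches over all 2^(HW) labellings for the one minimizing this quantity, which brute force could not do in real time.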
We assume an energy of the form:

H(xv) = (1 − τ) LG + τ LA + δ LC    (2)
G(xu, xv) = (1 − ζ) φG + ζ φA    (3)

where τ, δ, and ζ are fixed parameters, and LG, LA, and LC denote the adaptive GMM likelihood, appearance, and prior constraint terms, respectively. Eq. (3) is the smoothness term, composed of a GMM term φG and an appearance term φA. Below we define each likelihood term in Eqs. (2) and (3).

Adaptive GMM Likelihood. Following Stauffer's method [2], we use Gaussian mixture models as our background model; a pixelwise GMM is an effective model for estimating the variations of each pixel. Assuming K mixture components, we define the GMM likelihood term as:

LG(Iv, xv) = { 1/score(I(v)),          if xv = 'bg'
             { const − 1/score(I(v)),  if xv = 'fg'    (4)

where score(I(v)) denotes the GMM likelihood score, which we assume to be of the form:

score(I(v)) = 1 / min_k [ |I(v) − μk(v)| / σk(v) ]

The mean μk, variance σk, and weight wk parameters of the k-th Gaussian component are learned and updated online as each new frame arrives. In this way, low energy is guaranteed no matter how regularly the background changes. The constant parameter corresponds to a uniform distribution over the appearance of foreground objects, as in [5].

Appearance Likelihood. After thresholding the GMM likelihood score map and extracting connected components from it, we obtain a rough bounding box of each moving object, as in Figure 1(b). To compute the appearance likelihood, we use the K-means method of [8]. First, we regard the exterior of the box as known background and group it into a background cluster set KB; second, we group the known foreground inside the box into a foreground cluster set KF.
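The GMM likelihood term of Eq. (4) can be sketched as follows (a minimal NumPy sketch; the `const` value is an illustrative placeholder for the uniform foreground term, and the small guard against division by zero is our addition):

```python
import numpy as np

def gmm_score(intensity, mu, sigma):
    """score(I(v)): reciprocal of the smallest normalised distance
    from the pixel intensity to any of the K Gaussian components."""
    d = np.abs(intensity - mu) / sigma           # one distance per component
    return 1.0 / max(d.min(), 1e-12)             # guard against an exact match

def unary_gmm(intensity, mu, sigma, label, const=10.0):
    """GMM likelihood term L_G of Eq. (4): cheap to label 'bg' when the
    pixel fits some background component well, expensive otherwise."""
    inv = 1.0 / gmm_score(intensity, mu, sigma)  # = min_k |I - mu_k| / sigma_k
    return inv if label == 'bg' else const - inv

# A pixel near the first background component (mu = 100) is cheap as 'bg'.
mu = np.array([100.0, 180.0])
sigma = np.array([10.0, 20.0])
e_bg = unary_gmm(105.0, mu, sigma, 'bg')
e_fg = unary_gmm(105.0, mu, sigma, 'fg')
```

For intensity 105 the nearest component gives a normalised distance of 0.5, so labelling the pixel background costs 0.5 while labelling it foreground costs const − 0.5.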
Then, for each node v, we compute the minimum distances from its intensity I(v) to the foreground and background clusters:

dF(v) = min_n ‖I(v) − KFn‖ ,  dB(v) = min_n ‖I(v) − KBn‖

We define the appearance likelihood term as:

LA(I(v), xv) = { dB(v) / (dF(v) + dB(v)),  if xv = 'bg'
               { dF(v) / (dF(v) + dB(v)),  if xv = 'fg'    (5)

Prior Constraint Likelihood. Since we have already extracted the rough moving-object bounding box, we assume that the center of the bounding box has a higher probability of being detected as foreground than the other parts; for example, a human body should not be broken in the middle, as in Figure 1(b). We define our prior constraint term as:

LC(I(u, v) | xuv) = { const − log Pc(u, v),  if xv = 'bg'
                    { − log Pc(u, v),        if xv = 'fg'    (6)

Here we use a 2D Gaussian density function Pc to model the likelihood distribution, with the center of the box as its mean point and (u, v) the coordinates relative to that point. We take height/6 as the first standard deviation σ1 and width/6 as the second standard deviation σ2; the choice of 1/6 is examined in the experiment section.

Smoothness Term. In [4][5], only one modality was used in the smoothness term. Here we consider two modalities, to see whether they can force the best segmentation to follow salient edges in the current frame. We define our smoothness terms as:

φA(Du | xu, xv) = |xu − xv| · (1 / Ωuv)    (7)
φG(Du | xu, xv) = |xu − xv| · (1 / △uv)    (8)

where Ωuv is the distance in color appearance and △uv is the distance in GMM likelihood score:

Ωuv = ‖I(u) − I(v)‖² ,  △uv = ‖score(u) − score(v)‖²

3.3. Divide and Conquer

Surveillance video cameras are indeed often fixed, so the sequences present a static background together with moving objects that are semantically meaningful. This means we only need to attend to the moving objects in a frame, not the whole frame. We use a pixelwise adaptive GMM as our baseline background model, since it is robust to slight changes such as swaying leaves and flickering monitors. After thresholding, the GMM likelihood score map becomes a binary map, and rough moving-object bounding boxes are obtained by extracting the connected components of that map. Finally, we regard each box as an independent sub-frame and apply MMCut to each of them — that is, to the moving-box areas instead of the whole frame.

3.4. Multi-Core

As mentioned in Section 3.3, and because multi-core CPUs are much cheaper than before, we separate the original foreground detection problem of a frame into several subproblems. On a quad-core CPU, for instance, up to four moving-object boxes can enter MMCut processing at the same time, as in Figure 2. We thus not only reduce the graph-cut workload but also solve several subproblems simultaneously. Moreover, in the adaptive GMM background model each pixel is independent of every other pixel, so the background model can also be updated in parallel on multiple cores. With these methods we succeed in reducing the computing time of MMCut; the experimental results are in Table 1.

Figure 2. Instead of putting the whole frame into graph-cut processing, we divide the job into several smaller tasks and apply MMCut to them simultaneously on multiple cores, saving considerable time over the traditional method.

4. EXPERIMENTS

We evaluate foreground detection performance on the IPPR2006 video dataset (3 videos, 320×240 pixels, 24-bit RGB), naming the three sequences "Indoor1", "Indoor2", and "Outdoor1". The scene of "Indoor1" is a lobby under poor light, "Indoor2" is a hallway with strong sunlight, and "Outdoor1" is a road with moving vehicles and people.
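The per-box parallel scheme of Sections 3.3 and 3.4, whose timing Table 1 reports, can be sketched as follows (a minimal sketch using a thread pool and a placeholder per-box segmenter; in the real system each worker would run the MMCut min-cut, and process-level or native-thread parallelism would be needed for a true multi-core speedup):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def segment_box(crop):
    """Placeholder for MMCut on one bounding-box sub-frame;
    a simple intensity threshold keeps the sketch self-contained."""
    return crop > 128

def segment_frame(frame, boxes, workers=4):
    """Divide and conquer: crop each moving-object bounding box and
    segment the crops in parallel instead of cutting the whole frame."""
    crops = [frame[y0:y1, x0:x1] for (y0, y1, x0, x1) in boxes]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        masks = list(pool.map(segment_box, crops))
    full = np.zeros(frame.shape, dtype=bool)   # everything else is background
    for (y0, y1, x0, x1), m in zip(boxes, masks):
        full[y0:y1, x0:x1] = m
    return full

frame = np.zeros((8, 8), dtype=np.uint8)
frame[1:3, 1:3] = 200                          # one moving object
frame[5:7, 4:7] = 200                          # another moving object
mask = segment_frame(frame, [(1, 3, 1, 3), (5, 7, 4, 7)])
```

The key design point is that the per-box problems share no state, so the only serial work is the cheap composition of the sub-masks back into the full-frame mask.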
Three methods are compared: Stauffer's adaptive GMM [2], adaptive GMM followed by morphological operations (opening and closing), and the best detection result in the IPPR06 competition [9]. We define the error rate as (# stands for number of pixels):

Error per Frame = # false positives + # false negatives

To test whether the prior constraint benefits detection, we selected two important clips from Dataset1 that contain objects that are hard to detect precisely, named Broken Object1 and Broken Object2. The results after MMCut processing are shown in Figure 3, where δ is the weight of the prior constraint in the likelihood term. The prior constraint improves the detection result significantly, and we obtain the lowest error rate at δ = 1.5.

Figure 3. δ is the weight of LC. The error rate is reduced significantly when the prior constraint is taken into account.

Because the GMM likelihood is close to the color appearance yet still differs from it, we tested which ratio between them gives the best result. As Figure 4 shows, the best result occurs when only the GMM likelihood is considered, and there are no notable changes on Dataset1 and Dataset2.

Figure 4. τ is the ratio between GMM likelihood and appearance; τ = 0 means considering only the GMM likelihood. The best result appears at τ = 0.

We have also tested changing the ratio between φG and φA in the smoothness term. The result improves only slightly (about 0.1%), but there is a genuine benefit to combining the two modalities in the smoothness term.

Finally, we evaluate the three datasets against the other three methods, as shown in Figure 5. Our method outperforms "Adaptive GMM" and "Adaptive GMM + post-processing", and it also beats the best IPPR06 result on Dataset1 and Dataset3. Because there are many lighting changes in Dataset2 and we do not consider shadow removal, our performance there falls behind the best entry in IPPR2006. Since the shadow removal algorithm proposed in [5] performs excellently, we may incorporate its shadow term into our energy function in future work.

Figure 5. MMCut outperforms "Adaptive GMM + post-processing" and even the best performance in the IPPR 2006 competition; our method is also far more efficient than the IPPR 2006 best (0.04 fps vs. 30 fps).

To demonstrate the usefulness of our divide-and-conquer approach to the MRF problem, we test our method on Dataset1 and list the frame rate (FPS) for different numbers of cores. The results are in Table 1.

CPU                       | Threads | Frames per sec.
Intel Core2 Duo 2.16 GHz  | 1       | 31
Intel Core2 Duo 2.16 GHz  | 2       | 43
Intel Core2 Quad 2.4 GHz  | 1       | 36
Intel Core2 Quad 2.4 GHz  | 4       | 60

Table 1. Using our method saves about 28% of the time on a dual-core system and 66% on a quad-core system. Our method is also faster than Y. Sun's method [5].

Figure 6. Example results of three methods. Each subfigure shows, from left to right, the original frame, Stauffer's method, Stauffer's method after morphological operations, and our method. (Panels (a)–(d).)

From these figures, we can see that the false negative parts of the foreground objects (where the foreground looks like the background) have been successfully reduced. Our algorithm is also more efficient than other graph-cut-based foreground detection methods [4][5][6].

5. CONCLUSIONS

This work addresses the detection problem in which the foreground object is similar to the background. We extend current efforts on accurate foreground segmentation by combining a multi-modal MRF instead of conventional background subtraction, focusing in particular on the spatial constraint of an object. Because MRF inference is time-costly, we separate the frame into smaller blocks and process them in the MRF in parallel on multiple cores. We introduce a better MRF energy function — based on the GMM likelihood, color appearance information, and spatial constraints — to handle moving objects in static-camera video.

6. REFERENCES

[1] S. Jabri, Z. Duric, H. Wechsler, and A. Rosenfeld, "Detection and location of people in video images using adaptive fusion of color and edge information," Proc. ICPR, 2000.
[2] C. Stauffer, "Adaptive background mixture models for real-time tracking," Proc. CVPR, 1999.
[3] Y. Boykov and V. Kolmogorov, "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision," TPAMI, vol. 26, no. 9, pp. 1124–1137, 2004.
[4] N. R. Howe, "Better foreground segmentation through graph cuts," Technical report, Smith College, 2004.
[5] Y. Sun, B. Yuan, Z. Miao, and C. Wan, "Better foreground segmentation for static cameras via new energy form and dynamic graph-cut," Proc. ICPR, 2006.
[6] K. Takahashi and T. Mori, "Foreground segmentation with single reference frame using iterative likelihood estimation and graph-cut," Proc. ICME, 2008.
[7] Y. Wang, K. Huang, and T. Tan, "Human activity recognition based on R transform," Proc. CVPR, 2007.
[8] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, "Lazy snapping," Proc. SIGGRAPH, 2004.
[9] The Chinese Image Processing and Pattern Recognition Society (IPPR), http://www.ippr.org.tw/.