DETECTION AND CLASSIFICATION FOR GROUP MOVING HUMANS

WALID SULIMAN ELGENAIDI

A dissertation submitted in partial fulfillment of the requirements for the award of the degree of Master of Engineering (Electrical-Electronics & Telecommunication)

Faculty of Electrical Engineering
Universiti Teknologi Malaysia

May, 2007

DEDICATION

“To My Beloved Father, Mother, Brothers and Sisters”

ACKNOWLEDGEMENTS

First and foremost, I thank God Almighty for giving me the strength to complete my research. I would also like to express my gratitude and respect to my research supervisor PM DR. SYED ABD. RAHMAN AL-ATTAS for his constant support and guidance during my graduate studies at Universiti Teknologi Malaysia.

Thanks to all of my colleagues and friends with whom I had the opportunity to learn, share and enjoy. It has been a pleasure. Finally, special and infinite thanks to the most important people in my life, my parents, for their love, prayers, sacrifice and support.

ABSTRACT

When a group of humans moves together, recognition algorithms often misclassify the group as a vehicle or as another large moving object. The aim of this project is therefore to detect moving objects and classify each one as either a group of humans or something else. The background subtraction technique is employed in this work because it provides the complete feature of the moving object. However, it is extremely sensitive to dynamic changes such as changes of illumination. The detected foreground pixels usually contain noise and small movements such as tree leaves. These isolated pixels are filtered by preprocessing operations: a median filter followed by a sequence of morphological operations (dilation and erosion). The object is then extracted using a border extraction technique. The classification makes use of the shape of the object. The proposed technique achieved 75% accuracy based on 18 test samples.
This result shows that it is possible to distinctly classify a group of humans moving in a video sequence from other large moving objects such as vehicles.

ABSTRAK

In most cases of recognizing humans who move in a group, recognition algorithms often misclassify the group of humans, classifying it as a vehicle or as a single large moving object. Therefore, the main objective of this project is to detect such a moving object and classify it as either a moving group of humans or otherwise. The background subtraction technique is used so that the complete features of the moving object can be obtained. However, this technique is very sensitive to dynamic changes such as changes in illumination. The segmented pixels usually contain a lot of noise and small movements, such as the movement of tree leaves. These pixels are processed with several filters, such as a median filter, followed by several morphological operations, namely dilation and erosion. The border of the object is then extracted using a border extraction technique. The classification process uses the shape information of the object. Based on experiments on 18 test samples, the proposed technique was found to have an accuracy of up to 75%. This result shows that the algorithm has the potential to accurately distinguish a moving group of humans from other moving objects such as vehicles.

TABLE OF CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF NOMENCLATURES

1 INTRODUCTION
  1.1 Overview
  1.2 Overview system stages
  1.3 Objective of this study
  1.4 Scope of this study
  1.5 Project’s Outline

2 LITERATURE REVIEW
  2.1 Introduction
  2.2 Reference Based Approach
  2.3 Experimental Results
  2.4 Background Subtraction
  2.5 Motion Detection and Shape-Based Detection
  2.6 Summary

3 PROJECT METHODOLOGY
  3.1 Introduction
  3.2 System Overview
  3.3 Detection stage
    3.3.1 Image Capture
    3.3.2 Background Model
    3.3.3 Foreground Model
  3.4 Object Preprocessing
    3.4.1 Median Filter
    3.4.2 Morphological Operations
    3.4.3 Dilation
    3.4.4 Erosion
    3.4.5 Border Extraction
  3.5 Classification Stage
    3.5.1 Extracting Object Features
      Centre of Mass
      Distance between the center point and the border of the object
    3.5.2 Classification Metric

4 EXPERIMENTAL RESULT
  4.1 Introduction
  4.2 Moving object Position
  4.3 Image Capture Results
  4.4 Background Subtraction Results
  4.5 Median Filter
  4.6 Dilation
  4.7 Erosion
  4.8 Region Filling
  4.9 Border Extraction
  4.10 Feature Extraction Result
  4.11 Classification Results
  4.12 Recognition Accuracy
  4.13 Factors that Contribute to Low Accuracy

5 CONCLUSION AND SUMMARY
  5.1 Summary
  5.2 Conclusions
  5.3 Recommendations for Future Work

REFERENCES
APPENDIX A: MATLAB DECLARATIONS
APPENDIX B: IMAGES USED IN THE DATA BASE

LIST OF TABLES

4.1 The results of the classification metric for 12 samples of the group of humans, where (THB = 0.04)
4.2 The results of the classification metric for 8 samples of others, where (THB = 0.04)
4.3 Performance accuracy

LIST OF FIGURES

2.1 Flow chart of the proposed approach
2.2 Multiple human detection from indoor sequences
2.3 Multiple human detection from outdoor sequences
3.1 A generic framework for the algorithm
3.2 The system block diagram
3.3 Convert the frames to grayscale frame
3.4 (a) The original image (unfiltered image). (b) After filtering: the center value (previously 97) is replaced by the median of all nine values (4)
3.5 (a) Structure ‘B’ (b) A simple binary image ‘A’ (c) Result of dilation process
3.6 (a) Structure ‘B’ (b) A simple binary image ‘A’ (c) Result of erosion process
3.7 The output of the border after subtracting the eroded image from the original one
3.8 (a) A simple binary image (b) Result of using border extraction
3.9 Sample objects and their silhouettes
3.10 Extracting the border from the foreground
3.11 (a) and (c) The object's border with distance and center point (b) and (d) Sample distance signal calculation and normal distance signals
3.12 (a), (b) The border and distance graphs of the vehicle. (c) The border and distance graph of the group of humans
3.13 (a) Graph of the AVG. (b) The AVG graph between 60-120
4.1 Different camera positions
4.2 Result of the image capture and grayscale image
4.3 Result of the background subtraction
4.4 Results of the median filter
4.5 Results of the dilation process
4.6 Result of the erosion process
4.7 The result of the region filling process
4.8 The result of the border extraction process
4.9 Results of the shape feature
4.10 Results for classifying the group of humans
4.11 Results for classifying the others
4.12 Incorrect results of classification

LIST OF SYMBOLS

B(j,i) - Background image
C(j,i) - Current image
g(j,i) - The output after the threshold process
TH - The threshold value
Cm - Center point
YCm - Center point for y coordinates
XCm - Center point for x coordinates
Dist - The Euclidean distance
DS - The normalized distance signal

CHAPTER 1

INTRODUCTION

1.1 Introduction

Video surveillance systems have long been in use to monitor security-sensitive areas. The history of video surveillance consists of three generations of surveillance systems, called 1GSS, 2GSS and 3GSS [11].

The first generation surveillance systems (1GSS, 1960-1980) were based on analog subsystems for image acquisition, transmission and processing.
They extended the human eye in a spatial sense by transmitting the outputs of several cameras monitoring a set of sites to displays in a central control room. Their major drawbacks were high bandwidth requirements, difficult archiving and retrieval of events because of the large number of video tapes required, and difficult online event detection, which depended only on human operators with a limited attention span.

The next generation surveillance systems (2GSS, 1980-2000) were hybrids in the sense that they used both analog and digital subsystems to resolve some drawbacks of their predecessors. They made use of early advances in digital video processing methods that assist human operators by filtering out spurious events. Most of the work during 2GSS focused on real-time event detection.

Third generation surveillance systems (3GSS, 2000- ) provide end-to-end digital systems. Image acquisition and processing at the sensor level, communication through mobile and fixed heterogeneous broadband networks, and image storage at central servers all benefit from low-cost digital infrastructure. Unlike previous generations, in 3GSS some parts of the image processing are distributed towards the sensor level through intelligent cameras that can digitize and compress acquired analog image signals and run image analysis algorithms such as motion and face detection with the help of their attached digital computing components.

Moving object detection is the basic step for further analysis of video. It handles the segmentation of moving objects from stationary background objects. This not only creates a focus of attention for higher-level processing but also decreases computation time considerably. Commonly used techniques for object detection are background subtraction, statistical models, temporal differencing and optical flow.
Because of dynamic environmental conditions such as illumination changes, shadows and tree branches waving in the wind, object segmentation is a difficult and significant problem that needs to be handled well for a robust visual surveillance system.

The object classification step categorizes detected objects into predefined classes such as human, vehicle, animal, clutter, etc. It is necessary to distinguish objects from each other in order to track and analyze their actions reliably. Currently, there are two major approaches to moving object classification: shape-based and motion-based methods [15].

1.1 Overview

This project designs a recognition system for groups of humans that can be integrated into an ordinary visual surveillance system with moving object detection and classification. The present system operates on grayscale video imagery from a video camera. Detection is handled by an adaptive background subtraction scheme [3], which works reliably in outdoor environments. After segmenting moving pixels from the static background, connected regions are classified into predetermined object categories: group of humans, vehicle or something else.

1.2 Overview system stages

The proposed system is capable of detecting moving objects. The system extracts features of these moving objects and then classifies them into two categories: “group of humans” or “something else”. The methods used can be summarized as follows:

1. Detection step:
   • Background model.
   • Foreground detection.
2. Object Preprocessing.
3. Feature Extraction.
4. Classification.

1.3 Objective of this study

The main objective of this project is to design a system that can detect a moving group of humans and differentiate it from other moving objects. Before classification, the object is processed using image processing techniques to accommodate environmental changes during the acquisition process.
This work can be an important part of intelligent security surveillance.

1.4 Scope of this study

To accomplish this objective, the scope of this study is limited as follows:

1. The scene does not include night vision.
2. The method developed is meant only for outdoor environments.
3. The method makes use of the object's silhouette contour, length and area to classify the detected objects.
4. The camera is facing the front of the object.
5. The system classifies a group of 3 humans and above.
6. The system is programmed using MATLAB.
7. The processing is done offline.

1.5 Project’s Outline

The project is organized into five chapters, outlined as follows.

Chapter 1 - Introduction
This chapter discusses the objective and scope of the project and gives a general introduction to the history of video surveillance and to the classification of the moving objects that are detected.

Chapter 2 - Literature Review
This chapter reviews a previous approach for the detection of multiple moving objects from binocular video sequences, in which an efficient motion estimation method is first applied to the sequences acquired from each camera.

Chapter 3 - Project Methodology
This chapter presents the overall system methodology and discusses in detail each step that has to be taken into consideration for classification purposes.

Chapter 4 - Experimental Results
This chapter shows the results of each process applied to the image in this system, and the final results of the system.

Chapter 5 - Conclusion
This chapter consists of conclusions and recommendations for future improvement.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Many researchers have worked on the detection and classification of objects. One such work is introduced in the following sections. Yang Ran et al. [1] developed a method for the detection of multiple moving people from binocular sequences.
A novel approach for the detection of multiple moving objects from binocular video sequences is reported. First, an efficient motion estimation method is applied to the sequences acquired from each camera. The motion estimation is then used to obtain cross-camera correspondence between the stereo pair. Next, background subtraction is achieved by fusing temporal difference and depth estimation. Finally, moving foregrounds are further segmented into moving objects according to a distance measure defined in a 2.5D feature space, which is done in a hierarchical strategy. The proposed approach has been tested on several indoor and outdoor sequences. Preliminary experiments have shown that the new approach can robustly detect multiple partially occluded moving persons in a noisy background. Representative human detection results are presented.

2.2 Reference Based Approach

In this paper, Yang Ran proposed a novel approach for detecting moving humans from binocular videos. It uses a fast motion estimation technique with sub-pixel accuracy to extract object motion information, which significantly reduces ambiguity and computation cost in establishing dense stereo correspondence. In this approach, both the motion consistency between the two cameras and the stereo disparity map are used for background subtraction and moving object segmentation/grouping. The motion correspondence significantly improves the background subtraction process, while the stereo correspondence trims down the search computation. Figure 2.1 shows a flow chart of this approach.

Figure 2.1 Flow chart of the proposed approach

2.3 Experimental Results

Yang Ran applied the algorithm to a number of stereo sequences acquired by a stationary stereo camera. Two representative results are presented here. The videos are captured at 320x240 resolution and 25 frames per second. Figure 2.2 shows an example of detection results in an indoor scene. The background for the indoor scene is constant during capturing.
The left column shows two input frames (#16 and #25) taken from the left camera. The central column shows the motion (foreground) detection results. The right column shows the person segmentation/grouping results, where different individuals are assigned different gray levels. The first row is the case where no occlusion occurs and the people are at different distances. The second row is the case where occlusion happens and the persons are at different distances.

Figure 2.2 Multiple human detection from indoor sequences

Figure 2.3 shows an example of detecting two people in an outdoor scene. The test demonstrates that detection works even under a cluttered background (due to background vegetation motion) and shadows.

Figure 2.3 Multiple human detection from outdoor sequences

2.4 Background Subtraction

Identifying moving objects from a video sequence is a fundamental and critical task. A common approach is background subtraction, which identifies moving objects from the portion of a video frame that differs significantly from a background model. There are many challenges in developing a good background subtraction algorithm. First, it must be robust against changes in illumination. Second, it should avoid detecting non-stationary background objects such as moving leaves, rain, snow, and shadows cast by moving objects. Finally, its internal background model should react quickly to changes in the background such as the starting and stopping of vehicles.

2.5 Motion Detection and Shape-Based Detection

The object classification step categorizes detected objects into predefined classes such as human, vehicle, animal, clutter, etc. It is necessary to distinguish objects from each other in order to track and analyze their actions reliably. Currently, there are two major approaches to moving object classification: shape-based and motion-based methods.
Shape-based methods make use of the objects’ 2D spatial information, whereas motion-based methods use temporally tracked features of objects for the classification solution.

2.6 Summary

A novel approach for the detection of multiple occluded moving persons from binocular video sequences is presented. By integrating the motion estimation result into every step of the detection process, monocular and binocular correspondences are fused to generate robust detections, which is the main contribution of that work. First, an efficient motion estimation method is applied to the sequences from each camera. The motion estimation is then used to obtain cross-camera correspondence between the stereo pair. Next, background subtraction is achieved by fusion of temporal difference and depth estimation. Finally, foregrounds are further segmented into moving objects according to a distance measure defined in a 2.5D feature space. The proposed approach has been tested on several indoor and outdoor sequences.

CHAPTER 3

PROJECT METHODOLOGY

3.1 Introduction

The system extracts features from the moving objects and classifies them into “group of humans”, “vehicle” or “something else”. This chapter presents in detail the methodology of the proposed system.

3.2 System Overview

The flowchart of the system architecture is shown in Figure 3.1. This chart gives an overview of the main stages of the methodology. The system is divided into two main stages: the detection stage and the classification stage.

3.3 Detection stage

The system operates on grayscale imagery from the video frames. Detection is handled by a background subtraction scheme, which works reliably in outdoor environments.
Figure 3.1 A generic framework for the algorithm (blocks: Object Detection, Object Classification, Decision)

Figure 3.2 The system block diagram (blocks: Image Capture, Background Model, Foreground Model, Object Preprocessing, Feature Extraction, Classification; signals: Current Image, Background, Foreground Image, Object with Feature)

3.3.1 Image Capture

The system captures the images offline from the video (25 frames per second). The system initializes the background using the first frame, and the captured frames are converted to grayscale images.

A grayscale (or gray-level) image is simply one in which the only colors are shades of gray. The reason for differentiating such images from any other sort of color image is that less information needs to be provided for each pixel. In fact, a gray color is one in which the red, green and blue components all have equal intensity in RGB space, so it is only necessary to specify a single intensity value for each pixel, as opposed to the three intensities needed to specify each pixel in a full color image.

The grayscale intensity is stored as an 8-bit integer, giving 256 possible shades of gray from black to white. Grayscale images are very common, in part because much of today's display and image capture hardware can only support 8-bit images. In addition, grayscale images are entirely sufficient for many tasks, so there is no need to use more complicated and harder-to-process color images.

Figure 3.3 Convert the frames to grayscale frame

3.3.2 Background Model

Each application that benefits from smart video processing has different needs and thus requires different treatment. However, they all have something in common: moving objects. Thus, detecting regions that correspond to moving objects, such as a group of humans or something else, in video is the first basic step of almost every vision system, since it provides a focus of attention and simplifies the processing in subsequent analysis steps.
Because of dynamic changes in natural scenes, such as sudden illumination and weather changes and repetitive motions that cause clutter (tree leaves moving in the wind), motion detection is a difficult problem to process reliably. A frequently used technique for moving object detection is background subtraction, which is described below.

Background subtraction is a particularly common technique for motion segmentation in static scenes. It attempts to detect moving regions by subtracting the current image pixel by pixel from a reference background image, which in general is created by averaging images over time in an initialization period. In the background subtraction method used here, the reference background is initialized at the start of the system with the first frame of the video.

3.3.3 Foreground Model

At each new frame, foreground pixels are detected by subtracting the intensity values from the background and comparing the absolute value of each difference with a threshold. The pixels where the difference is above the threshold are classified as foreground. Let B(j,i) represent the gray-level background image, with values in the range [0, 255], and let C(j,i) be the current image [8]. As the generic background subtraction scheme suggests, a pixel at position (j,i) in the current video image belongs to the foreground if it satisfies

\[ \mathrm{Foreground}(j,i) = \lvert B(j,i) - C(j,i) \rvert \ge TH \tag{3.1} \]

where TH is the threshold value. The above equation is used to generate the foreground pixel map, which represents the foreground regions as a binary array where a 1 corresponds to a foreground pixel and a 0 stands for a background pixel. The reference background B(j,i) is initialized with the first video image, and the threshold value is obtained from empirical experiments.
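The thresholding rule of Equation (3.1) can be illustrated with a minimal sketch. The original system was written in MATLAB; the Python below is only an illustration under stated assumptions: images are plain nested lists of gray levels, the function name `foreground_mask` is hypothetical, and the default threshold of 32 is the value reported for TH in Section 3.4.

```python
def foreground_mask(background, current, th=32):
    """Return a binary map with 1 where |B(j,i) - C(j,i)| >= th, else 0.

    `background` and `current` are equally sized nested lists of gray
    levels in [0, 255]. The function name and the list-based image
    representation are assumptions made for this sketch.
    """
    if len(background) != len(current):
        raise ValueError("images must have the same number of rows")
    mask = []
    for b_row, c_row in zip(background, current):
        if len(b_row) != len(c_row):
            raise ValueError("image rows must have the same length")
        # Equation (3.1): absolute difference compared with the threshold.
        mask.append([1 if abs(b - c) >= th else 0
                     for b, c in zip(b_row, c_row)])
    return mask


# Reference background (first frame) and a later frame, as toy data.
background = [[10, 10, 10],
              [10, 10, 10]]
current = [[12, 90, 10],
           [10, 42, 41]]

mask = foreground_mask(background, current)
print(mask)
```

In this toy example only the two pixels whose gray level differs from the background by at least 32 are marked as foreground; the pixel that differs by 31 stays in the background.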
3.4 Object Preprocessing

The outputs of the foreground region detection algorithms explained in the previous three sections generally contain noise and are therefore not appropriate for further processing without noise filtering. In this system, the first method, which uses a simple intensity value, has been applied. The threshold value was fixed at TH = 32 and applied with the rule below, where g(j,i) is the output after the threshold process:

\[ g(j,i) = \begin{cases} 0 & \text{if } \mathrm{Foreground}(j,i) < TH \\ 1 & \text{otherwise} \end{cases} \tag{3.2} \]

3.4.1 Median Filter

The median filter is normally used to reduce noise in an image, and it is a simple and very effective noise removal filtering process. Its performance is particularly good for removing shot noise, which consists of strong spike-like isolated values. The median filter is a sliding-window spatial filter, but it replaces the center value in the window with the median of all the pixel values in the window. An example of median filtering of a single 3x3 window of values is shown below [16].

unfiltered values      median filtered
  6    2    0            *    *    *
  3   97    4            *    4    *
 19    3   10            *    *    *
        (a)                   (b)

In order: 0, 2, 3, 3, 4, 6, 10, 19, 97

Figure 3.4 (a) The original image (unfiltered image). (b) After filtering: the center value (previously 97) is replaced by the median of all nine values (4).

3.4.2 Morphological Operations

The field of mathematical morphology contributes a wide range of operators to image processing, all based around a few simple mathematical concepts from set theory. The operators are particularly useful for the analysis of binary images, and common usages include edge detection, noise removal, image enhancement and image segmentation. The two most basic operations in mathematical morphology are erosion and dilation. Both of these operators take two pieces of data as input: an image to be eroded or dilated, and a structuring element (also known as a kernel).
The two pieces of input data are each treated as representing sets of coordinates, in a way that is slightly different for binary and grayscale images.

The morphological operations erosion and dilation are applied to remove noisy foreground pixels that do not correspond to actual foreground regions, and to remove noisy background pixels near and inside object regions that are actually foreground pixels. The basic operation of a morphology-based approach is the translation of a structuring element over the image and the erosion and/or dilation of the image content based on the shape of the structuring element. A morphological operation analyses and manipulates the structure of an image by marking the locations where the structuring element fits. In mathematical morphology, neighborhoods are therefore defined by the structuring element, i.e., by its shape. There are many types of morphological operation that can be used, but in this project only three of them are used as preprocessing: erosion, dilation, and connected component labeling.

3.4.3 Dilation

Dilation is one of the two basic operators in the area of mathematical morphology, the other being erosion. It is typically applied to binary images, but there are versions that work on grayscale images. The basic effect of the operator on a binary image is to gradually enlarge the boundaries of regions of foreground pixels (i.e. white pixels, typically). Thus areas of foreground pixels grow in size while holes within those regions become smaller, as shown in Figure 3.5.

The dilation operator takes two pieces of data as input. The first is the image which is to be dilated. The second is a set of coordinate points known as a structuring element (also known as a kernel). It is this structuring element that determines the precise effect of dilation on the input image.
To compute the dilation of a binary input image by a structuring element, each of the background pixels in the input image is considered in turn. For each background pixel (the input pixel), the structuring element is superimposed on top of the input image so that the origin of the structuring element coincides with the input pixel position. If at least one pixel in the structuring element coincides with a foreground pixel in the image underneath, then the input pixel is set to the foreground value. If all the corresponding pixels in the image are background, however, the input pixel is left at the background value [5].

\[ D[A, B] = A \oplus B = \bigcup_{\beta \in B} (A + \beta) \tag{3.3} \]

Figure 3.5 (a) Structure ‘B’ (b) A simple binary image ‘A’ (c) Result of dilation process.

The dilation process has many useful properties: it can repair broken edges and helps to obtain a smoother border. Its drawback appears when it is applied to a small object. The following steps have therefore been applied in order to obtain better results:

(a) Calculate the entire area of the object.
(b) If the area of the object is > 500, the dilation process is applied to the object; otherwise no dilation is performed.

3.4.4 Erosion

Erosion is the other basic operator in the area of mathematical morphology. Its basic effect is to erode the boundaries of regions of foreground pixels (i.e. white pixels, typically). Thus areas of foreground pixels shrink in size, and holes within those areas become larger [18], as shown in Figure 3.6.

The erosion operator takes two pieces of data as input. The first is the image which is to be eroded, and the second is a (usually small) set of coordinate points known as a structuring element (also known as a kernel). It is this structuring element that determines the precise effect of the erosion on the input image.

To compute the erosion of a binary input image by this structuring element, each of the foreground pixels in the input image is considered in turn.
For each foreground pixel (which is called the input pixel), the structuring element is superimposed on top of the input image so that the origin of the structuring element coincides with the input pixel coordinates. If, for every pixel in the structuring element, the corresponding pixel in the image underneath is a foreground pixel, then the input pixel is left as it is. If any of the corresponding pixels in the image are background, however, the input pixel is also set to the background value [18].

\[ E[A, B] = A \ominus (-B) = \bigcap_{\beta \in B} (A - \beta) \tag{3.4} \]

Figure 3.6 (a) Structure ‘B’ (b) A simple binary image ‘A’ (c) Result of erosion process.

3.4.5 Border Extraction

This method extracts the outline of the object by eroding the image once and then subtracting the eroded image from the input image, using the formula below [19]:

\[ B(A) = A - (A \ominus B) \]

Figure 3.7 The output of the border after subtracting the eroded image from the original one (panels: A, A ⊖ B, B(A)).

Figure 3.8 (a) A simple binary image (b) Result of using border extraction

Figure 3.9 Sample objects and their silhouettes

3.5 Classification Stage

Categorizing the type of a detected video object is a crucial step in achieving this goal. With the help of object type information, more specific and accurate methods can be developed to recognize the objects. Hence, this project develops a novel video object classification method based on object shape. Typical video scenes may contain a variety of objects such as groups of humans, vehicles, animals, natural phenomena (e.g. rain, snow), plants and clutter. However, the main targets of interest in surveillance applications are generally groups of humans.

3.5.1 Extracting Object Features

After detecting foreground regions and applying post-processing operations to remove noise and shadow regions, the individual blobs that correspond to objects are found, and spatial features such as the bounding box, size, center of mass and silhouette of these regions are calculated.
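The spatial features named above can be sketched for a single binary blob as follows. This is an illustrative Python sketch, not the MATLAB code of the original system: the function name `blob_features`, the nested-list image representation and the choice to average over all blob pixels (rather than only border pixels) are assumptions.

```python
def blob_features(mask):
    """Compute size, bounding box and center of mass of a binary blob.

    `mask` is a nested list where 1 marks a foreground pixel.
    Returns None when the mask contains no foreground pixel.
    (Hypothetical helper written for this sketch.)
    """
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    n = len(xs)
    if n == 0:
        return None
    return {
        "size": n,                                        # number of pixels
        "bounding_box": (min(xs), min(ys), max(xs), max(ys)),
        # Center of mass as the mean x and mean y of the blob pixels.
        "center": (sum(xs) / n, sum(ys) / n),
    }


blob = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
features = blob_features(blob)
print(features)
```

For this 3-by-2 block the sketch reports six pixels, a bounding box from (1, 1) to (3, 2), and a center at (2.0, 1.5).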
Figure 3.10 Extracting the border from the foreground (blocks: Labeled Foreground regions (Blobs), Dilation, Erosion, Filtered Foreground regions (Blobs), Centre of Region, Object only with Border)

Centre of Mass

After extracting the border of the foreground region as shown above in Figure 3.8, the center of the object is calculated by simply finding the average of all x coordinates and of all y coordinates.

In order to calculate the center of mass point, Cm = (XCm, YCm), of an object [4], we use the following equations:

\[ X_{Cm} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.5} \]

\[ Y_{Cm} = \frac{\sum_{i=1}^{n} Y_i}{n} \tag{3.6} \]

where n is the number of pixels in the object.

Distance between the center point and the border of the object

After calculating the center of mass of the object and extracting its border, the distance between the border and the center is calculated. The algorithm used to calculate the distance is shown in Figure 3.11.

Let S = {P1, P2, ..., Pn} be the silhouette of an object O, consisting of 180 points ordered counterclockwise from 0 degrees to 180 degrees about the center point of the detected region. The distance signal DS = {d1, d2, ..., dn} is generated by calculating the distance between Cm and each Pi, for i from 1 through 180, as follows:

\[ d_i = \mathrm{Dist}(C_m, P_i), \quad \forall i \in [1, \ldots, 180] \tag{3.7} \]

where the Dist function is the Euclidean distance.

Different objects have different shapes in video and therefore have silhouettes of varying sizes. Even the same object has a contour size that changes from frame to frame. In the next step, the scaled distance signal d(i) is calculated and normalized to have integral unit area.
The normalized distance signal DS is calculated using the following equation:

$$DS[i] = \frac{d(i)}{\sum_{j=1}^{180} d(j)} \tag{3.8}$$

Figure 3.11 (a) and (c) The object border with distances and center point (b) and (d) Sample distance signals and normalized distance signals (axes: points versus DS[i])

The main idea behind the shape feature is to look at the location of the heads. Heads appear as peaks in the distance signal of a group of humans. The heads can lie anywhere between 0 and 180 degrees, but most are located between 60 and 120 degrees. This feature is illustrated in Figure 3.12.

Figure 3.12 (a), (b) The border and distance graphs of the vehicle (c) The border and distance graph of the group of humans

3.5.2 Classification Metric

Numerous methods have been used to classify objects based on shape [14, 3, 13, 2, 10]. Our object classification metric is based on the similarity of object border shapes. After obtaining the distance signal, as in Figure 3.12 (b) and (c), the next step is to compare the border distance of the input object with the stored border distance, which can be calculated offline, as follows:

$$\text{Result} = \begin{cases} \text{group of humans} & \text{if } \sum \left| Dst_{AB} - AVG \right| \ge T_{DB} \\ \text{others} & \text{otherwise} \end{cases} \tag{3.9}$$

where Dst_AB is the distance signal extracted from the object, AVG is the value measured offline, and T_DB = 0.04 is the threshold obtained from empirical experiments (written THB in Chapter 4). AVG is illustrated in Figure 3.13. The classification results obtained by applying the above rule are reported in Chapter 4.

Figure 3.13 (a) Graph of the AVG (b) The AVG graph between 60 and 120 degrees

CHAPTER 4

EXPERIMENTAL RESULTS

4.1 Introduction

This chapter presents the experimental results of this project. The results of each process, including the preprocessing stages, are presented. In addition, this chapter discusses the performance of the technique and the factors that affect the accuracy of the system.
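Before turning to the results, the decision rule of Equation (3.9) can be illustrated with a small MATLAB sketch. The two distance signals below are synthetic and purely hypothetical; only the 60 to 120 degree window and the 0.04 threshold come from the text, and the variable names are assumptions made for this example.

```matlab
% Hypothetical distance signals sampled at 1..180 degrees.
theta = 1:180;
refSignal = ones(1, 180);          % flat stand-in for the stored AVG signal
inSignal  = ones(1, 180);          % stand-in for an input object
inSignal([70:75, 88:93, 106:111]) = 3;   % three head-like peaks

% Normalize each signal so that its samples sum to one (unit area).
refSignal = refSignal / sum(refSignal);
inSignal  = inSignal  / sum(inSignal);

% Compare the signals only inside the 60 to 120 degree window.
window = theta >= 60 & theta <= 120;
score = sum(abs(inSignal(window) - refSignal(window)));

% Apply the empirical threshold of the text.
T = 0.04;
if score >= T
    label = 'group of humans';
else
    label = 'others';
end
fprintf('score = %.4f -> %s\n', score, label);
```

With these toy signals the peaked input differs strongly from the flat reference inside the window, so the score lies well above the threshold. Real signals would come from the border distances of Section 3.5.1.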
4.2 Moving Object Position

The distance of the moving object from the camera, the object location and the camera positioning are the most important issues during feature extraction. In this system, the camera faces the front of the object. Figure 4.1 shows the camera positions used.

Figure 4.1 Different camera positions

4.3 Image Capture Results

The system captures frames from the video and converts them to grayscale images. Figure 4.2 shows the results of the image capture.

Figure 4.2 Result of the image capture and grayscale image

4.4 Background Subtraction Results

Figure 4.3 shows the results of the background subtraction algorithm after comparison with the threshold value, as explained in Chapter 3.

Figure 4.3 Result of the background subtraction

4.5 Median Filter

The median filter is applied to remove the noise that appears in the image after the background subtraction operation. Figure 4.4 shows the results of the median filter.

Figure 4.4 Results of the median filter

4.6 Dilation

The dilation process links the broken parts of the object border so that the object keeps a single connected shape. Figure 4.5 shows the results of the dilation operation.

Figure 4.5 Results of the dilation process

4.7 Erosion

The erosion process is applied after dilation so that the small details of the object are not lost, which helps to improve the border of the object. Figure 4.6 shows the results of the erosion process.

Figure 4.6 Result of the erosion process

4.8 Region Filling

This process fills the object with white pixels in order to improve the border extraction. Figure 4.7 shows the result of this process.

Figure 4.7 The result of the region filling process

4.9 Border Extraction

The border extraction algorithm extracts the outline of the object. Figure 4.8 shows the result of the algorithm.
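The border extraction step can be sketched in a few lines of MATLAB. The sketch below is an illustration under stated assumptions, not the project code: it erodes a toy binary image with a 3-by-3 square structuring element using conv2 instead of the Image Processing Toolbox, and it treats pixels outside the image as background. The test image and variable names are hypothetical.

```matlab
% Toy binary image: a filled 5-by-5 square inside a 9-by-9 frame.
A = zeros(9, 9);
A(3:7, 3:7) = 1;

% Erosion with a 3-by-3 square: a pixel survives only if all nine
% pixels in its neighborhood are foreground (pixels outside the
% image count as background because conv2 pads with zeros).
neighborCount = conv2(A, ones(3, 3), 'same');
eroded = double(neighborCount == 9);

% Border = original image minus its erosion.
border = A - eroded;

disp(border);
```

The result is a one-pixel-wide ring around the square: the erosion removes the outermost layer of the object, and the subtraction keeps exactly that layer.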
Figure 4.8 The result of the border extraction process

4.10 Feature Extraction Result

In this section, the extracted shape feature is shown in the graphs below; the border of the object is represented by its distance graph. Figure 4.9 shows these results.

Figure 4.9 Results of the shape feature

Table 4.1: The results of the classification metric for 10 samples of the group of humans, where THB = 0.04.

No           Result = Σ|Dst_AB − AVG|    Comment
Sample_1     0.0598                      True (result > THB)
Sample_2     0.156                       True (result > THB)
Sample_3     0.0600                      True (result > THB)
Sample_4     0.207                       True (result > THB)
Sample_5     0.2299                      True (result > THB)
Sample_6     0.0821                      True (result > THB)
Sample_7     0.0325                      Failed (result < THB)
Sample_8     0.0771                      True (result > THB)
Sample_9     0.0056                      Failed (result < THB)
Sample_10    0.0574                      True (result > THB)

Table 4.2: The results of the classification metric for 8 samples of others, where THB = 0.04.

No           Result = Σ|Dst_AB − AVG|    Comment
Sample_1     0.0108                      True (result < THB)
Sample_2     0.0381                      True (result < THB)
Sample_3     0.1462                      Failed (result > THB)
Sample_4     0.0665                      Failed (result > THB)
Sample_5     0.0158                      True (result < THB)
Sample_6     0.0074                      True (result < THB)
Sample_7     0.0056                      True (result < THB)
Sample_8     0.0108                      True (result < THB)

4.11 Classification Results

The classification process is the key step of this system: it assigns each object to one of two classes (group of humans and others), as shown in Figure 4.10 and Figure 4.11.

Figure 4.10 Results for classifying the group of humans

Figure 4.11 Results for classifying the others

4.12 Recognition Accuracy

A high accuracy system with a low error rate is required. The main target of the classification process is to distinguish groups of humans from other objects. The video samples were selected to test the accuracy under a range of conditions, such as different object positions and different camera positions. The recognition accuracy for this system is shown in Table 4.3.
Table 4.3: Performance accuracy

Object                  Samples    Success    Fail    Classification accuracy
Group of humans         10         8          2       80%
Others                  8          6          2       75%
Average success rate                                  77%

4.13 Factors that Contribute to Low Accuracy

Several factors affect the accuracy of the classification. The first is poor video quality. The second is the change in the shape of the object caused by the preprocessing stages (dilation, erosion, etc.). Other factors include the distance between the camera and the objects and changes in the natural scene, such as sudden illumination and weather changes. Figure 4.12 shows an example of a failed classification.

Figure 4.12 Incorrect results of classification

CHAPTER 5

CONCLUSION AND SUMMARY

5.1 Summary

A program for the detection and classification of groups of moving humans has been developed in this project using the shape of the object silhouettes. Background subtraction has been used to detect the moving objects because of its high performance in handling moving objects. The results show that the presented method is promising. The shape feature extraction method has been used to classify the moving objects. Finally, the classification of groups of moving humans has been achieved with some misclassification errors, which were caused by the poor quality of the video.

5.2 Conclusions

In general, the objective of detecting and classifying groups of moving humans has been achieved. The program developed is currently fit for offline application. The images have been captured from the same camera position. To ensure that the results obtained are reliable, the system must capture good quality images.

The input image for this system passes through many preprocessing stages before it is suitable for the classification process. Extracting the feature vector is the most important part of this project, since it determines whether the main aim of the project can be achieved.
By extracting the most descriptive features from the moving objects, a system with high classification accuracy can be produced.

5.3 Recommendations for Future Work

All the objectives were accomplished within the scope and limitations of the project. A few recommendations that might be helpful for future work are given below.

• Use different color ranges of the image and the large mass coverage of a single color in the case of the vehicle class.
• Use 3D images, which can help in detecting and classifying the objects.
• Consider object motion in different situations.
• Enlarge the feature vector so that higher classification accuracy can be achieved.
• Convert this system from an offline application to an online application so that the actual performance of the algorithm can be verified.

REFERENCES

[1] Ran, Y. and Zheng, Q. Multi moving people detection from binocular sequences. Center for Automation Research, Institute for Advanced Computer Studies, University of Maryland, USA.

[2] Arkin, E. M., Chew, L. P., Huttenlocher, D. P., Kedem, K., and Mitchell, J. S. B. (1991). An efficiently computable metric for comparing polygonal shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:209–216.

[3] Collins, R. T., Gross, R., and Shi, J. (2002). Silhouette-based human identification from body shape and gait. In Proc. of Fifth IEEE Conf. on Automatic Face and Gesture Recognition, pages 366–371.

[4] Collins, R. T. (2000). A system for video surveillance and monitoring: VSAM final report. Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University.

[5] Brodsky, T. (2002). Visual Surveillance in Retail Stores and in the Home, chapter 4, pages 51–61. Video-Based Surveillance Systems. Kluwer Academic Publishers, Boston.

[6] Fujiyoshi, H. and Lipton, A. J. (1998). Real time human motion analysis by image skeletonization. In Proc. of Workshop Applications of Computer Vision, pages 15–21.

[7] Healey, G., Slater, D., Lin, T., Drda, B., and Goedeke, A. D. (1993). A system for real-time fire detection. Computer Vision and Pattern Recognition, pages 605–606.

[8] Heijden, F. (1996). Image Based Measurement Systems: Object Recognition and Parameter Estimation. Wiley.

[9] Heikkila, J. and Silven, O. (1999). A real-time system for monitoring of cyclists and pedestrians. In Proc. of Second IEEE Workshop on Visual Surveillance, pages 74–81, Fort Collins, Colorado.

[10] Ramoser, H., Schlögl, T., Winter, M., and Bischof, H. (2003). Shape-based detection of humans for video surveillance. In Proc. of IEEE Int. Conf. on Image Processing, Barcelona, Spain.

[11] Loncaric, S. (1998). A survey of shape analysis techniques. Pattern Recognition, 31(8):983–1001.

[12] Oberti, F., Ferrari, G., and Regazzoni, C. S. (2002). A Comparison between Continuous and Burst, Recognition Driven Transmission Policies in Distributed 3GSS, chapter 22, pages 267–278. Video-Based Surveillance Systems. Kluwer Academic Publishers, Boston.

[13] Saykol, E., Gudukbay, U., and Ulusoy, O. (2002). A histogram-based approach for object-based query-by-shape-and-color in multimedia databases. Technical Report BUCE-0201, Bilkent University.

[14] Saykol, E., Gulesir, G., Gudukbay, U., and Ulusoy, O. (2002). KiMPA: A kinematics-based method for polygon approximation. In International Conference on Advances in Information Systems (ADVIS'02), pages 186–194, Izmir, Turkey.

[15] Veltkamp, R. C. and Hagedoorn, M. (2001). State-of-the-art in shape matching, pages 87–119. Principles of Visual Information Retrieval. Springer.

[16] Wang, L., Hu, W., and Tan, T. (2003). Recent developments in human motion analysis. Pattern Recognition, 36(3):585–601.

[17]

[18] Hypermedia Image Processing Reference. (1995). Department of Artificial Intelligence, University of Edinburgh, UK, Version 1.
[19]

APPENDIX A

MATLAB DECLARATIONS

% Capture the images from the video
% frameA and frameB are placeholder frame indices; choose them per video
frameA = 1;
frameB = 2;
X = aviread('video sample .avi');
I1 = frame2im(X(frameA));
I2 = frame2im(X(frameB));

% Change to gray levels
I1 = rgb2gray(I1);
I2 = rgb2gray(I2);
I1_d = double(I1);
I2_d = double(I2);

% Background subtraction
vwidth = 384;
vheight = 288;
for j = 1:vheight
    for i = 1:vwidth
        backsubtract(j,i) = I1_d(j,i) - I2_d(j,i);
        backabs(j,i) = abs(backsubtract(j,i));
    end
end

% Thresholding: differences below 32 become background (0), the rest foreground (255)
for j = 1:vheight
    for i = 1:vwidth
        if backabs(j,i) < 32
            backabs(j,i) = 0;
        else
            backabs(j,i) = 255;
        end
    end
end

% Median filter (applied twice)
MedImage = medfilt2(backabs);
MedImage1 = medfilt2(MedImage);

% Morphological operations
BW = bwareaopen(MedImage1, 50);             % remove blobs smaller than 50 pixels
areaObj = bwarea(BW);
if areaObj > 500
    se = strel('square', 3);
    dilate1 = imdilate(BW, se);             % dilation process
    dilate2 = imdilate(dilate1, se);
    erode1 = imerode(dilate2, se);          % erosion process
    Imgefilled = imfill(erode1, 'holes');   % region filling process
else
    Imgefilled = imfill(BW, 'holes');       % region filling process
end
se = strel('square', 3);
erode = imerode(Imgefilled, se);
f = erode;

% Calculate the center point and the distances
% Count the foreground pixels
count = 0;
for j = 1:vheight
    for i = 1:vwidth
        if f(j,i) == 1
            count = count + 1;
        end
    end
end

% Calculate the summation of the vertical (row) coordinates
sumc = 0;
for j = 1:vheight
    for i = 1:vwidth
        if f(j,i) == 1
            sumc = sumc + j;
        end
    end
end

% Calculate the summation of the horizontal (column) coordinates
sumr = 0;
for j = 1:vheight
    for i = 1:vwidth
        if f(j,i) == 1
            sumr = sumr + i;
        end
    end
end

% Calculate the center of mass
xc = round(sumr / count);
yc = round(sumc / count);

% Border extraction
% Topmost and rightmost object pixels on the lines through the center
% (y_top and x_right are not used further below)
for i = yc:-1:1
    if f(i,xc) == 0
        y_top = i + 1;
        break;
    end
end
for i = xc:vwidth
    if f(yc,i) == 0
        x_right = i - 1;
        break;
    end
end
f_erode = imerode(f, se);
f_diff = uint8(f - f_erode);                % border = image minus its erosion
imwrite(f_diff, 'border.bmp', 'bmp');

% Calculate the distances between the center point and the border (0 to 180 degrees)
dist = 0;
width = vwidth;
height = vheight;
white = 1;
wdist = 0;
loop = 0;
step = 1;
N = 180 / step;
for theta = 0:step:180
    theta_r = pi * theta / 180;
    loop = loop + 1;
    if (theta ~= 90)
        wdist(loop,1) = 0;
        wdist(loop,2) = 0;
        for x = 1:(width - xc)
            % Offset along the ray at angle theta
            if (theta > 90)
                y = -x * tan(theta_r);
            else
                y = x * tan(theta_r);
            end
            if (theta > 90)
                cx = round(xc - x);
            else
                cx = round(x + xc);
            end
            cy = round(yc - y);
            % Clamp coordinates below 1 to the first row or column
            if (cy <= 0)
                cy = 1;
            end
            if (cx <= 0)
                cx = 1;
            end
            if (f_diff(cy,cx) == white)
                % Border pixel found: store the angle and the Euclidean distance
                dist = sqrt((cy - yc).^2 + (cx - xc).^2);
                wdist(loop,1) = theta;
                wdist(loop,2) = dist;
                break;
            end
            % No border pixel yet: fall back to the distance of the previous angle
            if (loop > 1)
                if (wdist(loop,1) == 0)
                    wdist(loop,1) = theta;
                end
                if (wdist(loop,2) == 0)
                    wdist(loop,2) = wdist(loop-1,2);
                end
            end
        end
    else
        % theta == 90: search straight up from the center
        wdist(loop,1) = theta;
        for y = 1:1:(yc-1)
            if (f_diff(yc-y, xc) == white)
                wdist(loop,2) = y;
                break;
            end
        end
    end
end

% Normalize the distance signal to unit area
u = wdist(:,2);
total = sum(u);
u_norm = u / total;

% Keep only the range of 60 to 120 degrees
u_norm_new = u_norm;
range = 60;
for i = 1:range
    u_norm_new(i) = 0.0;
    u_norm_new(182-i) = 0.0;
end

% Classification step
ref = AVG;                 % offline reference signal, assumed to be in the workspace already
sample = u_norm_new;
r = 60;
for i = 1:r
    sample(i) = 0.0;
    sample(182-i) = 0.0;
    ref(i) = 0.0;
    ref(182-i) = 0.0;
end
resultis = sum(abs(sample - ref))
if resultis > 0.04         % THB
    disp('Classified as group of humans');
else
    disp('Others!');
end

APPENDIX B

IMAGES USED IN THE DATABASE

The five images, (a) to (e), used to calculate the graph of AVG