A technical review of image processing and computer vision techniques to implement real-time video analytics

Maryam Majareh
mm12e10@ecs.soton.ac.uk
University of Southampton

ABSTRACT
This paper presents a technical review of computer vision techniques used in real-time video processing. The domain focuses on the assessment of human behaviour in crowded scenes such as train stations, airports and parking lots. The technology aims to transform a basic video camera feed into a live learning and detection tool in order to process video frames. The main objective is to detect activities such as abandoned objects, illegally parked cars, trespassing and even remote biometrics. The research community faces a number of challenges in processing video frame sequences, including background subtraction, object (blob) segmentation, sequence feature extraction and AI modelling, all of which are actively being investigated at present. The paper reviews a wide range of investigation domains involved in the real-time processing and modelling of images, from image acquisition and processing to device calibration, segmentation and artificial intelligence (AI)-based modelling.

Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene Analysis – computing methodologies, artificial intelligence, computer vision, image and video acquisition, motion capture.

General Terms
Algorithms, Measurement, Performance, Experimentation, Security, Human Factors, Standardization, Verification.

Keywords
Computer Vision, Image Processing, Video Analytics, Machine Learning

1. INTRODUCTION
Computer vision is regarded as the domain that deals with the processing of image-based data by a computer. The domain comprises a number of core phases, including image acquisition, processing and classification [1]. Real-time video-feed processing is a practical example of image processing in which image sequences from a video source such as a CCTV camera are extracted and manipulated in order to obtain useful information. This information can range from the number plates of high-speed motor vehicles to the faces of pedestrians entering a building hallway.

Background subtraction involves training a specialised model to detect foreground objects against a static background captured by a static camera. Blob tracking involves the use of image processing techniques to isolate and bound foreground objects. Sequence feature extraction involves processing temporally distributed video frames to gain an understanding of the various foreground objects present within them. As the frames are time-based and context-sensitive, the core information is extracted at two distinct stages: first, the frames are pre-processed via suitable image processing techniques to efficiently extract regions of interest (ROIs); second, robust artificial intelligence (AI) routines such as Hidden Markov Models, Bayesian learning or neural networks are used to model and train detection classifiers.

Given the level of investigation ongoing in this area, this paper presents a review of the current state of the art in image processing, video analytics and computer vision. In doing so, the review presents existing research and the ongoing challenges faced by the research community. The paper also outlines future directions for each of the three core areas of background subtraction, object segmentation and frame-based video tracking.
Contrary to humans' outstanding ability to calibrate and model real-world video scenes, the processing and classification of vision-based data on existing hardware is an extremely cumbersome task. An image in computing is regarded as a two-dimensional function f(x, y), where the amplitude at any pixel within the image is called the intensity or grey-scale level at that point. For a colour RGB image, this grey-scale level divides into three channels (red, green and blue), each represented by an 8-bit binary value ranging from 0 to 255. A pixel with an RGB representation of (255, 0, 0) therefore appears red, since the remaining two channels hold zero values. The processing of these finite values by a digital computer is called digital image processing. Based on the immediate application, image processing generally comprises two main categories:
- Enhancing or optimising image quality for human viewing
- Preparing images for computer vision-based feature extraction

The scope of this paper addresses the latter category, in which ongoing research into geometrical composition, relevant measurements and image interpretation is analysed in order to critically discuss the current state of the art of the domain. The paper is primarily divided into three core sections. Section 2 discusses the domain of image processing and its current state of knowledge in the analysis of raw camera-based images. Section 3 discusses the challenges and limitations of extracting real-time video-based images from live camera feeds for real-world applications. Section 4 proposes and discusses a new frontier of temporal computer vision techniques based on a novel infrared depth-sensing domain, to be used for various applications. The paper concludes with a discussion of future extensions and applications of these techniques.

2. IMAGE PROCESSING AND ANALYSIS
Image segmentation is regarded as one of the first pre-processing stages of image preparation, making an image legible enough for a computing system to extract important features from it. The core stages of an image analysis system can be divided into the following sub-stages:

2.1 Image pre-processing
This phase is primarily used to improve the quality of an image by removing noise caused by factors such as uneven light intensity, dirt and poor device quality. Digital images in particular are prone to different noise types, which result in pixel intensities that do not reflect the true intensities of the original scene. Noise can be introduced into a scene in several ways:
- Images scanned from photographic films generally contain noise due to the presence of grains in the film material. Images acquired with low-quality scanning equipment generally exhibit such noise.
- Scanners and damaged film may also lead to poor image quality with a low signal-to-noise ratio (SNR); images acquired from old library records are one example that suffers the most from this kind of noise.
- If the image data passes through electronic transmission, noise may be introduced by in-built compression mechanisms; images taken from devices that apply JPEG compression, such as digital cameras, contain noise due to lossy compression of the image data.
- Finally, if an image is acquired directly in digital format, the data-gathering mechanism itself may introduce noise.

The effects of various noise types are shown in Figure 1, where (a) shows an original lab image taken with a Samsung Galaxy S3 phone, (b) shows the image corrupted with zero-mean Gaussian white noise with a variance of 0.01, (c) shows the image corrupted with Poisson noise with a mean of 10 and (d) shows the image with salt & pepper noise at 0.05 pixel density. The images were created via the "imnoise" function provided by Matlab 2012a.

Figure 1: A simulated comparison of various noise types via Matlab Image Processing-based noise induction: (a) original image, (b) Gaussian noise, (c) Poisson noise, (d) salt & pepper noise
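Although Figure 1 was generated with Matlab's imnoise, the three noise models are standard and straightforward to reproduce. Below is a minimal NumPy sketch approximating their behaviour (not Matlab's exact implementation); it assumes a grey-scale image normalised to [0, 1], and the function names are illustrative.

```python
import numpy as np

def gaussian_noise(img, mean=0.0, var=0.01):
    """Zero-mean Gaussian white noise (cf. imnoise(I, 'gaussian', 0, 0.01))."""
    return np.clip(img + np.random.normal(mean, np.sqrt(var), img.shape), 0.0, 1.0)

def poisson_noise(img):
    """Poisson (shot) noise: treat scaled intensities as photon counts and resample."""
    return np.clip(np.random.poisson(img * 255.0) / 255.0, 0.0, 1.0)

def salt_pepper_noise(img, density=0.05):
    """Salt & pepper noise corrupting approximately `density` of all pixels."""
    noisy = img.copy()
    r = np.random.rand(*img.shape)
    noisy[r < density / 2] = 0.0                     # pepper (black)
    noisy[(r >= density / 2) & (r < density)] = 1.0  # salt (white)
    return noisy
```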
2.2 Image noise removal
A wide range of image noise removal techniques is reported in the literature [3]; a short sketch applying the first two techniques is given at the end of this subsection.

- Linear filtering: the technique is used to eliminate certain types of noise via Gaussian or averaging filters, removing noise by suppressing or enhancing certain spatial frequencies within an image [2].
- Median filtering: median filters are generally used to remove impulsive noise, owing to their ability to preserve edge information and step-wise discontinuities in the signal.
- Adaptive filtering: adaptive linear filters work on the concept of extracting the desired information (the actual image) via an estimation operation. According to [3], an adaptive linear filter is generally used not only to remove noise but also to perform channel filtering.

Image enhancement is generally achieved via the following core methodologies [2]:
- Removal of additive noise and interference
- Elimination of multiplicative interference
- Regulation of image contrast
- Reduction of blurring

A number of methods are used for noise removal, including smoothing via low-pass filtering, sharpening via high-pass filtering, histogram equalisation and generic deblurring algorithms.
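As a concrete illustration of the linear and median filters described above, the following OpenCV sketch applies both to a noisy grey-scale frame; the input file name is hypothetical.

```python
import cv2

img = cv2.imread("noisy_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input

averaged = cv2.blur(img, (5, 5))             # linear filtering: 5x5 averaging kernel
gaussian = cv2.GaussianBlur(img, (5, 5), 0)  # linear filtering: Gaussian kernel
median   = cv2.medianBlur(img, 5)            # median filtering: removes impulsive
                                             # (salt & pepper) noise, preserves edges
```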
2.3 Image segmentation
With a pre-processed image, the next stage in image processing is the segmentation of the region of interest (ROI). An ROI in an image can contain any type of element, ranging from humans [4] to a wide array of non-living objects, such as luggage moving over a conveyor belt or vehicles for the purpose of licence plate recognition [5]. Nonetheless, image segmentation of real-world objects suffers from a completely different array of challenges compared to noise removal. Images taken in open environments are never the same: a picture taken at a certain time of the day generally differs from one taken under different conditions such as cloud cover, time of day, moving trees or other objects. These challenges generally divide the area of image segmentation into two distinct domains – static image segmentation, with no background information available, and dynamic image segmentation based on a sequence of video images.

Based on the type of image segmentation case being addressed, the following techniques are generally performed to extract foreground pixels from the background data.

2.3.1 Edge detection kernels
The purpose of edge detection is to extract the outlines of different regions in an image [2]. This technique can be applied to both the static and dynamic segmentation cases. The objective is to divide an image into a set of ROIs based on brightness or colour similarities. One of the simplest methods of segmentation is the application of histogram equalisation or a thresholding technique over an image. This is generally achieved by plotting or grouping pixels on the basis of their specific intensity values. Conceptually, an image histogram is a probability density function (PDF) of a grey-scale image.

Figure 2: An intensity histogram (b) of the lab-view grey-scale image shown in (a)

It can be seen from Figure 2 that the right-hand portion of the image in (a) contains a fairly high number of pixels (> 1200) lying in the higher intensity range, whereas the left-hand portion mainly contains darker pixels due to the presence of the monitor, a lower-intensity wall section and the bag. This concept of "histogramming" has routinely been used in applications where certain objects within a complex background are to be extracted according to an underlying intensity criterion; it is frequently used, for example, in character segmentation for optical character recognition [6]. The adaptively thresholded image created from the histogram profile shown in Figure 2 is shown in Figure 3, and a minimal sketch of the operation is given below.

Figure 3: A binary image created from the intensity profile shown in Figure 2(a), thresholded adaptively at a median calculated via the histogram shown in Figure 2(b)
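The median-based thresholding behind Figure 3 can be expressed in a few lines. The sketch below recovers the median grey level from the intensity histogram and binarises the image at that level; the file name is hypothetical, and this is one plausible reading of the operation rather than the exact code used to produce the figure.

```python
import cv2
import numpy as np

gray = cv2.imread("lab_view.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input

# Median grey level recovered from the 256-bin intensity histogram (cf. Figure 2(b))
hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
cdf = np.cumsum(hist)
median_level = np.searchsorted(cdf, cdf[-1] / 2)

# Binarise: pixels brighter than the median become foreground (cf. Figure 3)
_, binary = cv2.threshold(gray, float(median_level), 255, cv2.THRESH_BINARY)
```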
However, the domain becomes further challenging when a degree of dynamism is introduced into the image by virtue of it being part of a sequence of frames gathered from a generic or CCTV camera. Images taken in this way continuously change their pixel-level intensities, causing hard-threshold-based histogram techniques such as those stated above to fail. As discussed before, these changes generally occur due to different times of day, variable cloud cover, occlusions and dynamic foreground pixels. Dynamic foreground pixels generally occur due to the presence of moving objects that are part of a video image sequence; these can be trees, waves or even sand particles in an ensuing dust storm. The key state-of-the-art challenge faced by the research community in the segmentation of such images therefore comes from these pixels, which act as part of the foreground but are, in effect, to be eliminated as background pixels. The next section discusses the various "background subtraction" methodologies that have recently been employed in the literature to solve the issue of foreground modelling in the presence of dynamic background pixels.

3. IMAGE PROCESSING IN DYNAMIC VIDEO-BASED FRAMES
Predominantly termed "background subtraction", the technique is increasingly being used in real-time, video-frame-based image segmentation to detect, subtract and segment critical ROIs such as moving vehicles and individuals. Due to the presence of moving background objects such as trees, the classification of various ROIs in images requires careful modelling to minimise false alarms. The situation is further complicated when the image contains sudden intensity variations such as shadows, occlusions and objects moving at variable speeds [7]. A variety of techniques, each with its own limitations and benefits, have been used in recent literature to robustly locate foreground pixels, as discussed below.

3.1.1 Background modelling via Gaussian mixture models
A robust background-modelling methodology aims at the construction of a model capable of eliminating dynamic background objects while efficiently keeping track of genuine foreground objects over a temporal sequence of video frames. Gaussian mixture models (GMMs) are among the oldest methods used to learn from time-based pixel variations. [8] utilised a probabilistic GMM architecture to train each pixel based on its intensity variations over time; the methodology was further extended by [9] to include statistical Bayesian modelling-based artificial intelligence (AI). However, the two approaches suffered from two major setbacks: first, the models could not incorporate object shadows as background pixels, and second, a model trained for slow intensity variations would fail under abrupt intensity changes, and vice versa. Figure 4 shows a sample video sequence taken from the ChangeDetection repository, where the standard GMM algorithm implemented in an OpenCV installation fails on the bus-station video [10, 11]. [12] did incorporate a time-adaptive system in which the pixels were able to integrate variable intensity rates. To further improve the technique, [13] adopted a hierarchical approach integrating colour and gradient information in order to differentiate overlapping foreground and background pixels with matching intensity and colour profiles. Yet the issue of shadow incorporation remained largely unsolved in most of these GMM-based models. A minimal sketch of GMM-based subtraction with OpenCV is given below.

Figure 4: An implementation of the MNRL-based codebook algorithm given in [7] via the OpenCV library, presenting the inherent weaknesses of GMM-based background segmentation when evaluated against a benchmarking video taken from [10, 11]
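For reference, a minimal OpenCV sketch of GMM-based background subtraction follows. It uses the library's built-in MOG2 subtractor, a descendant of the per-pixel mixture approach of [8], rather than the exact models of the cited papers; the video path is hypothetical.

```python
import cv2

cap = cv2.VideoCapture("bus_station.avi")  # hypothetical benchmark video
# Per-pixel Gaussian mixture model; `history` controls the adaptation rate,
# `detectShadows` marks shadow pixels in grey rather than as foreground.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(30) & 0xFF == 27:   # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```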
3.1.2 Code-book-based background subtraction
The issues of shadows and abrupt intensity variations were predominantly addressed by another genre of algorithms based on a pixel-level, time-based codebook methodology. The technique keeps a record of the intensity-variation behaviour of pixels in a time-based codebook. Perhaps the most groundbreaking implementation in this domain is that of [7], who introduced a measure termed the maximum negative run-length (MNRL). The algorithm classifies a pixel's behaviour by learning its rate of change over a set period of frames, and thereby keeps a codebook recording a number of parameters for each codeword (a simplified sketch of this bookkeeping follows the list):
- The minimum and maximum brightness
- The frequency with which the codeword has occurred in the database
- The maximum negative run-length
- The first and last access times of the codeword

The technique has presented promising outcomes in the domain of background subtraction, particularly in the modelling of highly changing scenes such as traffic videos, pedestrian motion tracking and even gesture and gait recognition [4, 9, 14-18].
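A simplified sketch of the per-pixel codeword bookkeeping described in [7] is given below. Real implementations add colour-distortion and brightness tests, and a wrap-around term in the MNRL; this sketch only illustrates how the listed parameters are maintained, and all names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Codeword:
    i_min: float  # minimum brightness observed for this codeword
    i_max: float  # maximum brightness observed
    freq: int     # how often the codeword has matched
    mnrl: int     # maximum negative run-length: longest gap between matches
    first: int    # frame index of first access
    last: int     # frame index of last access

def update(cw: Codeword, intensity: float, t: int) -> None:
    """Record a match of `cw` at frame t, maintaining the MNRL."""
    cw.i_min = min(cw.i_min, intensity)
    cw.i_max = max(cw.i_max, intensity)
    cw.freq += 1
    cw.mnrl = max(cw.mnrl, t - cw.last)  # longest interval with no match so far
    cw.last = t

# After training over N frames, a codeword with a large MNRL was inactive for
# long stretches and is pruned as a transient rather than kept as background.
```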
4. ANALYSIS OF RECENT TECHNOLOGICAL ADVANCEMENTS IN IMAGE PROCESSING
Yet the biggest shortcoming of the majority of histogram-, GMM- and codebook-based algorithms lies in their capability to process only a 2D realisation of an image. With rapidly changing technologies, the advent of 3D scanners did introduce a sense of novelty and promise to the image and video processing domain; however, the overwhelmingly tedious process of calibration and the need for willing subjects severely limited their usage in real-time image processing. Moreover, scenes captured via moving cameras require the further overhead of separate models for each camera position in order to efficiently differentiate foreground pixels. The current state of the art thus substantially lacks in terms of moving-camera object recognition in the absence of a robust, supervised AI model.

With the latest induction of infra-red sensing devices such as the Microsoft Kinect, the domain of background subtraction has taken on a new aspect, in which pixels are realised not merely in a 2D intensity domain but in a 3D point-cloud space, where distance can be measured and modelled with respect to the infra-red camera present on the device itself. The technology has already revolutionised the XBOX gaming domain and, with the launch of the Windows-based Kinect version in February 2012 along with its SDK, it is now possible for conventional programmers to access the depth-map and sensing APIs for a wide range of real-world applications, including gesture recognition, motion sensing, film and animation, and high-resolution 3D surface regeneration.

Work in the domain of point-cloud processing for graphical reconstruction, with the objective of 3D surface matching, has increasingly been used to compare and identify objects such as human faces, vehicles and aerial scans as 3D surface plots [19, 20]. The field is increasingly finding applications in forensics [21] and is very likely to be extended to real-world applications of multi-dimensional aerial scanning [22], beyond-visual-spectrum biometrics [23], fire detection in smoke [24], industrial condition monitoring [25] and, most importantly, medical and surgical applications of tumour detection, advanced magnetic resonance imaging (MRI) and gait-analysis-based physical abnormality detection [26, 27].

Despite the promising nature of depth-sensing, infra-red and thermographic devices in computer vision, the technology is still not used substantially in everyday real-world settings. However, as discussed earlier, with the advent of low-cost depth-sensing devices such as the Microsoft Kinect, the domain can now be explored for everyday touch-free applications. Figure 5 presents samples of (a) skeletal joint mapping, (b) a Delaunay triangulation used to capture a 3D face wireframe, (c) a grey-scale depth profile from the Kinect sensor for distance measurement and (d) a thermograph capturing temperature information from distant objects.

Figure 5: Diagrammatic representation of a Kinect depth-map profile, with closer objects such as the hand shown with intensity values closer to 255 and more distant objects mapped to lower grey intensities

Feature vectors from the streams shown in Figure 5 can be used in a wide range of real-world applications, including sign language recognition [28], gait identification [29], touch-free biometrics, 3D face recognition [30] and zero-visibility motion sensing via infra-red sensing [31]. Moreover, as the device's uniqueness lies in its single-directional capability, it is possible to embed the technology in future handheld devices such as smart phones and tablets. Such integration is likely to introduce opportunities in 3D photography, the animation and film industry, robotics, augmented reality, education and virtual reality. Ultimately, the main limitation of the current state of the art lies in the computational capability of conventional handheld hardware, which is still catching up with the high-quality rendering involved in multidimensional processing. A small sketch of the depth-to-intensity mapping used in Figure 5(c) is given below.
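To make the depth-to-intensity mapping of Figure 5(c) concrete, the sketch below converts a raw depth frame, assumed to be 16-bit millimetre values as exposed by typical Kinect-style APIs, into an 8-bit grey image in which nearer objects appear brighter; the synthetic frame merely stands in for a real depth stream.

```python
import numpy as np

def depth_to_gray(depth_mm: np.ndarray, max_range_mm: int = 4000) -> np.ndarray:
    """Map a 16-bit depth frame (mm) to 8-bit grey: nearer pixels -> closer to 255."""
    d = np.clip(depth_mm, 0, max_range_mm).astype(np.float32)
    gray = 255.0 * (1.0 - d / max_range_mm)  # invert so the nearest point is white
    gray[depth_mm == 0] = 0                  # 0 = no reading; render as black
    return gray.astype(np.uint8)

# Example with a synthetic frame standing in for a Kinect depth stream:
frame = np.full((480, 640), 3000, dtype=np.uint16)  # background ~3 m away
frame[200:280, 300:380] = 800                       # a "hand" at ~0.8 m
gray = depth_to_gray(frame)                         # the hand renders near-white
```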
5. CONCLUSION
This paper has presented a detailed analysis of the core concepts of image processing and segmentation in real-world applications. Having discussed these, the review moved to dynamic, video-based image processing, where the majority of recent investigations are now concentrated. A detailed review of video acquisition and processing techniques, set against the backdrop of the recent depth-processing and 3D pixel-cloud abilities of released hardware, presents a wide and promising set of applications. Most importantly, a 3D infrared depth map is expected to provide a set of features that, if combined with the latest AI techniques, is likely to increase the overall detection and classification accuracy of existing systems. As the camera itself does not require multiple viewpoints, it is envisaged that the future integration of this camera into mobile devices and smart phones will revolutionise the way pictures are taken with handheld devices. Moreover, a further integration of infrared-based thermographs is foreseen to completely change the remote diagnosis and treatment of patients: the technology is very likely to enable a GP, or even artificial diagnosis software on a smart phone, to detect and identify body-temperature changes, breathing problems, and heart and pulse rates merely by non-invasive, touch-free body scans. Moreover, in the industrial domain, real-time sparse point clouds can be compared to reference point clouds of a machine's motion to pre-emptively diagnose operational anomalies such as excess vibrations or abnormal noise patterns. To wrap up, the depth-scan and 3D sensing capabilities built into "single-directional" devices like the Kinect are widely expected to benefit a wide range of real-world domains.

6. REFERENCES
[1] Szeliski, R. Computer Vision: Algorithms and Applications. Texts in Computer Science. Springer, London/New York, 2011.
[2] Petrou, M. and Petrou, C. Image Processing: The Fundamentals. 2nd ed. Wiley, Chichester, 2010.
[3] Vaseghi, S.V. Advanced Digital Signal Processing and Noise Reduction. 4th ed. J. Wiley & Sons, Chichester, 2008.
[4] Moeslund, T.B., et al. Visual Analysis of Humans: Looking at People. Springer-Verlag, London, 2011.
[5] Chang, S.L., et al. Automatic license plate recognition. IEEE Transactions on Intelligent Transportation Systems, 2004, 5(1).
[6] Rice, S.V., Nagy, G. and Nartker, T.A. Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Publishers, Boston, 1999.
[7] Kim, K., et al. Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 2005, 11(3).
[8] Stauffer, C. and Grimson, W.E.L. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 747-757.
[9] Lee, D.S., et al. A Bayesian framework for Gaussian mixture background modeling. In Proc. 2003 International Conference on Image Processing, Vol. 3, 2003: 973-976.
[10] ChangeDetection. ChangeDetection Video Database. 2012 [cited 19 August 2012]; available from: http://www.changedetection.net/.
[11] Goyette, N., et al. changedetection.net: A new change detection benchmark dataset. In Proc. IEEE Workshop on Change Detection (CDW'12), Providence, RI, 2012.
[12] Harville, M. A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models. In Computer Vision – ECCV 2002, Part III, LNCS 2352, 2002: 543-560.
[13] Javed, O., Shafique, K. and Shah, M. A hierarchical approach to robust background subtraction using color and gradient information. In Proc. IEEE Workshop on Motion and Video Computing (Motion 2002), 2002: 22-27.
[14] Buch, N., Velastin, S.A. and Orwell, J. A review of computer vision techniques for the analysis of urban traffic. IEEE Transactions on Intelligent Transportation Systems, 2011, 12(3).
[15] Cristani, M., Bicego, M. and Murino, V. Integrated region- and pixel-based approach to background modelling. In Proc. IEEE Workshop on Motion and Video Computing (Motion 2002), 2002: 3-8.
[16] Ilyas, A., et al. Real time foreground-background segmentation using a modified codebook model. In Proc. 6th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS 2009), 2009: 454-459.
[17] Diamantopoulos, G. and Spann, M. Event detection for intelligent car park video surveillance. Real-Time Imaging, 2005, 11(3): 233-243.
[18] Xiang, T. and Gong, S.G. Video behaviour profiling and abnormality detection without manual labelling. In Proc. 10th IEEE International Conference on Computer Vision (ICCV 2005), Beijing, China, 2005.
[19] Pauly, M., Keiser, R. and Gross, M. Multi-scale feature extraction on point-sampled surfaces. Computer Graphics Forum, 2003, 22(3).
[20] Schnabel, R., Wahl, R. and Klein, R. Efficient RANSAC for point-cloud shape detection. Computer Graphics Forum, 2007, 26(2).
[21] Vanezis, P., et al. Facial reconstruction using 3-D computer graphics. Forensic Science International, 2000, 108(2).
[22] Guo, L., et al. Relevance of airborne lidar and multispectral image data for urban scene classification using Random Forests. ISPRS Journal of Photogrammetry and Remote Sensing, 2011, 66(1).
[23] Moreno-Moreno, M., Fierrez, J. and Ortega-Garcia, J. Biometrics beyond the visible spectrum: imaging technologies and applications. In Biometric ID Management and Multimodal Communication, Proceedings, LNCS 5707, 2009.
[24] Kolaric, D., Skala, K. and Dubravic, A. Integrated system for forest fire early detection and management. Periodicum Biologorum, 2008, 110(2).
[25] Omar, M., Kuwana, K. and Saito, K. The use of infrared thermograph technique to investigate welding related industrial fires. Fire Technology, 2007, 43(4).
[26] Lee, M.-Y. and Yang, C.-S. Entropy-based feature extraction and decision tree induction for breast cancer diagnosis with standardized thermograph images. Computer Methods and Programs in Biomedicine, 2010, 100(3).
[27] Selvarasu, N., et al. Abnormality detection from medical thermographs in humans using Euclidean distance based color image segmentation. In Proc. 2010 International Conference on Signal Acquisition and Processing (ICSAP 2010), 2010.
[28] Keskin, C., et al. Real time hand pose estimation using depth sensors. In Proc. 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011.
[29] Stone, E. and Skubic, M. Evaluation of an inexpensive depth camera for in-home gait assessment. Journal of Ambient Intelligence and Smart Environments, 2011, 3(4).
[30] Mahoor, M.H. and Abdel-Mottaleb, M. A multimodal approach for face modeling and recognition. IEEE Transactions on Information Forensics and Security, 2008, 3(3).
[31] Elangovan, V. and Shirkhodaie, A. Recognition of human activity characteristics based on state transitions modeling technique. In Proc. Conference on Signal Processing, Sensor Fusion, and Target Recognition XXI, Baltimore, MD, 2012.