A technical review of image processing and computer
vision techniques to implement real-time video analytics
Maryam Majareh
mm12e10@ecs.soton.ac.uk
University of Southampton
ABSTRACT
This paper presents a technical review of various computer vision techniques used in real-time video processing. The domain focuses on the assessment of human behaviour in crowded scenes such as train stations, airports and parking lots. The technology aims to transform a basic video camera feed into a live learning and detection tool by processing video frames. The main objective is to detect activities such as abandoned objects, illegally parked cars, trespassing and even remote biometrics. The research community faces a number of challenges in processing video frame sequences, including background subtraction, object (blob) segmentation, sequence feature extraction and AI modelling, all of which are actively being investigated at present.
Categories and Subject Descriptors
I.4.8 [Scene Analysis]: Image Processing and Computer Vision – computing methodologies, artificial intelligence, computer vision, image and video acquisition, motion capture.
General Terms
Algorithms, Measurement, Performance, Experimentation, Security, Human Factors, Standardization, Verification.
Keywords
Computer Vision, Image Processing, Video Analytics, Machine Learning.
1. INTRODUCTION
The paper reviews a wide range of investigation domains involved in the real-time processing and modelling of images, from image acquisition and processing to device calibration, segmentation and artificial intelligence (AI)-based modelling. Computer vision is regarded as the domain that deals with the processing of image-based data by a computer. The domain comprises a number of core phases, including image acquisition, processing and classification [1]. Real-time video-feed processing is a practical example of image processing in which image sequences from a video source, such as a CCTV camera, are extracted and manipulated in order to obtain useful information. This information can range from the number plates of high-speed motor vehicles to the faces of pedestrians entering a building hallway.
Background subtraction involves training a specialised model to detect foreground objects against a static background captured by a static camera. Blob tracking involves the use of image processing techniques to isolate and bound foreground objects. Sequence feature extraction involves processing temporally distributed video frames to gain an understanding of the foreground objects present within them. As the frames are time-based and context-sensitive, the core information is extracted at two distinct stages: firstly by pre-processing the frames via suitable image processing techniques to efficiently extract regions of interest (ROIs), and secondly by utilising robust artificial intelligence (AI) routines, such as Hidden Markov Models, Bayesian learning or neural networks, to model and train detection classifiers.
Given the level of active investigation in this area, this paper presents a review of the current state-of-the-art in the fields of image processing, video analytics and computer vision. In doing so, the review presents existing research and the ongoing challenges faced by the research community. The paper also suggests future directions for each of the three core areas of background subtraction, object segmentation and frame-based video tracking.
In contrast to humans' outstanding ability to calibrate and model real-world video scenes, the processing and classification of computer vision-based data on existing vision hardware is an extremely cumbersome task. An image in computing is regarded as a two-dimensional function f(x, y), where the amplitude at any pixel is called the intensity or grey-scale level of the image at that point. For a colour RGB image, this level divides into three channels (red, green and blue), each represented by an 8-bit value ranging from 0 to 255. A pixel with an RGB value of (255, 0, 0) therefore appears red, because the remaining two channels contain zero values. The processing of these finite values by a digital computer is called digital image processing. Based on the immediate application, image processing generally falls into two main categories:

• Enhancing or optimising image quality for human viewing
• Preparing images for computer vision-based feature extraction
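As an illustrative aside not drawn from the paper, the pixel representation described above can be made concrete in a few lines of Python with NumPy; the grey-scale weights used here are the common ITU-R BT.601 luma convention, an assumption rather than anything the paper specifies.

import numpy as np

# A 2x2 RGB image as a 3-D array f(x, y, channel); dtype uint8 gives the
# 8-bit 0-255 range per channel described above.
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = [255, 0, 0]      # pure red: the green and blue channels are zero
img[0, 1] = [128, 128, 128]  # mid-grey: all three channels are equal

# One common grey-scale (intensity) conversion is a weighted sum of the
# three channels using the BT.601 luma weights.
grey = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
print(grey.astype(np.uint8))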
The scope of this paper addresses the latter category of image processing, in which ongoing research into geometrical composition, relevant measurements and image interpretation is analysed to critically discuss the current state-of-the-art of the domain.
This paper is primarily divided into three core sections. Section 2 discusses the domain of image processing and the current state of knowledge in the analysis of raw camera-based images. Section 3 discusses the challenges and limitations of extracting real-time video-based images from live camera feeds for real-world applications. Section 4 discusses a new frontier of temporal computer vision techniques, based on the emerging infrared depth-sensing domain, for a variety of applications. The paper concludes with a discussion of future extensions and applications of these techniques.
2. IMAGE PROCESSING AND ANALYSIS
Image segmentation is regarded as among the first pre-processing stages of image preparation, making an image legible enough for a computing system to extract important features from it. The core stages of an image analysis system can be divided into the following three sub-stages:
2.1 Image pre-processing
This phase is primarily used to improve the quality of an image by removing noise caused by factors such as uneven light intensity, dirt and poor device quality. Digital images in particular are prone to different noise types, which result in pixel intensities that do not reflect the true intensities of the original scene. Noise can be introduced into a scene in several ways, as follows:

• Images scanned from photographic film generally contain noise due to the presence of grain in the film material; images acquired with low-quality scanning equipment are particularly affected.
• Scanners and damaged film may also lead to poor image quality with a low signal-to-noise ratio (SNR). Images acquired from old library records are one example that suffers most from this kind of noise.
• If image data is sent over an electronic transmission channel, noise may be introduced by the built-in compression mechanisms. Images taken from JPEG compression devices such as digital cameras contain noise due to lossy compression of the image data.
• Finally, if an image is acquired directly in digital format, the data-gathering mechanism itself may introduce noise.
Figure 1: A simulated comparison of various noise types via Matlab Image Processing Toolbox noise induction: (a) original image, (b) Gaussian noise, (c) Poisson noise, (d) salt & pepper noise

The effects of the various noise types are shown in Figure 1, where (a) shows an original lab image taken with a Samsung Galaxy S3 phone, (b) shows the image corrupted with zero-mean Gaussian white noise of variance 0.01, (c) shows the image corrupted with Poisson-distributed noise with a mean of 10 and (d) shows the image with salt & pepper noise at a pixel density of 0.05. The noisy images were created via the "imnoise" function provided by Matlab 2012a.
2.2 Image noise removal
Image enhancement is generally achieved via the following core methodologies [2]:
• removal of additive noise and interference,
• elimination of multiplicative interference,
• regulation of image contrast, and
• reduction of blurring.
A number of methods are used for noise removal, including smoothing via low-pass filtering, sharpening via high-pass filtering, histogram equalisation and generic de-blurring algorithms. A wide range of noise removal techniques is reported in the literature [3]:
• Linear filtering: eliminates only certain types of noise, via Gaussian or averaging filters, by removing or enhancing certain spatial frequencies within an image [2].
• Median filtering: generally used to remove impulsive noise, owing to its ability to preserve edge information and step-wise discontinuities in the signal.
• Adaptive filtering: adaptive linear filters work on the concept of extracting the desired information (the actual image) via an estimation operation. According to [3], an adaptive linear filter can be used not only to remove noise but for channel filtering as well.
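As a brief, hedged sketch of the contrast between linear and median filtering (the paper itself presents no code), the following Python/OpenCV fragment corrupts a grey-scale image with salt & pepper noise at 0.05 pixel density, as in Figure 1(d), and then applies both filter types; the file name and kernel sizes are placeholder assumptions.

import cv2
import numpy as np

# Load a grey-scale test image (the path is a placeholder).
img = cv2.imread("lab_scene.png", cv2.IMREAD_GRAYSCALE)

# Simulate salt & pepper noise at 0.05 pixel density, as in Figure 1(d).
noisy = img.copy()
mask = np.random.rand(*img.shape)
noisy[mask < 0.025] = 0      # pepper (black impulses)
noisy[mask > 0.975] = 255    # salt (white impulses)

# Linear (Gaussian) filtering suppresses the impulses but also softens edges.
smoothed = cv2.GaussianBlur(noisy, (5, 5), sigmaX=1.0)

# Median filtering removes the impulses while preserving step discontinuities.
cleaned = cv2.medianBlur(noisy, 5)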
2.3 Image segmentation
With a pre-processed image, the next stage in image processing is the segmentation of the region of interest (ROI). The ROI in an image can contain elements ranging from humans [4] to a wide array of non-living objects, such as luggage moving over a conveyor belt, or even vehicles for the purpose of license plate recognition [5].
Nonetheless, segmenting real-world objects poses a completely new array of challenges compared to noise removal. Images taken in open environments are never the same: a picture taken at a certain time of day generally differs from one taken under different conditions such as cloud cover, time of day or moving trees and other objects. These challenges divide the area of image segmentation into two distinct domains: static image segmentation, with no background information available, and dynamic image segmentation, based on a sequence of video images.
Depending on the type of segmentation case being addressed, the following section presents a number of techniques that are generally used to extract foreground pixels from the background data:
2.3.1 Edge detection kernels
The purpose of edge detection is to extract the outlines of different regions in an image [2]. This technique can be applied equally to the static and dynamic segmentation cases. The objective is to divide an image into a set of ROIs based on brightness or colour similarities.
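A minimal sketch of an edge-detection kernel pass, using Python with OpenCV (the paper names no implementation; the input path and the magnitude threshold of 100 are illustrative assumptions):

import cv2
import numpy as np

img = cv2.imread("lab_scene.png", cv2.IMREAD_GRAYSCALE)

# Horizontal and vertical Sobel kernels approximate the intensity gradient.
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

# Large gradient magnitudes mark the outlines (edges) of image regions.
magnitude = np.sqrt(gx ** 2 + gy ** 2)
edges = (magnitude > 100).astype(np.uint8) * 255  # hard threshold, tunable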
One of the simplest methods of segmentation is the application of histogram equalisation or a thresholding technique over an image. This is generally achieved by plotting or grouping pixels on the basis of their intensity values. Conceptually, an image histogram is a probability density function (PDF) of a grey-scale image.
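A minimal sketch of this idea, assuming a grey-scale input and thresholding at the median grey level derived from the histogram (the scheme described in the Figure 3 caption below):

import cv2
import numpy as np

img = cv2.imread("lab_scene.png", cv2.IMREAD_GRAYSCALE)

# A 256-bin intensity histogram: an empirical estimate of the grey-level PDF.
hist = cv2.calcHist([img], [0], None, [256], [0, 256]).ravel()

# Find the median grey level from the cumulative histogram and binarise.
median_level = np.searchsorted(np.cumsum(hist), img.size / 2)
binary = (img > median_level).astype(np.uint8) * 255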
Figure 2: An intensity histogram (b) of the lab-view grey-scale image shown in (a)

It can be seen from Figure 2 that the right-hand side of the image in (a) contains a fairly high number of pixels (> 1200) lying in the higher intensity range, whereas the left-hand side mainly contains darker pixels due to the presence of the monitor, the darker portion of the wall and the bag. This concept of "histogramming" has routinely been used in applications where certain objects within a complex background are to be extracted according to an underlying intensity criterion. The concept is frequently used in applications such as character segmentation in the domain of optical character recognition [6]. The adaptively thresholded image created from the histogram profile shown in Figure 2 is shown in Figure 3.

Figure 3: A binary image created from the intensity profile of the image shown in Figure 2(a), thresholded adaptively at a median calculated via the histogram shown in Figure 2(b)

However, the domain becomes more challenging when a degree of dynamism is introduced because the image is part of a sequence of frames gathered from a generic or CCTV camera. Such images continuously change their pixel-level intensities, causing hard-threshold-based histogram techniques like those stated above to fail. As discussed before, these changes generally occur due to different times of day, variable cloud cover, occlusions and dynamic foreground pixels. Dynamic foreground pixels generally arise from moving objects that form part of a video image sequence: trees, waves or even sand particles in an ensuing dust storm.
The central challenge faced by the research community in the segmentation of such images therefore comes from pixels that act as part of the foreground but should, in effect, be eliminated as background. The next section discusses various "background subtraction" methodologies that have recently been employed in the literature to solve the problem of foreground modelling in the presence of dynamic background pixels.
3. IMAGE PROCESSING IN DYNAMIC VIDEO-BASED FRAMES
Predominantly termed "background subtraction", the technique is increasingly used in real-time, video-frame-based image segmentation to detect, subtract and segment critical ROIs such as moving vehicles and individuals. Due to the presence of moving background objects such as trees, the classification of ROIs in images requires careful modelling to minimise false alarms. The situation is further complicated when the image contains sudden intensity variations such as shadows, occlusions and objects moving at variable speeds [7]. A variety of techniques, each with its own limitations and benefits, has been used in the recent literature to robustly locate foreground pixels, as discussed below.
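Before turning to the model-based approaches, it is worth stating the naive baseline they improve upon. The following Python/OpenCV sketch (an illustration with a placeholder video path, not taken from the cited works) marks as foreground any pixel whose intensity changes by more than a fixed threshold between consecutive frames; this is precisely the scheme that dynamic backgrounds and illumination changes defeat.

import cv2

cap = cv2.VideoCapture("station.avi")  # placeholder video path
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pixels that changed by more than a fixed threshold are marked as
    # foreground; everything else is treated as static background.
    diff = cv2.absdiff(gray, prev)
    _, fg_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    prev = gray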
3.1.1 Background modelling via Gaussian mixture models
A robust background-modelling methodology aims at constructing a model that is capable of eliminating dynamic background objects while efficiently keeping track of genuine foreground objects over a temporal sequence of video frames. Gaussian mixture models (GMMs) are one of the oldest methods used to learn from time-based pixel variations. [8] utilised a probabilistic GMM architecture to train each pixel based on its intensity variations over time. The methodology was further extended by [9] to include statistical Bayesian modelling-based artificial intelligence (AI). However, the two approaches suffered from two major setbacks. Firstly, the models could not incorporate object shadows as background pixels and secondly, a model trained for slow intensity variations would fail on abrupt intensity changes, and vice versa. Figure 4 shows a sample video sequence taken from the ChangeDetection repository in which the standard GMM algorithm implemented in OpenCV fails on the bus-station video [10, 11]. [12] attempted a time-adaptive system in which pixels were able to integrate variable intensity rates. To further improve the technique, [13] adopted a hierarchical approach that integrates colour and gradient information to differentiate between overlapping foreground and background pixels with matching intensity and colour profiles. Yet the issue of shadow incorporation remained largely unresolved in most of these GMM-based models.
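As a present-day illustration (not part of the experiments reviewed here), OpenCV ships a GMM-based subtractor in the spirit of [8, 12]; a minimal Python usage sketch, with a placeholder video path, is given below, before the failure case shown in Figure 4.

import cv2

cap = cv2.VideoCapture("bus_station.avi")  # e.g. a ChangeDetection sequence

# The MOG2 subtractor maintains a per-pixel Gaussian mixture;
# detectShadows=True marks shadow pixels with an intermediate grey value.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow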
Figure 4: An implementation of the MNRL-based codebook algorithm given in [7] via the OpenCV library, presenting the inherent weaknesses of GMM-based background segmentation, evaluated against a benchmarking video taken from [10, 11]
3.1.2 Codebook-based background subtraction
The issues with shadows and abrupt intensity variations were predominantly addressed by another genre of algorithms based on a pixel-level, time-based codebook methodology. The technique keeps a record of the intensity-variation behaviour of each pixel in a codebook built over time. Perhaps the most groundbreaking implementation in this domain is by [7], who introduced a measure termed the maximum negative run-length (MNRL). The algorithm classifies a pixel's behaviour by learning its change rate over a set period of frames and keeps, for each codeword, a record of the following parameters (a simplified sketch follows the list):
• the minimum and maximum brightness,
• the frequency with which the codeword has occurred in the database,
• the maximum negative run-length, and
• the first and last access times of the codeword.
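The sketch below is a deliberately simplified, scalar-intensity rendering in Python of the codebook training step of [7]; the original also models colour distortion, and the matching tolerance and variable names here are illustrative assumptions.

class Codeword:
    """Book-keeping for one codeword of a single pixel (intensity only)."""
    def __init__(self, v, t):
        self.lo = self.hi = float(v)  # minimum and maximum brightness seen
        self.freq = 1                 # how often this codeword has matched
        self.mnrl = t - 1             # maximum negative run-length so far
        self.first = self.last = t    # first and last access times

def train_codebook(samples, tol=10.0):
    """Build a codebook for ONE pixel from its intensity time series."""
    book = []
    for t, v in enumerate(samples, start=1):
        for cw in book:
            if cw.lo - tol <= v <= cw.hi + tol:          # matches this codeword
                cw.lo, cw.hi = min(cw.lo, v), max(cw.hi, v)
                cw.freq += 1
                cw.mnrl = max(cw.mnrl, t - cw.last - 1)  # frames since last match
                cw.last = t
                break
        else:                                            # no match: new codeword
            book.append(Codeword(v, t))
    n = len(samples)
    for cw in book:  # wrap-around run between the last and first access [7]
        cw.mnrl = max(cw.mnrl, n - cw.last + cw.first - 1)
    # Temporal filtering: keep codewords recurring at least every n/2 frames.
    return [cw for cw in book if cw.mnrl <= n / 2]

# A pixel flickering between two intensity bands (e.g. a swaying branch over
# sky) yields two background codewords after filtering.
background = train_codebook([50, 52, 200, 51, 198, 49, 201, 50])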
The technique has produced promising outcomes in the domain of background subtraction, particularly in the modelling of highly changing scenes such as traffic videos, pedestrian motion tracking and even gesture and gait recognition [4, 9, 14-18].
4. ANALYSIS OF RECENT TECHNOLOGICAL ADVANCEMENTS IN IMAGE PROCESSING
Yet the biggest shortcoming of the majority of histogram-, GMM- and codebook-based algorithms lies in their ability to process only a 2D realisation of an image. With rapidly changing technologies, the advent of 3D scanners did introduce a sense of novelty and promise to the image and video processing domain; however, the overwhelmingly tedious calibration process and the need for willing subjects severely limited their use in real-time image processing. Moreover, scenes captured via moving cameras incur the further overhead of a separate model for each camera position in order to efficiently differentiate foreground pixels. The current state-of-the-art substantially lacks in terms of moving-camera object recognition in the absence of a robust and supervised AI model.
With the recent introduction of infra-red sensing devices such as the Microsoft Kinect, the domain of background subtraction has taken on a new aspect in which pixels are realised not merely in a 2D intensity domain but in a 3D point-cloud space, where distance can be measured and modelled with respect to the infra-red camera present on the device itself. The technology has already revolutionised the XBOX gaming domain and, with the launch of the Windows-based Kinect in February 2012 along with its SDK, it is now possible for conventional programmers to apply the depth-map and sensing APIs to a wide range of real-world applications including gesture recognition, motion sensing, film & animation and high-resolution 3D surface regeneration.
Work in the domain of point-cloud processing for graphical reconstruction, with the objective of 3D surface matching, has increasingly been used to compare and identify objects such as human faces, vehicles and aerial scans as 3D surface plots [1, 2]. The field is increasingly finding applications in forensics [3] and is very likely to be extended to real-world applications of multi-dimensional aerial scanning [4], beyond-visible-spectrum biometrics [5], fire detection in smoke [6], industrial condition monitoring [7] and, most importantly, medical and surgical applications of tumour detection, advanced magnetic resonance imaging (MRI) and gait-analysis-based physical abnormality detection [8, 9].
Despite the promising nature of depth-sensing, infra-red and thermographic devices in computer vision, the technology is still not used substantially in everyday real-world settings. However, as discussed earlier, with the advent of low-cost depth-sensing devices such as the Microsoft Kinect, the domain can now be explored for everyday touch-free applications. Figure 5 presents samples of (a) skeletal joint mapping, (b) a Delaunay triangulation used to capture a 3D face wireframe, (c) a grey-scale depth profile from the Kinect sensor for distance measurement and (d) a thermograph capturing temperature information from distant objects.

Figure 5: Diagrammatic representation of a Kinect depth-map profile, with distance encoded as grey intensity and closer objects, such as the hand, shown with intensity values closer to 255

Feature vectors from the streams shown in Figure 5 can be used in a wide range of real-world applications including sign language recognition [10], gait identification [11], touch-free biometrics, 3D face recognition [12] and zero-visibility motion sensing (via infra-red sensing) [13].
Moreover, as the device's uniqueness lies in its single-directional capability, it is possible to embed the technology in future handheld devices such as smart phones and tablets. Such an integration is likely to introduce opportunities in 3D photography, the animation and film industry, robotics, augmented reality, education and virtual reality. Ultimately, the main limitation of the current state-of-the-art lies in the computational capability of conventional handheld hardware, which is still maturing for the high-quality rendering involved in multi-dimensional processing.
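A minimal sketch of the depth-to-intensity mapping behind such a profile, assuming a millimetre-valued depth frame and the roughly 0.8-4 m working range of the first-generation Kinect (both assumptions, not details given in the paper):

import numpy as np

def depth_to_grey(depth_mm, near=800.0, far=4000.0):
    """Map a Kinect-style depth frame (millimetres) to an 8-bit grey image,
    with closer surfaces rendered brighter (intensity near 255)."""
    d = np.clip(depth_mm.astype(np.float32), near, far)
    norm = (far - d) / (far - near)   # near -> 1.0, far -> 0.0
    return (norm * 255).astype(np.uint8)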
5. CONCLUSION
This paper has presented a detailed analysis of the core concepts of image processing and segmentation in real-world applications. Having discussed these, the review moved to dynamic, video-based image processing, where the majority of recent investigations are now concentrated. A detailed review of video acquisition and processing techniques, against the backdrop of the depth-processing and 3D point-cloud capabilities of recently released hardware, reveals a wide and promising set of applications. Most importantly, a 3D infrared depth-map is expected to provide a set of features that, if combined with the latest AI techniques, is likely to increase the overall detection and classification accuracy of existing systems.
Furthermore, as the camera itself does not require multiple viewpoints, it is envisaged that the future integration of such cameras into mobile devices and smart phones is likely to revolutionise the way pictures are taken from handheld devices. Moreover, a further integration of infrared-based thermographs is foreseen to completely change the remote diagnosis and treatment of patients. The technology is very likely to enable a GP, or even artificial diagnosis software on a smart phone, to detect and identify body-temperature changes, breathing problems and heart and pulse rates merely by non-invasive, touch-free body scans. In the industrial domain, real-time sparse point clouds can be compared to reference point clouds of a machine's motion to pre-emptively diagnose operational anomalies such as excess vibration or abnormal noise patterns. To wrap up, the depth-scan and 3D sensing capabilities built into "single-directional" devices like the Kinect are widely expected to benefit a wide range of real-world domains.
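To make the industrial monitoring idea concrete, a toy Python sketch of comparing a sparse observed point cloud against a reference cloud (purely illustrative; a real system would use an indexed nearest-neighbour search and a calibrated baseline):

import numpy as np

def cloud_deviation(observed, reference):
    """Crude anomaly score: mean distance from each observed 3-D point to
    its nearest neighbour in the reference cloud (both N x 3 arrays)."""
    # Pairwise distances via broadcasting; fine for small, sparse clouds.
    d = np.linalg.norm(observed[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1).mean()

# A score well above the baseline recorded for a healthy machine would flag,
# for example, excess vibration within the motion envelope.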
6. REFERENCES
[1] Szeliski, R., Computer Vision: Algorithms and Applications. Texts in Computer Science. 2011, London; New York: Springer.
[2] Petrou, M. and C. Petrou, Image Processing: The Fundamentals. 2nd ed. 2010, Chichester: Wiley.
[3] Vaseghi, S.V., Advanced Digital Signal Processing and Noise Reduction. 4th ed. 2008, Chichester: J. Wiley & Sons.
[4] Moeslund, T.B., Visual Analysis of Humans: Looking at People. 2011, London; New York: Springer-Verlag London Limited.
[5] Chang, S.L., et al., Automatic license plate recognition. IEEE Transactions on Intelligent Transportation Systems, 2004. 5(1).
[6] Rice, S.V., G. Nagy, and T.A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier. The Kluwer International Series in Engineering and Computer Science. 1999, Boston, Mass.; London: Kluwer Academic Publishers.
[7] Kim, K., et al., Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 2005. 11(3).
[8] Stauffer, C. and W.E.L. Grimson, Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000. 22(8): p. 747-757.
[9] Lee, D.S., et al., A Bayesian framework for Gaussian mixture
background modeling. 2003 International Conference on
Image Processing, Vol 3, Proceedings, 2003: p. 973-976.
[10] ChangeDetection. ChangeDetection Video Database. 2012 [cited 19 August 2012]; Available from: http://www.changedetection.net/.
[11] Goyette, N., et al. changedetection.net: A new change
detection benchmark dataset. in Proc. IEEE Workshop on
Change Detection (CDW’12). 2012. Providence, RI.
[12] Harville, M., A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models. Computer Vision – ECCV 2002, Pt III, 2002. 2352: p. 543-560.
[13] Javed, O., K. Shafique, and M. Shah, A hierarchical approach to robust background subtraction using color and gradient information. IEEE Workshop on Motion and Video Computing (Motion 2002), Proceedings, 2002: p. 22-27.
[14] Buch, N., S.A. Velastin, and J. Orwell, A review of computer vision techniques for the analysis of urban traffic. IEEE Transactions on Intelligent Transportation Systems, 2011. 12(3).
[15] Cristani, M., M. Bicego, and V. Murino, Integrated region- and pixel-based approach to background modelling. IEEE Workshop on Motion and Video Computing (Motion 2002), Proceedings, 2002: p. 3-8.
[16] Ilyas, A., et al., Real time foreground-background segmentation using a modified codebook model. AVSS: 2009 6th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2009: p. 454-459.
[17] Diamantopoulos, G. and M. Spann, Event detection for intelligent car park video surveillance. Real-Time Imaging, 2005. 11(3): p. 233-243.
[18] Xiang, T. and S.G. Gong, Video behaviour profiling and abnormality detection without manual labelling. in 10th IEEE International Conference on Computer Vision (ICCV 2005). 2005. Beijing, China.
1. Pauly, M., R. Keiser, and M. Gross, Multi-scale feature extraction on point-sampled surfaces. Computer Graphics Forum, 2003. 22(3).
2. Schnabel, R., R. Wahl, and R. Klein, Efficient RANSAC for point-cloud shape detection. Computer Graphics Forum, 2007. 26(2).
3. Vanezis, P., et al., Facial reconstruction using 3-D computer graphics. Forensic Science International, 2000. 108(2).
4. Guo, L., et al., Relevance of airborne lidar and multispectral image data for urban scene classification using Random Forests. ISPRS Journal of Photogrammetry and Remote Sensing, 2011. 66(1).
5. Moreno-Moreno, M., J. Fierrez, and J. Ortega-Garcia, Biometrics beyond the visible spectrum: imaging technologies and applications. Biometric ID Management and Multimodal Communication, Proceedings, 2009. 5707.
6. Kolaric, D., K. Skala, and A. Dubravic, Integrated system for forest fire early detection and management. Periodicum Biologorum, 2008. 110(2).
7. Omar, M., K. Kuwana, and K. Saito, The use of infrared thermograph technique to investigate welding related industrial fires. Fire Technology, 2007. 43(4).
8. Lee, M.-Y. and C.-S. Yang, Entropy-based feature extraction and decision tree induction for breast cancer diagnosis with standardized thermograph images. Computer Methods and Programs in Biomedicine, 2010. 100(3).
9. Selvarasu, N., et al., Abnormality detection from medical thermographs in humans using Euclidean distance based color image segmentation. 2010 International Conference on Signal Acquisition and Processing: ICSAP 2010, Proceedings, 2010.
10. Keskin, C., et al., Real time hand pose estimation using depth sensors. 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011.
11. Stone, E. and M. Skubic, Evaluation of an inexpensive depth camera for in-home gait assessment. Journal of Ambient Intelligence and Smart Environments, 2011. 3(4).
12. Mahoor, M.H. and M. Abdel-Mottaleb, A multimodal approach for face modeling and recognition. IEEE Transactions on Information Forensics and Security, 2008. 3(3).
13. Elangovan, V. and A. Shirkhodaie, Recognition of human activity characteristics based on state transitions modeling technique. in Conference on Signal Processing, Sensor Fusion, and Target Recognition XXI. 2012. Baltimore, MD.