Audio-Visual Detection And Classification Of
Vehicle Using Multiclass SVM: A Review
Dhanashree S. Tayade
Sanjay S. Gharde
Department of Computer Engineering
SSBT’s COET Bambhori, Jalgaon
Maharashtra, India
Department of Computer Engineering
SSBT’s COET Bambhori, Jalgaon
Maharashtra, India
Abstract—In recent years, many researchers have developed
vision-based techniques for the detection and recognition of
moving vehicles. However, some of these methods fail or are
computationally expensive. Problems such as vehicle occlusion,
misdetection of vehicles, and increased computational load
reduce the accuracy and degrade the performance of vehicle
analysis. To overcome these problems, an audio-visual approach
is developed for vehicle classification in uncontrolled
environments. By using various types of audio and video features
together with a multiclass SVM for classification, accuracy can
be improved and finer classification can be achieved.
Index Terms— vehicle classification, audio feature, video
Vehicle detection and classification in video has become an
active area of research because of its many applications in
video-based intelligent transportation systems. For example,
counting vehicles at a busy traffic circle over a period of time
helps authorities control the duration of traffic signals
efficiently and reduce congestion during rush hours. Usually,
vehicles are detected in a video by detecting objects that have
significant motion. Motion-estimation-based vehicle detection
techniques include the interframe difference method, the optical
flow estimation method, the Gaussian scale mixture model method,
and the background subtraction method. Using only visual
features may not be sufficient; complementary acoustic
signatures, such as loudness and sharpness, help in categorizing
typical types of vehicles.
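As an illustration of the interframe difference method named above, the following is a minimal sketch rather than any cited author's implementation. It assumes grayscale frames stored as nested lists of 0-255 intensities; the function names and the threshold value of 25 are hypothetical choices made for this example.

```python
def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Return a binary motion mask: 1 where |curr - prev| exceeds the threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the moving pixels, or None if none moved."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Two tiny 4x6 frames: a bright 2x2 "vehicle" moves one pixel to the right.
prev_frame = [[10] * 6 for _ in range(4)]
curr_frame = [[10] * 6 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        prev_frame[r][c] = 200
    for c in (2, 3):
        curr_frame[r][c] = 200

mask = frame_difference_mask(prev_frame, curr_frame)
box = bounding_box(mask)
print(box)
```

Only the leading and trailing edges of the moving object appear in the mask, because the overlapping pixels do not change between frames. This is one reason frame differencing is often combined with background subtraction.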
Multimodality leads to the extraction of higher-quality and
more reliable information than can be obtained from a single
modality. The advantage is twofold. First, because the
modalities are usually complementary, the outcome of multimodal
processing is more useful than that of each modality
individually. This holds across application domains, such as
multimodal identification or multimodal image processing.
Second, because modalities are sometimes unreliable, when one
modality becomes corrupted the lost information can often be
recovered from the others, leading to a more reliable system.
Using multiple modalities in vehicle detection reduces false
detections and can make classification more efficient. Here,
various types of audio and visual features are analyzed. The
audio features include short-time energy (STE); spectral energy,
entropy, flux, and centroid; and Mel-frequency cepstral
coefficients (MFCCs). They are grouped into three types:
temporal features (STEs), spectral features (SPECs), and
perceptual features (PERCs). For the visual modality, a
color-based feature is used. The same features may play
different roles. Here we design classification tasks using this
set of features on the dataset and provide a thorough study of
feature extraction and combinations of features for vehicle
classification using SVMs.
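The temporal and spectral audio features listed above can be sketched as follows. This is an illustrative example only, using textbook definitions; the function names, the naive DFT, the 8000 Hz sample rate, and the 64-sample frame are assumptions made for the sketch and are not taken from the reviewed systems.

```python
import cmath
import math


def short_time_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(x * x for x in frame) / len(frame)


def magnitude_spectrum(frame):
    """Magnitudes of the first N/2 + 1 DFT bins (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(mags, sample_rate, n):
    """Magnitude-weighted mean frequency in Hz."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total


def spectral_entropy(mags):
    """Shannon entropy (in bits) of the normalized power spectrum."""
    power = [m * m for m in mags]
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power]
    return -sum(q * math.log2(q) for q in probs if q > 0)


def spectral_flux(prev_mags, mags):
    """Squared change of the magnitude spectrum between two frames."""
    return sum((m - p) ** 2 for p, m in zip(prev_mags, mags))


# Hypothetical test frame: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate, n = 8000, 64
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
mags = magnitude_spectrum(frame)
ste = short_time_energy(frame)
centroid = spectral_centroid(mags, sample_rate, n)
entropy = spectral_entropy(mags)
```

For a pure tone the centroid sits at the tone frequency and the spectral entropy is close to zero. A broadband engine noise would give a higher entropy, which is the kind of contrast such features are meant to capture.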
The rest of this paper is organized as follows. Section II
describes related work on vehicle detection and classification
techniques and on audio-video features. Section III presents the
problem definition. Section IV explains the proposed work and
the support vector machine classification technique. Section V
discusses the various methods and techniques in detail and
reviews them in tabular form. Finally, Section VI presents the
conclusion.
Over the past decade, the density of traffic on roads and
highways has been increasing constantly. Intelligent traffic
management systems are needed to avoid traffic congestion and
accidents and to ensure the safety of road users. Traffic
surveillance systems based on video cameras cover a broad range
of tasks, such as vehicle counting, lane occupancy and speed
measurement, and classification; they also detect serious events
such as fire and smoke, traffic jams, or lost cargo. Integrating
data from video, audio, infrared, ultrasonic, or inductive loop
sensors helps to improve recognition rates and robustness and to
reduce the ambiguity and uncertainty of systems based on video
images only. Vehicle classification is an important component of
current traffic management systems. Distinguishing between
vehicle types such as cars, motorbikes, buses, and trucks
provides useful data about road utilization and affords detailed
traffic information. A literature review of audio and video
approaches and of classification techniques is given below.
A. Audio Approaches
In audio-based approaches, the features require only a small
amount of space if they need to be stored. A further benefit of
audio approaches is that they usually require fewer
computational resources than visual approaches. In addition, the
audio clips can be very short; many researchers have used clips
1-2 seconds long. Hence audio approaches are important in video
classification. Audio features can lead to three layers of audio
understanding: low-level acoustics, such as the average
frequency of a frame; mid-level sound objects, such as the audio
signature of the sound a ball makes while bouncing; and
high-level scene classes, such as background music playing in
certain types of video scenes. Some representative work is
reviewed here. Roach and Mason [1] use the audio from video, in
particular Mel-frequency cepstral coefficients (MFCCs), for
classification. This approach is chosen because of its success
in automatic speech recognition. The authors examine how many of
the coefficients to keep and find that the best results occur
with 10-12 coefficients. Dinh et al. [2] apply a Daubechies 4
wavelet to seven sub-bands of audio clips. Like the discrete
cosine transform (DCT), wavelet transforms have good energy
compaction and are useful for reducing dimensionality. The
features representing the audio clips are computed from the
wavelet coefficients: sub-band energy, sub-band variance, and
zero crossing rate, as well as two features defined by the
authors, centroid and bandwidth. Moncrieff et al. [3] use
audio-based cinematic principles to distinguish between horror
and non-horror movies. Changes in sound energy intensity are
used to detect what the authors call sound energy events.
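The wavelet sub-band features described for [2] can be illustrated with a short sketch. Note that it substitutes the simpler Haar wavelet for the wavelet used in [2], so that the transform fits in a few lines. The function names, the test signal, and the choice of six decomposition levels (giving seven sub-bands) are assumptions made for this example.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail


def haar_subbands(signal, levels):
    """Return `levels` detail bands followed by the final approximation band."""
    bands = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_step(current)
        bands.append(detail)
    bands.append(current)
    return bands


def subband_features(band):
    """Energy, variance and zero-crossing rate of one sub-band."""
    n = len(band)
    energy = sum(c * c for c in band)
    mean = sum(band) / n
    variance = sum((c - mean) ** 2 for c in band) / n
    crossings = sum(1 for a, b in zip(band, band[1:]) if a * b < 0)
    zcr = crossings / (n - 1) if n > 1 else 0.0
    return energy, variance, zcr


# Hypothetical 64-sample clip: two mixed sinusoids.
signal = [
    math.sin(2 * math.pi * 5 * t / 64) + 0.5 * math.sin(2 * math.pi * 20 * t / 64)
    for t in range(64)
]
bands = haar_subbands(signal, levels=6)  # 6 detail bands + 1 approximation = 7
features = [subband_features(band) for band in bands]
total_band_energy = sum(f[0] for f in features)
signal_energy = sum(x * x for x in signal)
```

Because the Haar transform is orthonormal, the sub-band energies sum to the energy of the original clip, so the energy features describe how that energy is distributed across frequency ranges.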
B. Video Approaches
A video is a sequence of images called frames. All of the
frames within a single camera action are called a shot. A scene
is one or more shots that form a semantic unit. Most approaches
that utilize visual features extract them on a per-frame or
per-shot basis. Some authors use the terms shot and scene
interchangeably; typically, when they use the term scene they
are really referring to a shot. Shots are also associated with
some cinematic principles. For example, movies that focus on
action tend to have shots of shorter duration than those that
focus on character development. Identifying scenes is even more
difficult, and few video classification approaches do so. A
problem with visual features is the huge amount of potential
data. This problem can be alleviated by using key frames to
represent shots or by dimensionality reduction techniques, such
as the application of wavelet transforms. Visual features
include color-based, MPEG (Moving Picture Experts Group),
shot-based, object-based, and motion-based features. Some
representative work on visual approaches is reviewed here. Drew
and Au [4] normalize the color channel bands of each frame and
then move them into a chromaticity color space. For shot-based
features, Lienhart states that some video editing systems
provide more than 100 different types of edits and that no
current method can correctly identify all types. Iyengar and
Lippman detect shot changes using the Kullback-Leibler distance
between histograms of consecutive frames that have been
transformed to the rgb color space.
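The histogram-based shot change detection described above can be sketched as follows. This is a simplified illustration, not the cited authors' method: it uses grayscale intensities instead of the rgb color space, a symmetrized Kullback-Leibler divergence, and a hypothetical bin count and threshold chosen for the example.

```python
import math


def normalized_histogram(values, bins=8, max_value=256):
    """Intensity histogram, floored to avoid log(0) and normalized to sum to 1."""
    counts = [1e-6] * bins
    for v in values:
        counts[min(v * bins // max_value, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetrized KL divergence, used here as the histogram distance."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def detect_shot_changes(frames, threshold=1.0):
    """Return the indices i at which frame i starts a new shot."""
    hists = [normalized_histogram(f) for f in frames]
    return [
        i for i in range(1, len(hists))
        if symmetric_kl(hists[i - 1], hists[i]) > threshold
    ]


# Hypothetical frames given as flat lists of pixel intensities.
dark_a = [20, 30, 25, 40] * 4
dark_b = [22, 28, 31, 35] * 4
bright = [220, 230, 210, 240] * 4
frames = [dark_a, dark_b, dark_a, bright, bright]
changes = detect_shot_changes(frames)
print(changes)
```

Small intensity changes within a shot leave the histogram almost unchanged, while a cut to a differently lit shot moves the probability mass to other bins and produces a large distance.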
C. Classification Techniques
A Bayesian classifier is a probabilistic classifier based on
Bayes' theorem. The Bayesian subspace method was first proposed
by Moghaddam and Pentland for face recognition, and later
Mario E. Munich [5] used the same technique for vehicle
classification. In this technique, m different event classes are
observed for classification. Each class is assumed to consist of
independent instances of a parametrically distributed random
process. To implement a Bayesian classifier, two parameters are
required, which can be estimated using the maximum likelihood
(ML) or maximum a posteriori probability (MAP) rule. A Bayesian
classifier requires a large training set; otherwise it will not
be able to classify vehicles correctly. This requirement makes
it hard to compute the discriminant functions.
The Hidden Markov Model (HMM) approach to vehicle
classification is based on estimating the sequence of states
given a sequence of observations. L. D. Ahmad Aljaafreh [6]
proposed modeling a distributed multiple-hypothesis
classification problem with an HMM. The number of vehicles in
each class is considered a state, and each state depends on the
previous state. For a sequence of observations, the Viterbi
algorithm is used to find the maximum-likelihood sequence of
states. In this approach, the hypotheses that need to be tested
have non-zero transition probabilities.
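The Viterbi decoding step mentioned above can be illustrated with a generic log-domain implementation. The states (0 or 1 vehicles present), the observation symbols, and all probabilities below are hypothetical values invented for this sketch; they are not taken from [6].

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log(trans_p[r][s]))
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical model: the state is the number of vehicles present (0 or 1).
states = [0, 1]
start_p = {0: 0.8, 1: 0.2}
trans_p = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
emit_p = {0: {"quiet": 0.9, "loud": 0.1}, 1: {"quiet": 0.2, "loud": 0.8}}

path = viterbi(["quiet", "loud", "loud", "quiet"], states, start_p, trans_p, emit_p)
print(path)  # [0, 1, 1, 0]
```

Transitions with zero probability receive a log-score of negative infinity and can never lie on the best path, which matches the remark that only hypotheses with non-zero transition probabilities need to be tested.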
Artificial neural networks have also been used in the literature
to recognize the type of vehicle. In this method, a two-layer
backpropagation neural network is trained on the output of a
self-organizing map neural network. In some literature, a rough
neural network is used to classify vehicles, with 25, 25, and 4
neurons in the input, hidden, and output layers, respectively.
The SVM has emerged as one of the most efficient learning
algorithms in computer vision. While many challenging
classification problems are inherently multiclass, the original
SVM can solve only binary classification problems. Because of
the significant appearance variation across different vehicles,
solving vehicle classification directly with a single SVM unit
should be avoided. A better way is to combine several binary SVM
classifiers to classify vehicles.
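One common way to combine binary SVMs into a multiclass classifier is the one-vs-rest scheme, sketched below. This is an illustrative example under stated assumptions: each binary SVM is a linear SVM trained by plain subgradient descent on the regularized hinge loss, the two-dimensional features and class centers are synthetic, and all function names and hyperparameters are hypothetical. The source does not specify which combination scheme the reviewed systems use.

```python
import random


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_binary_svm(points, labels, epochs=200, lr=0.01, reg=0.01, seed=0):
    """Linear SVM via subgradient descent on the regularized hinge loss.

    Labels must be +1 or -1. Returns (weights, bias).
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(points[0]), 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            violated = y * (dot(w, x) + b) < 1
            # Regularizer shrinks the weights on every step.
            w = [wj * (1 - lr * reg) for wj in w]
            if violated:
                # Hinge term pushes margin violators to the correct side.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def train_one_vs_rest(points, classes):
    """Train one binary SVM per class: that class (+1) against all others (-1)."""
    models = {}
    for c in sorted(set(classes)):
        binary = [1 if k == c else -1 for k in classes]
        models[c] = train_binary_svm(points, binary)
    return models


def predict(models, x):
    """Pick the class whose binary SVM gives the largest decision value."""
    return max(models, key=lambda c: dot(models[c][0], x) + models[c][1])


# Synthetic 2-D features for three vehicle classes (hypothetical data).
rng = random.Random(1)
centers = {"car": (0.0, 4.0), "van": (-4.0, -3.0), "truck": (4.0, -3.0)}
points, classes = [], []
for name, (cx, cy) in centers.items():
    for _ in range(20):
        points.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
        classes.append(name)

models = train_one_vs_rest(points, classes)
accuracy = sum(predict(models, p) == c for p, c in zip(points, classes)) / len(points)
```

An alternative design is one-vs-one, which trains a binary SVM for every pair of classes and predicts by voting; it needs more classifiers but each one is trained on less data.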
Yung-Sheng Chen et al. [7] implemented several methods for
vehicle detection and classification, such as a background
updating method, detection of lane-dividing lines, and a shadow
detection technique with a new linearity feature.
Anshul Goyal [8] proposed a neural-network-based approach for
vehicle classification. This system extracts different
structural features of a given vehicle. These features are
extracted by capturing video of vehicles from different angles.
The images are then normalized, and based on the normalized
features the classifier assigns the given image to one of the
given vehicle types.
Niluthpol Chowdhury Mithun et al. [9] implemented two methods,
multiple time-spatial images and multiple virtual detection
lines, with shape-based, shape-invariant, and texture-based
feature extraction.
Fig. 1: Flow of Proposed System
Tao Wang and Zhigang Zhu [10] proposed a support vector machine
for vehicle classification, with features extracted from both
audio and video.
In the last few years, significant research has aimed to enhance
safety by monitoring the on-road environment. As a result,
advances in vehicle detection and classification systems are
required. Many researchers have put effort into detecting and
classifying vehicles using vision-based sensing modalities.
However, because of limitations such as classification accuracy,
time required, shadow problems, and weather conditions during
vehicle detection, the goals have not been achieved. These goals
can be achieved by using multiple modalities and a good-quality
classification technique in the proposed system.
A. Objectives of Proposed System
- To improve the performance of the existing system using
multimodal features.
- To increase accuracy and computational speed over the
existing system.
- To achieve finer classification by using a multiclass SVM.
An intelligent transportation system (ITS) is an application
that integrates electronic, computer, and communication
technologies into vehicles and roadways for monitoring traffic
conditions, reducing congestion, and enhancing mobility. The
proposed system is designed for detecting and classifying
vehicles in various weather conditions. The performance of the
proposed system can be improved by using multiple modalities,
i.e., audio and video features, and by using a multiclass SVM
for classification. Figure 1 shows the flow of the proposed
system. First, image shots of the moving vehicle are captured
from the video clip. Then an image is reconstructed from the
original images using the moving vehicle image reconstruction
algorithm [11]. Next, visual features such as color features are
extracted from the reconstructed image and given to the SVM for
classification. The audio features, such as short-time energy,
spectral energy, and MFCCs, are extracted from the video clip
and provided to the SVM classifier. The support vector machine
is used to classify the type of vehicle. The SVM technique has
been very successful in several application areas. It was
originally designed as a binary classifier and is considered
here the most suitable method for classification. The SVM is a
formulation of a learning task. The multiclass SVM can
categorize vehicles into a sufficiently large number of classes.
Hence, by using a multiclass SVM in the proposed work, improved
classification accuracy can be gained.
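The step in which color and audio features are both provided to the classifier can be sketched as an early-fusion feature vector. This is only one plausible arrangement, assumed for illustration: the joint RGB histogram with 4 bins per channel, the per-feature standardization, the three audio values per sample, and all function names are hypothetical and are not specified in the source.

```python
import math


def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of (r, g, b) pixels with 0-255 channels."""
    n = bins_per_channel
    hist = [0.0] * n ** 3
    for r, g, b in pixels:
        index = (r * n // 256) * n * n + (g * n // 256) * n + (b * n // 256)
        hist[index] += 1
    return [h / len(pixels) for h in hist]


def fuse(visual, audio):
    """Early fusion: concatenate the visual and audio feature vectors."""
    return list(visual) + list(audio)


def standardize(rows):
    """Z-score each feature across samples so that no feature dominates by scale."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Two hypothetical samples: a mostly red vehicle and a mostly blue vehicle,
# each with made-up audio features (short-time energy, centroid, entropy).
red_pixels = [(250, 10, 10)] * 3 + [(10, 10, 250)]
blue_pixels = [(10, 10, 250)] * 4
red_hist = color_histogram(red_pixels)
fused = [
    fuse(red_hist, [0.5, 1000.0, 0.1]),
    fuse(color_histogram(blue_pixels), [0.2, 400.0, 0.3]),
]
scaled = standardize(fused)
```

The standardized rows could then serve as input vectors for a multiclass SVM. Scaling matters because SVM decision functions depend on distances, and raw features such as a centroid in Hz would otherwise outweigh histogram bins that lie between 0 and 1.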
Table I: Analysis of vehicle detection and classification techniques
Author Name | Classification Technique/Method | Feature Extraction | Vehicle Types | Total Vehicle Samples
Yung-Sheng Chen | Background updating; detection of lane-dividing lines; shadow detection technique | - | Car, mini-van, truck, van | Total vehicle count 20443
Anshul Goyal et al. | Neural network; MLP classifier | Shape feature | Double decker bus, Chevrolet van | 400 samples taken
Niluthpol Chowdhury Mithun (2012) | Multiple time-spatial images; multiple virtual detection line | Shape invariant | 6W-Pickup van | Video clip of more than 1 hr. duration
Tao Wang and Zhigang Zhu (2012) [10] | Support vector machine | Audio features; video features | Sedan, van, pickup truck, and buses | Total 455 samples (280 training set, 205 testing set)
In vehicle detection and classification, a lot of work has been
done using various methods. Table I shows an analysis of various
vehicle detection and classification techniques. From Table I,
it is observed that Yung-Sheng Chen et al. [7] implemented
methods such as a background updating method, detection of
lane-dividing lines, and a shadow detection technique; in this
system, V-scan and H-scan are performed to detect the
lane-dividing lines. The vehicle classes are car, mini-van,
truck, and van truck. The neural network system proposed by
Anshul Goyal et al. [8] classifies only three types of vehicles:
van, bus, or car. The features used in the system are shape
features, which are extracted using a system known as
Hierarchical image process. An MLP classifier is used for
classification of vehicles. Niluthpol Chowdhury Mithun et al.
[9] implemented two methods, multiple time-spatial images and
multiple virtual detection lines, and perform two-step
classification.
Tao Wang et al. [10] proposed a multimodal vehicle detection and
classification system using a support vector machine for vehicle
classification. Table II shows the technical differences between
SVM and other classifiers. In [7], occlusions may occur if the
lane-dividing line is not detected properly, and the system may
also be affected if shadows are not detected properly. Thus the
shadow elimination technique is important, because
classification accuracy depends on it. In the proposed system, a
color feature is used, so shadows and varying light conditions
are not expected to be a problem. The classification accuracy
can be improved by combining audio with visual features. In [8],
it is observed that the MLP classifier requires more training
time than SVM, and classification may be affected by weather
conditions, whereas SVM is much faster on larger datasets than
MLP. In [9], the multiple time-spatial image method is used
together with a K-NN classifier for classification.
Table II: Technical differences between SVM and other classifiers
Author Name | Classification Technique/Method | Limitations | How SVM Overcomes Limitations
Yung-Sheng Chen | Detection of lane-dividing lines; shadow detection technique | If the lane-dividing line is not detected properly, occlusion may occur; varying light conditions may affect vehicle classification; classification accuracy depends on the shadow elimination technique | A color-based feature is used, hence there is no problem of shadow elimination or varying light conditions; classification accuracy can be improved by combining audio with visual features
Anshul Goyal et al. | MLP classifier | MLP requires more training time than SVM; classification may be affected by various weather conditions and by shadow problems | SVM is much faster on larger datasets than MLP
Niluthpol Chowdhury Mithun (2012) | Multiple time-spatial images; K-NN classifier | Two-step classification is performed; the K-NN classifier may lead to classification error, especially when there is a small subset of features; it uses all features equally in computing | No need for two-step classification; SVM outperforms K-NN in vehicle classification and also does well on datasets that have many attributes
The K-NN classifier may lead to classification errors,
especially when there is a small subset of features, and it also
uses all features equally in computing. The proposed system does
not need two-step classification. SVM performs better than K-NN
in vehicle classification, and SVMs also do well on datasets
that have many attributes.
In the proposed work, goals such as classification accuracy,
computational speed, and time required are addressed by
combining various types of audio and visual features, which is
expected to improve the performance of the vehicle detection and
classification system. Using such multimodal features with a
support vector machine, good-quality classification can be
obtained. It is observed how SVM overcomes various limitations
of existing systems, such as shadow problems and varying weather
conditions. A multiclass SVM is also expected to increase
classification accuracy over existing systems. In future
studies, more combinations of features can be used to increase
accuracy.
[1] M. Roach and J. Mason, "Classification of video genre using
audio," Eurospeech, vol. 4, pp. 2693–2696, 2001.
[2] P. Q. Dinh, C. Dorai, and S. Venkatesh, "Video genre
categorization using audio wavelet coefficients," in Fifth Asian
Conference on Computer Vision, 2002.
[3] S. Moncrieff, S. Venkatesh, and C. Dorai, "Horror film genre
typing and scene labeling via audio analysis," 2003.
[4] M. S. Drew and J. Au, "Video keyframe production by efficient
clustering of compressed chromaticity signatures," in Poster
session of the eighth ACM international conference on
Multimedia (MULTIMEDIA'00), 2000, pp. 365–367.
[5] E. Munich, "Bayesian subspace methods for acoustic signature
recognition of vehicles," Proceedings of the European Signal
Processing Conference, 2004.
[6] L. D. Ahmad Aljaafreh, "Hidden Markov model based
classification approach for multiple dynamic vehicles in wireless
sensor networks," ICNSC, IEEE, 2010.
[7] Jun-Wei Hsieh, Shih-Hao Yu, Yung-Sheng Chen, and Wen-Fong
Hu, "Automatic traffic surveillance system for vehicle tracking
and classification," IEEE Transactions on Intelligent
Transportation Systems, June 2006.
[8] Anshul Goyal and Brijesh Verma, "A Neural Network based
Approach for the Vehicle Classification," Proceedings of the
2007 IEEE Symposium on Computational Intelligence in Image
and Signal Processing (CIISP 2007).
[9] Niluthpol Chowdhury Mithun and S. M. M. Rahman, "Detection
and classification of vehicles from video using multiple
time-spatial images," IEEE Transactions on Intelligent
Transportation Systems, February 2012.
[10] T. Wang and Z. Zhu, "Multimodal and multi-task audio-visual
vehicle detection and classification," 2012 IEEE Ninth
International Conference on Advanced Video and Signal-Based
Surveillance, 2012.
[11] T. Wang and Z. Zhu, "Real time vehicle detection and
reconstruction for improving classification," IEEE Computer
Society's Workshop on Applications of Computer Vision
(WACV), January 9-11, 2012, Colorado.
[12] J. Kim, B. Kim, and S. Savarese, "Comparing image
classification methods: k-nearest neighbor and support-vector
machines," Applied Mathematics in Electrical and Computer