Audio-Visual Detection And Classification Of Vehicle Using Multiclass SVM: A Review

Dhanashree S. Tayade
Department of Computer Engineering, SSBT’s COET Bambhori, Jalgaon, Maharashtra, India
dhanon19@gmail.com

Sanjay S. Gharde
Department of Computer Engineering, SSBT’s COET Bambhori, Jalgaon, Maharashtra, India
sanjay_gharde358@yahoo.com

Abstract: In recent years, many researchers have developed vision-based techniques for the detection and recognition of moving vehicles. Some of these methods fail in practice or are computationally expensive: problems such as vehicle occlusion, misdetection of vehicles and increased computational load reduce the accuracy and degrade the performance of vehicle analysis. To overcome such problems, an audio-visual approach has been developed for vehicle classification in uncontrolled environments. By using various types of audio and video features together with a multiclass SVM for classification, accuracy can be improved and finer classification can be achieved.

Index Terms: vehicle classification, audio feature, video feature.

I. INTRODUCTION

Vehicle detection and classification in video has become an active area of research because of its many applications in video-based intelligent transportation systems. For example, counting vehicles at a busy traffic circle over a period of time helps the authorities to control the duration of traffic signals efficiently and so reduce congestion during rush hours. Usually, vehicles are detected in a video by detecting objects that show significant motion. Motion-estimation-based vehicle detection techniques include the interframe difference method, the optical flow estimation method, the Gaussian scale mixture model method and the background subtraction method. Visual features alone may not be adequate; audio information can provide complementary acoustic signatures, such as loudness and sharpness, for categorizing typical types of vehicles.
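As a rough illustration of the interframe difference method named above, the following Python sketch marks the pixels that change between two grayscale frames. The frame representation (nested lists of 0-255 intensities), the function names and the threshold of 25 are assumptions made for this example; they are not taken from any of the reviewed systems.

```python
from typing import List

Frame = List[List[int]]  # grayscale image: rows of 0-255 intensities (assumed format)


def frame_difference_mask(prev: Frame, curr: Frame, threshold: int = 25) -> List[List[int]]:
    """Return a binary mask: 1 where a pixel changed by more than `threshold`."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def motion_ratio(mask: List[List[int]]) -> float:
    """Fraction of pixels flagged as moving (0.0 for an empty mask)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


if __name__ == "__main__":
    # Two tiny hypothetical frames: a bright object appears in the middle row.
    previous = [[10, 10, 10, 10],
                [10, 10, 10, 10],
                [10, 10, 10, 10]]
    current = [[10, 10, 10, 10],
               [10, 200, 210, 10],
               [10, 10, 10, 10]]
    mask = frame_difference_mask(previous, current)
    print(mask)
    print(motion_ratio(mask))
```

A practical detector would typically smooth the frames and group the flagged pixels into connected regions before treating a region as a vehicle candidate; the sketch only shows the differencing step.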
Multimodality leads to the extraction of higher-quality and more reliable information than can be obtained from a single modality. The advantage is twofold. First, as the modalities are usually complementary, the outcome of multimodal processing is more useful than that of each modality individually. This holds across application domains, such as multimodal identification or multimodal image processing. Second, as a modality is sometimes unreliable, when one modality becomes corrupted the lost information can often be recovered from the others, leading to a more reliable system. Using multiple modalities in vehicle detection reduces wrong detections and can make classification more efficient.

Here, various types of audio and visual features are analyzed. The audio features are short-time energy (STE); spectral energy, entropy, flux and centroid; and Mel-frequency cepstral coefficients (MFCCs). They are grouped into three types: temporal features (STEs), spectral features (SPECs) and perceptual features (PERCs). For the visual modality, a color-based feature is used. The same features may play different roles. We design classification tasks using this set of features on the dataset and provide a thorough study of feature extraction and feature combinations for vehicle classification using SVMs.

The rest of this paper is organized as follows. Section II describes related work on vehicle detection and classification techniques and on audio and video features. Section III presents the problem definition. Section IV explains the proposed work and the support vector machine classification technique. Section V discusses the various methods and techniques in detail and reviews them in tabular form. Finally, Section VI presents the conclusion.

II. RELATED WORK

Over the past decade, the density of traffic on roads and highways has been increasing constantly.
Intelligent traffic management systems are needed to avoid traffic congestion and accidents and to ensure the safety of road users. Traffic surveillance systems based on video cameras cover a broad range of tasks, such as vehicle counting, lane occupancy, speed measurement and classification; they also detect serious events such as fire and smoke, traffic jams or lost cargo. Integrating data from video, audio, infrared, ultrasonic or inductive loop sensors helps to improve recognition rates and robustness and to reduce the ambiguity and uncertainty of systems based on video images only. Vehicle classification is an important component of current traffic management systems. Distinguishing between vehicle types such as cars, motorbikes, buses or trucks provides useful data about road utilization and yields detailed traffic information. A literature review of audio approaches, video approaches and classification techniques is given below.

A. Audio Approaches

In audio-based approaches, features that need to be stored require only a small amount of space. Audio approaches also usually require fewer computational resources than visual approaches. In addition, the audio clips can be very short; many researchers have used clips 1-2 seconds in length. Hence audio approaches are important in video classification. Audio features can lead to three layers of audio understanding: low-level acoustics, such as the average frequency of a frame; mid-level sound objects, such as the audio signature of the sound a ball makes while bouncing; and high-level scene classes, such as background music playing in certain types of video scenes.

Some of the relevant literature is reviewed here. Roach and Mason [1] use the audio from video, in particular Mel-frequency cepstral coefficients (MFCCs), for classification. This approach was chosen because of its success in automatic speech recognition.
The authors examine how many of the coefficients to keep and find that the best results occur with 10-12 coefficients. Dinh et al. [2] apply a Daubechies 4 wavelet to seven sub-bands of audio clips. Like the discrete cosine transform (DCT), wavelet transforms have good energy compaction and are useful for reducing dimensionality. The features representing the audio clips are derived from the wavelet coefficients: sub-band energy, sub-band variance and zero crossing rate, as well as two features defined by the authors, centroid and bandwidth. Moncrieff et al. [3] use audio-based cinematic principles to distinguish between horror and non-horror movies. Changes in sound energy intensity are used to detect what the authors call sound energy events.

B. Video Approaches

A video is a sequence of images called frames. All of the frames within a single camera action are called a shot. A scene is one or more shots that form a semantic unit. Most approaches that utilize visual features extract them on a per-frame or per-shot basis. Some authors use the terms shot and scene interchangeably; typically, when they use the term scene they are really referring to a shot. Shots are also associated with some cinematic principles. For example, movies that focus on action tend to have shots of shorter duration than those that focus on character development. Identifying scenes is even more difficult, and few video classification approaches do so. Visual features suffer from the huge amount of potential data. This problem can be alleviated by using key frames to represent shots or by dimensionality reduction techniques, such as the application of wavelet transforms. Visual features include color-based, MPEG (Moving Picture Experts Group), shot-based, object-based and motion-based features.
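As one illustration of a color-based visual feature, the sketch below builds a normalized RGB color histogram for a key frame, which also reduces a large image to a short fixed-length vector. The pixel format (a flat sequence of `(R, G, B)` tuples), the function name and the choice of bins per channel are assumptions for this example, not details from the reviewed papers.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255 (assumed format)


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins_per_channel**3 entries."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0-255 channel value to a bin index in [0, bins_per_channel).
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    n = len(pixels)
    # Normalize so that frames of different sizes are comparable.
    return [h / n for h in hist] if n else hist


if __name__ == "__main__":
    # Hypothetical key frame with four pixels: two red, one blue, one green.
    frame_pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 255, 0)]
    print(color_histogram(frame_pixels, bins_per_channel=2))
```

With 4 bins per channel the feature vector has 64 entries regardless of image size, which is the kind of compact representation a classifier can consume directly.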
Some of the literature on visual approaches is reviewed here. Drew and Au [4] propose normalizing the color channel bands of each frame and then moving them into a chromaticity color space. For shot-based features, Lienhart states that some video editing systems provide more than 100 different types of edits and that no current method can correctly identify all types. Iyengar and Lippman detect shot changes using the Kullback-Leibler distance between histograms of consecutive frames that have been transformed to the rgb color space.

C. Classification Technique

The Bayesian classifier is a probabilistic classifier based on Bayes' theorem. The Bayesian subspace method was first proposed by Moghaddam and Pentland for face recognition, and later Mario E. Munich [5] used the same technique for vehicle classification. In this technique, m different event classes are observed for classification purposes. Each class is assumed to consist of independent instances of a parametrically distributed random process. Implementing a Bayesian classifier requires two parameters, which can be estimated using the maximum likelihood (ML) or maximum a posteriori (MAP) rule. A Bayesian classifier requires a large training set; otherwise it will not be able to classify vehicles correctly. This requirement makes it hard to compute the discriminant functions.

The Hidden Markov Model (HMM) implementation for vehicle classification is based on estimating the sequence of states given a sequence of observations. L. D. Ahmad Aljaafreh [6] proposed modeling a distributed multiple hypothesis classification problem with an HMM. The number of vehicles in each class is considered a state, and each state depends on the previous state. For a sequence of observations, the Viterbi algorithm is used to find the maximum likelihood sequence of states. Only the hypotheses with non-zero transition probabilities need to be tested.
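As a rough illustration of the Viterbi decoding step described above, the following Python sketch finds the most likely state sequence for a toy two-state HMM. The state names (`light`, `heavy`), the observation symbols and all probabilities are invented for this example and are not taken from [6].

```python
from typing import Dict, List, Sequence


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most likely hidden state sequence for the observations."""
    if not observations:
        return []
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            prev, prob = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = prob * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Hypothetical model: hidden traffic load, observed acoustic energy level.
    states = ["light", "heavy"]
    start = {"light": 0.6, "heavy": 0.4}
    trans = {"light": {"light": 0.7, "heavy": 0.3},
             "heavy": {"light": 0.4, "heavy": 0.6}}
    emit = {"light": {"low": 0.5, "mid": 0.4, "high": 0.1},
            "heavy": {"low": 0.1, "mid": 0.3, "high": 0.6}}
    print(viterbi(["low", "mid", "high"], states, start, trans, emit))
    # prints ['light', 'light', 'heavy']
```

Transitions with zero probability never win the maximization, so the corresponding hypotheses are effectively excluded. Long sequences would normally be decoded with log-probabilities to avoid numerical underflow.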
Next, an artificial neural network has been utilized as a recognizer of vehicle type. In this method, a two-layer back-propagation neural network is trained on the output of a self-organizing map neural network. In some literature, a rough neural network is used for vehicle classification, with 25 neurons in the input layer, 25 in the hidden layer and 4 in the output layer.

The SVM has emerged as one of the most efficient learning algorithms in computer vision. While many challenging classification problems are inherently multiclass, the original SVM can only solve binary classification problems. Because appearance varies significantly across different vehicles, solving vehicle classification directly with a single SVM unit should be avoided. A better way is to combine several binary SVM classifiers to classify vehicles.

Yung-Sheng Chen et al. [7] implemented various methods for vehicle detection and classification, such as a background updating method, detection of lane-dividing lines and a shadow detection technique with a new linearity feature. Anshul Goyal et al. [8] proposed a neural-network-based approach for vehicle classification. This system extracts different structural features of a given vehicle from video of vehicles captured from different angles. The images are then normalized, and based on the normalized features the classifier assigns the given image to one of the given vehicle types. Niluthpol Chowdhury Mithun et al. [9] implemented two methods, multiple time-spatial images and multiple virtual detection lines, with shape-based, shape-invariant and texture-based feature extraction. Tao Wang and Zhigang Zhu [10] proposed a support vector machine for vehicle classification, in which audio-visual features are extracted.

III. PROBLEM DEFINITION

In the last few years, significant research has aimed to enhance safety by monitoring the on-road environment. This calls for advances in vehicle detection and classification systems. Many researchers have worked on detecting and classifying vehicles using vision-based sensing modalities. However, because of limitations in classification accuracy and processing time, and because of shadow problems and weather conditions during detection, the goals have not been achieved. They can be achieved by using multiple modalities and a good-quality classification technique in a proposed system.

A. Objectives of Proposed System

- To improve the performance of the existing system using multimodal features.
- To increase accuracy and computational speed over the existing system.
- To achieve finer classification by using a multiclass SVM.

IV. PROPOSED WORK

An intelligent transportation system (ITS) is an application that integrates electronic, computer and communication technologies into vehicles and roadways for monitoring traffic conditions, reducing congestion and enhancing mobility. The proposed system is designed for detecting and classifying vehicles in various weather conditions. Its performance can be improved by using multiple modalities, i.e. audio and video features, and by using a multiclass SVM for classification. Figure 1 shows the flow of the proposed system.

Fig. 1: Flow of Proposed System (blocks: Video Clip, Audio Features, Image Frames, Reconstructed Image, Visual Feature, Support Vector Machine Classifier, Type of Vehicle)

In this system, image shots of the moving vehicle are first captured from the video clip. The image is then reconstructed from the original image using the moving vehicle reconstruction algorithm [11]. Next, visual features such as color features are extracted from the reconstructed image and given to the SVM for classification. Audio features such as short-time energy, spectral energy and MFCCs are extracted from the video clip and provided to the SVM classifier. The support vector machine is then used to classify the type of vehicle.
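To make the audio feature extraction step more concrete, the following Python sketch computes short-time energy, spectral energy and spectral centroid for one frame of an audio signal. It is a minimal illustration under stated assumptions: the signal is a list of float samples, the function names, frame length and hop size are chosen for this example, and a naive DFT stands in for the FFT a real system would use. MFCC extraction is not shown.

```python
import cmath
import math
from typing import List, Sequence


def split_frames(signal: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Cut the signal into overlapping frames of `frame_len` samples."""
    return [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of the frame (temporal feature)."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """One-sided DFT magnitudes, bins 0..N/2 (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_energy(frame: Sequence[float]) -> float:
    """Sum of squared one-sided spectral magnitudes (spectral feature)."""
    return sum(m * m for m in magnitude_spectrum(frame))


def spectral_centroid(frame: Sequence[float], sample_rate: float) -> float:
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    n = len(frame)
    return sum((k * sample_rate / n) * m for k, m in enumerate(mags)) / total


if __name__ == "__main__":
    # Hypothetical test tone: 1 kHz sine sampled at 8 kHz.
    rate = 8000
    tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(128)]
    for frame in split_frames(tone, frame_len=64, hop=32):
        print(short_time_energy(frame), spectral_energy(frame), spectral_centroid(frame, rate))
```

The per-frame values would be concatenated with the visual features into a single vector before being passed to the classifier; how the features are weighted or normalized is a design choice the sketch leaves open.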
The SVM technique has been very successful in several areas of application. It was originally designed as a binary classifier and is regarded as one of the most effective methods for classification. The SVM is a formulation of a learning task. A multiclass SVM can categorize vehicles into a sufficiently large number of classes. Hence, by using a multiclass SVM in the proposed work, improved classification accuracy can be gained.

TABLE I. ANALYSIS OF VARIOUS VEHICLE DETECTION AND CLASSIFICATION SYSTEMS

Author (year) | Classification technique/method | Feature extraction | Vehicle classes | Total vehicle samples
Yung-Sheng Chen et al. (2006) [7] | Background updating method; detection of lane-dividing lines; shadow detection technique | Size; linearity | Car, mini-van, truck, van truck | Total vehicle count 20443
Anshul Goyal et al. (2007) [8] | Neural network; MLP classifier | Shape feature | Double-decker bus, Chevrolet van | 400 samples
Niluthpol Chowdhury Mithun et al. (2012) [9] | Multiple time-spatial images; multiple virtual detection lines | Shape-based; shape-invariant; texture-based | 3W: auto-rickshaw; 4W: car; 6W: pickup van | Video clips of more than 1 hr duration
Tao Wang and Zhigang Zhu (2012) [10] | Support vector machine | Audio features; video features | Sedan, van, pickup truck, bus | Total 455 samples (280 training set, 205 testing set)

V. DISCUSSION

A lot of work has been done on vehicle detection and classification systems using various methods. Table I analyzes various vehicle detection and classification techniques. From Table I, it is observed that Yung-Sheng Chen et al. [7] implemented methods such as background updating, detection of lane-dividing lines and a shadow detection technique; in this system, a V-scan and an H-scan are performed to detect the lane-dividing lines. The vehicle classes are car, mini-van, truck and van truck. The neural network system proposed by Anshul Goyal et al. [8] classifies only three types of vehicle: van, bus or car.
The features used in that system are shape features, which are extracted by a system known as hierarchical image processing. An MLP classifier is used to classify the vehicles. Niluthpol Chowdhury Mithun et al. [9] implemented two methods, multiple time-spatial images and multiple virtual detection lines, and perform two-step classification. Tao Wang et al. [10] proposed a multimodal vehicle detection and classification system that uses a support vector machine for vehicle classification.

Table II shows the technical differences between SVM and other classifiers. In [7], occlusions may occur if the lane-dividing line is not detected properly, and the system may also be affected if shadows are not detected properly. Shadow elimination is therefore important, because classification accuracy depends on it. The proposed system uses a color feature, so shadows and varying light conditions are not expected to be a problem. Classification accuracy can be improved by combining audio with visual features. In [8], it is observed that the MLP classifier requires more training time than the SVM, and classification may be affected by weather conditions, whereas the SVM is much faster on larger datasets than the MLP. In [9], the multiple time-spatial image method is used with a K-NN classifier for classification. The K-NN classifier may lead to classification errors, especially when there is only a small subset of features, and it weights all features equally in its computation. The proposed system needs no two-step classification. SVM performs better than K-NN in vehicle classification, and SVMs also do well on datasets that have many attributes.

TABLE II. TECHNICAL DIFFERENCE BETWEEN EXISTING AND PROPOSED SYSTEM

Author (year) | Classification technique/method | Limitations | How SVM overcomes limitations
Yung-Sheng Chen et al. (2006) [7] | Detection of lane-dividing lines; shadow detection technique | If the lane-dividing line is not detected properly, occlusion may occur. Varying light conditions may affect vehicle classification. Classification accuracy depends on the shadow elimination technique. | A color-based feature is used, so shadow elimination and varying light conditions are not expected to be a problem. Classification accuracy is expected to improve by combining audio with visual features.
Anshul Goyal et al. (2007) [8] | MLP classifier | MLP requires more training time than SVM. Classification may be affected by weather conditions and shadow problems. | SVM is much faster on larger datasets than MLP.
Niluthpol Chowdhury Mithun et al. (2012) [9] | Multiple time-spatial images; K-NN classifier | Two-step classification is performed. The K-NN classifier may lead to classification errors, especially when there is only a small subset of features. It weights all features equally in its computation. | No two-step classification is needed. SVM outperforms K-NN in vehicle classification and also does well on datasets that have many attributes.

VI. CONCLUSION

In the proposed work, the goals of classification accuracy, computational speed and processing time are addressed by combining various types of audio and visual features, which is expected to improve the performance of the vehicle detection and classification system. Using such multimodal features with a support vector machine, good-quality classification can be obtained. The review shows how SVM overcomes limitations of existing systems, such as the shadow problem and varying weather conditions. A multiclass SVM is also expected to give higher classification accuracy than existing systems. In future studies, more combinations of features can be used to increase accuracy.

REFERENCES

[1] M. Roach and J. Mason, “Classification of video genre using audio,” Eurospeech, vol. 4, pp. 2693–2696, 2001.
[2] P. Q. Dinh, C. Dorai, and S.
Venkatesh, “Video genre categorization using audio wavelet coefficients,” in Fifth Asian Conference on Computer Vision, 2002.
[3] S. Moncrieff, S. Venkatesh, and C. Dorai, “Horror film genre typing and scene labeling via audio analysis,” 2003.
[4] M. S. Drew and J. Au, “Video keyframe production by efficient clustering of compressed chromaticity signatures,” in Poster session of the eighth ACM international conference on Multimedia (MULTIMEDIA’00), 2000, pp. 365–367.
[5] E. Munich, “Bayesian subspace methods for acoustic signature recognition of vehicles,” Proceedings of the European Signal Processing Conference, 2004.
[6] L. D. Ahmad Aljaafreh, “Hidden Markov model based classification approach for multiple dynamic vehicles in wireless sensor networks,” ICNSC, IEEE, 2010.
[7] Jun-Wei Hsieh, Shih-Hao Yu, Yung-Sheng Chen, and Wen-Fong Hu, “Automatic traffic surveillance system for vehicle tracking and classification,” IEEE Transactions on Intelligent Transportation Systems, June 2006.
[8] Anshul Goyal and Brijesh Verma, “A neural network based approach for the vehicle classification,” Proceedings of the 2007 IEEE Symposium on Computational Intelligence in Image and Signal Processing (CIISP 2007).
[9] Niluthpol Chowdhury Mithun and S. M. M. Rahman, “Detection and classification of vehicles from video using multiple time-spatial images,” IEEE Transactions on Intelligent Transportation Systems, February 2012.
[10] T. Wang and Z. Zhu, “Multimodal and multi-task audio-visual vehicle detection and classification,” 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, 2012.
[11] T. Wang and Z. Zhu, “Real time vehicle detection and reconstruction for improving classification,” IEEE Computer Society's Workshop on Applications of Computer Vision (WACV), January 9-11, 2012, Colorado.
[12] J. Kim, B. Kim, and S. Savarese, “Comparing image classification methods: k-nearest neighbor and support-vector-machines,” Applied Mathematics in Electrical and Computer Engineering.