Audio-speech segmentation and topic detection for a speech-based information retrieval system

Julian David Echeverry Correa
Speech Technology Group
Universidad Politécnica de Madrid
jdec@die.upm.es

(Thanks to everyone in the GTH, http://lorien.die.upm.es/.)

Abstract—This paper reports recent work on Automatic Speech Recognition (ASR) systems in the fields of information retrieval and audio segmentation in multimedia files. Two approaches are presented: one relies on keyword selection and topic detection, and the other on audio segmentation. A natural technique for determining long-term characteristics of speech, such as language, topic or dialect, is to first convert the input speech into a sequence of tokens (e.g. words, phones, etc.). From these tokens, we can then look for distinctive phrases or keywords that characterize the speech. In many applications, such as supervised topic detection, a set of distinctive keywords may not be known a priori. In this case, a method of keyword selection is desirable. In this paper, TF-IDF is used to determine which words in a corpus of documents are best suited to discriminate text among different topics. Experiments have been carried out using the MAVIR database. Regarding audio segmentation, although the proposed method can easily be adapted to other speech/non-speech discrimination applications, the present work focuses only on speech/music+speech segmentation. Different experiments, covering different feature extraction methods and different HMM topologies, illustrate the robustness of the approach, always resulting in a correct segmentation performance higher than 90%.

Index Terms—Speech segmentation, keyword selection, topic detection.

I. INTRODUCTION

The work presented in this paper has been carried out in the context of the author's thesis development. As part of the work related to speech recognition, there are some parallel tasks that may improve word recognition accuracy and also provide contextual information about the topic or the context in which the speech is involved. The rapid growth of multimedia content on the web and in digital resources has raised interest in developing techniques to automatically classify this content. In particular, audio files and the audio tracks of videos contain relevant information about the nature and content of the multimedia file. In this context, audio classification plays an important role as a preprocessing step in a variety of more complex systems related to information retrieval in music or to multimedia content extraction, annotation and indexing. Standard speech recognizers attempting to perform recognition on all input frames will naturally produce high error rates with such a mixed input signal. Therefore, one major challenge for ASR systems dealing with multimedia content lies in how to separate speech signals from other audio signals (e.g., background noise or music) in order to avoid recognizing mixed input signals. More generally, audio segmentation could allow the use of ASR acoustic models trained on particular acoustic conditions, such as wide bandwidth (high-quality microphone input) versus narrow telephone bandwidth, male versus female speaker, etc., thus improving the overall performance of the resulting system.
Finally, this segmentation could also be designed to provide additional useful information, such as the division into speaker turns and the speaker identities (allowing, e.g., automatic indexing and retrieval of all occurrences of the same speaker), as well as syntactic information (such as ends of sentences, punctuation marks, etc.). The other task we explore in this paper is keyword selection for supervised topic detection in the automatic transcriptions given by an ASR. Having representative keywords for each labeled topic can help cluster recognized speech more accurately. In fact, the topic detection task can be used for indexing audio content from broadcast news, conferences, lectures, etc. This paper is arranged as follows: section II gives a general overview of the system; section III introduces the keyword selection and topic detection tasks and their computation for the present setup; section IV presents the audio segmentation task; sections V to VII describe the database and the experimental results; finally, in section VIII some conclusions are discussed.

II. GENERAL OVERVIEW

An overview of the proposed system architecture is shown in figure 1. In the current system, speech is segmented off-line using the time marks provided in the database annotations. In a parallel task, the system receives the full audio content and separates speech frames from music, music+speech or background noise. In the system under development, the audio is expected to be segmented as a step prior to the speech recognition process; then, only speech frames will be recognized. This stage is very important for the rest of the process, since we are not interested in wasting time trying to recognize audio segments that do not contain "useful" speech. Off-line processing is required beforehand to select the keywords needed for the topic detection task.

Fig. 1. General overview of the proposed system architecture.

III. TOPIC DETECTION

One of the goals in topic detection is to develop automatic methods for identifying topically related stories within a stream of audio (e.g. news media, conferences, lectures, etc.) or recognized text. According to [1], a topic is a "seminal event or activity, along with directly related events and activities". A story is considered "on topic" when it discusses events and activities that are directly connected to that topic's seminal event. For topic identification, keywords are utilized. This is done based not only on a word's relevance in the current document, but also on its relevance in the rest of the documents in the corpus. To measure word relevance, Term Frequency-Inverse Document Frequency (TF-IDF) is used [2].

A. Keyword selection

We face the problem of keyword selection as a feature selection problem. In this work the TF-IDF weight is used for selecting words as keywords from the training database. Though TF-IDF is a relatively old weighting scheme, it is simple and effective, and it can be a starting point for future algorithms [3], [4]. Figure 2 shows the scheme used for keyword selection. First, all the documents labeled as topic n are gathered together. Then TF-IDF is applied to each of the resulting sets. Different selection criteria were used to select the final set of keywords.

Fig. 2. Keyword selection procedure.

B. Mathematical framework for TF-IDF

Essentially, TF-IDF works by determining the relative frequency of a word in a specific document compared to the inverse proportion of that word over the entire document corpus. This calculation determines how relevant a given word is in a particular document. Words that are common in a single document or in a small group of documents tend to have higher TF-IDF values than common words such as articles and prepositions. The implementation of TF-IDF works as follows. Given a corpus or document collection D, a word w_i and an individual document d_j ∈ D, we calculate the term frequency tf_{i,j} as

    tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}    (1)

where n_{i,j} is the number of times the word w_i appears in d_j and the denominator is the sum of the occurrences of all terms in document d_j, that is, the size of the document |d_j|. The inverse document frequency is a measure of the general importance of the term. It is calculated as follows:

    idf_i = \log \frac{|D|}{1 + |\{j : w_i \in d_j\}|}    (2)

where |D| is the total number of documents in the corpus and |\{j : w_i \in d_j\}| is the number of documents in which the term w_i appears. Finally, the TF-IDF weight is obtained as

    (tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i    (3)

A high tf-idf value means a high term frequency and a low document frequency of the term in the whole corpus. The words with the highest values are the most relevant for a given document, whereas extremely common words, such as articles and prepositions, hold no relevant meaning for a topic. This discriminatory power can therefore be used for selecting the keywords that represent each topic.
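To make the computation concrete, the following Python sketch implements equations (1)-(3) literally. It is a minimal illustration, not the implementation used in the experiments: it assumes each document is given as a list of word tokens, and the names (tf_idf, corpus) are hypothetical.

```python
from collections import Counter
from math import log

def tf_idf(corpus):
    """TF-IDF weight of every word in every document, per eqs. (1)-(3).

    `corpus` is assumed to be a list of documents, each a list of
    word tokens (illustrative layout, not the paper's actual code).
    """
    n_docs = len(corpus)                       # |D|
    doc_freq = Counter()                       # |{j : w_i in d_j}|
    for doc in corpus:
        doc_freq.update(set(doc))

    weights = []
    for doc in corpus:
        counts = Counter(doc)                  # n_{i,j}
        doc_len = len(doc)                     # |d_j| = sum_k n_{k,j}
        weights.append({w: (n / doc_len) * log(n_docs / (1 + doc_freq[w]))
                        for w, n in counts.items()})
    return weights

# One possible selection criterion: for each topic's gathered text,
# keep the N highest-weighted words, e.g.
#   keywords = sorted(w, key=w.get, reverse=True)[:20]
```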
C. Normalization

Once the keywords have been selected, the next step is to calculate the histogram of keywords per topic. This is done by calculating the relative frequency of each keyword over all the text labeled with each topic. Since each topic is represented by its own unique vector, the same range of values cannot be expected to be optimal across all topics unless these values are normalized.

D. Topic Detection

As shown in figure 3, the topic detection problem can be solved as a classification problem [5].

Fig. 3. Topic detection scheme.

After the speech is recognized, the keyword extraction module counts the number of times each keyword appears in the recognized text, resulting in the histogram vector k = [k_1, k_2, ..., k_n], where n is the number of keywords. Then, the classifier calculates the distance between the histogram vector k and every column vector of the histogram matrix K, as follows:

    d_j = \sum_{i=1}^{n} \frac{(K_{ij} - k_i)^2}{\sigma_i^2}    (4)

where d_j is the distance from the histogram vector k to the j-th column vector of matrix K, and j indexes the topics. The detected topic is the one with the minimum distance value.
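A minimal sketch of this minimum-distance rule, under the assumption that the histogram vector and matrix are NumPy arrays; the function and argument names below are illustrative, not the system's actual code.

```python
import numpy as np

def detect_topic(k, K, sigma):
    """Minimum-distance classifier of equation (4).

    k     : keyword histogram vector of the recognized text (length n)
    K     : n x m histogram matrix, one column per topic
    sigma : per-keyword standard deviations used for normalization
    """
    d = np.sum((K - k[:, None]) ** 2 / sigma[:, None] ** 2, axis=0)  # d_j
    return int(np.argmin(d))  # detected topic: minimum-distance column

# Example with n = 3 keywords and m = 2 topics:
#   K = np.array([[5., 1.], [2., 4.], [0., 3.]])
#   detect_topic(np.array([4., 2., 1.]), K, np.ones(3))  # -> 0
```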
IV. AUDIO SEGMENTATION

The problem of distinguishing speech signals from other audio signals (e.g. music) has become increasingly important as ASR systems are applied to more real-world multimedia domains. Therefore, a pre-processing stage that segments the signal into periods of speech and non-speech is invaluable in improving recognition accuracy. A general overview of the system is depicted in figure 4.

Fig. 4. Audio segmentation global scheme.

One of the issues in the design of a signal classifier is the selection of an appropriate feature set that captures the temporal and spectral structure of the signals. Many such features for speech/music discrimination have been suggested in the literature. Previous work on audio segmentation has focused on feature extraction analysis or on system architecture. Typically, authors have combined MFCCs with spectral features such as modulation energy and the percentage of low-energy frames, as in [6], [11], [12], or have used histogram equalization-based features, as in [7].

A. Baseline

Previous work at UPM-GTH (Grupo de Tecnología del Habla, Universidad Politécnica de Madrid) has focused on segmenting audio documents into a few acoustic classes (ACs), such as Clean Speech, Music, Speech with noise in background and Speech with music in background [8]. In this paper, only two classes have been considered, because the audio content included in the MAVIR database contains only two classes of audio events:
• Music [mu]. Music is understood in a general sense.
• Speech with background music [sm]. Overlapping of the speech and music classes.

For feature extraction, we have considered long-term statistics of MFCCs (Mel Frequency Cepstral Coefficients), spectral entropy [9] and CHROMA coefficients. CHROMA features are a powerful representation of music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave [10]. The baseline is a one-step system based on HMMs, like the one presented in figure 5. In particular, a 3-state HMM has been considered for each acoustic class, initially with 16 Gaussians per state [14]. The number of states has been adjusted following the topology proposed in [8].

Fig. 5. Baseline scheme.

B. Evaluation metrics

The evaluation metric used for audio segmentation is the same as the one used for speaker diarization experiments, described and used by NIST in the RT evaluations. This Diarization Error Rate (DER) is defined as the sum of the per-speaker false alarm time (falsely identifying speech), missed speech time (failing to identify speech) and speaker error time (here, incorrectly identifying speech over music), divided by the total amount of speech time in a test audio file. That is,

    DER = \frac{T_{FA} + T_{MISS} + T_{SPKR}}{T_{SPEECH}}    (5)

In our case, the total amount of speech time is the same as the total amount of scored time of the audio file. To measure it, the script MD-eval-v12.pl developed by NIST (available at http://www.itl.nist.gov/) was used.
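As a toy illustration of equation (5) only; the actual scoring was done with NIST's MD-eval-v12.pl, not with this sketch, and the example numbers are invented for illustration.

```python
def diarization_error_rate(t_fa, t_miss, t_spkr, t_speech):
    """Equation (5): false-alarm, missed-speech and class-confusion
    times (in seconds) divided by the total scored speech time."""
    return (t_fa + t_miss + t_spkr) / t_speech

# e.g. 30 s of false alarms, 20 s of misses and 30 s of confusions
# over 1000 s of scored speech give a DER of 8%:
#   diarization_error_rate(30.0, 20.0, 30.0, 1000.0)  # -> 0.08
```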
C. Feature analysis

During the experiments we evaluated a large number of features used in speech and speaker recognition. The best features for this task have been:
• MFCC15_E_D_A (mean+var): 15 MFCCs and local energy computed in 25 ms windows (with an overlap of 15 ms), plus their deltas and double deltas. The statistics are mean and variance, computed over a 1-second window with 0.5 s overlap.
• MFCC15_E_D_A (mean+std): similar to the previous one, but with mean and standard deviation as statistics.
• MFCC15_E_D_A (mean+std+skew): similar to the previous one, but with mean, standard deviation and skewness as statistics.
• MFCC15_E_D_A (mean+std+skew+kurt): adding kurtosis as a further statistic.
• MFCC15_E_D_A (mean+std+kurt): same as the previous one, removing skewness.
• MFCC15CHR_E_D_A (mean+std): 15 MFCCs and local energy computed in 25 ms windows (with an overlap of 15 ms), plus their deltas and double deltas, and 12 CHROMA coefficients computed every 50 ms. The statistics are mean and standard deviation over a 1-second window with 0.5 s overlap.
• MFCC15CHR+SpectralFeatures_E_D_A (mean+std): same as the previous one, adding the statistics (mean and standard deviation) of several spectral features computed on 50 ms frames (flux, centroid, entropy and band energies).
• MFCC15CHR+Entropy_E_D_A (mean+std): same as the previous one, adding only the mean and standard deviation of the spectral entropy.

This initial analysis was performed considering 16 Gaussians per state. After this analysis we decided to increase the number of Gaussians per state, so the experiment was repeated with 32, 64 and 128 Gaussians.
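To illustrate how such long-term statistics can be computed, the following Python sketch derives the mean, standard deviation, skewness and kurtosis of MFCC+delta features over 1-second segments with 0.5 s overlap. It assumes the librosa and scipy libraries, and it approximates rather than reproduces the front-end actually used (for brevity it omits the local energy term of the _E_ configurations).

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def long_term_features(y, sr=8000):
    """Long-term statistics of short-term MFCCs, in the spirit of the
    MFCC15_E_D_A (mean+std+skew+kurt) configuration; the exact window
    settings and librosa defaults here are illustrative assumptions."""
    # 15 MFCCs in 25 ms windows with a 10 ms hop (i.e. 15 ms overlap)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=15,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    # append deltas and double deltas (the _D_A part of the name)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    # statistics over 1 s segments (100 frames) with 0.5 s overlap
    win, hop = 100, 50
    stats = []
    for start in range(0, feats.shape[1] - win + 1, hop):
        seg = feats[:, start:start + win]
        stats.append(np.concatenate([seg.mean(axis=1), seg.std(axis=1),
                                     skew(seg, axis=1),
                                     kurtosis(seg, axis=1)]))
    return np.array(stats)  # one long-term feature vector per segment
```

For the MFCC15CHR configurations, CHROMA coefficients could be appended analogously (e.g. with librosa.feature.chroma_stft) before computing the segment statistics.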
V. DATABASE DESCRIPTION

Experiments have been carried out using the MAVIR database. Tests on audio segmentation have been run on the tourism video corpus provided by the MAVIR project (MAVIR is a research network co-funded by the Regional Government of Madrid and the European Social Fund under MA2VICMR, 2010-2013; more info at http://www.mavir.net). This corpus consists of a collection of tourism videoclips from the TURESPAÑA Corporate Website (Spanish Tourism Institute). It includes 39 videos in Spanish and 23 videos in English. It was annotated by the Linguistics Department of the Universidad Autónoma de Madrid under the MAVIR project. The Spanish partition of the database includes around 2 hours of video. The audio signals are provided in PCM format, mono, 16 bits, with a sampling frequency of 8 kHz.

VI. EXPERIMENTAL RESULTS FOR AUDIO SEGMENTATION

Tables I, II and III present the results for the features described in section IV-C, using 32, 64 and 128 Gaussians per state, respectively.

TABLE I
RESULTS FOR DIFFERENT FEATURES USING 32 GAUSSIANS PER STATE

Feature                                        DER(%)
MFCC15_E_D_A (mean+var)                        20.09
MFCC15_E_D_A (mean+std)                        19.78
MFCC15_E_D_A (mean+std+skew)                   19.72
MFCC15_E_D_A (mean+std+skew+kurt)              18.01
MFCC15_E_D_A (mean+std+kurt)                   18.11
MFCC15CHR_E_D_A (mean+std)                     15.21
MFCC15CHR+SpectralFeatures_E_D_A (mean+std)    14.95
MFCC15CHR+Entropy_E_D_A (mean+std)             14.12

TABLE II
RESULTS FOR DIFFERENT FEATURES USING 64 GAUSSIANS PER STATE

Feature                                        DER(%)
MFCC15_E_D_A (mean+var)                        18.34
MFCC15_E_D_A (mean+std)                        17.09
MFCC15_E_D_A (mean+std+skew)                   17.22
MFCC15_E_D_A (mean+std+skew+kurt)              18.01
MFCC15_E_D_A (mean+std+kurt)                   15.37
MFCC15CHR_E_D_A (mean+std)                     13.98
MFCC15CHR+SpectralFeatures_E_D_A (mean+std)    11.54
MFCC15CHR+Entropy_E_D_A (mean+std)              9.56

TABLE III
RESULTS FOR DIFFERENT FEATURES USING 128 GAUSSIANS PER STATE

Feature                                        DER(%)
MFCC15_E_D_A (mean+var)                        16.32
MFCC15_E_D_A (mean+std)                        15.23
MFCC15_E_D_A (mean+std+skew)                   14.34
MFCC15_E_D_A (mean+std+skew+kurt)              15.32
MFCC15_E_D_A (mean+std+kurt)                   12.51
MFCC15CHR_E_D_A (mean+std)                     10.15
MFCC15CHR+SpectralFeatures_E_D_A (mean+std)     9.63
MFCC15CHR+Entropy_E_D_A (mean+std)              7.96

VII. EXPERIMENTAL RESULTS FOR TOPIC DETECTION

Different experiments have been carried out on the MAVIR database. For experiment 1, all the annotation texts were used for keyword selection. For experiment 2, the annotation texts were divided into training and test sets (90%-10%). The word recognition error rate of the ASR is about 45%. Different numbers of keywords per topic have been evaluated. Table IV presents the topic detection accuracy for both experiments.

TABLE IV
RESULTS FOR DIFFERENT EXPERIMENTS ON THE MAVIR DATABASE

No. kws/topic    Experiment 1 (%)    Experiment 2 (%)
10               76.92               57.69
11               80.77               57.69
12               84.62               61.54
13               88.46               61.54
14               88.46               65.38
15               96.15               61.54
16               96.15               65.38
17               96.15               69.23
18               92.31               76.92
19               100                 80.77
20               100                 82.46
21               100                 80.77
22               100                 80.77

VIII. CONCLUSIONS

For the topic detection task, we can conclude that:
• Normalization over each topic is needed, because TF-IDF can yield different ranges of values for the same keyword in different topics.
• TF-IDF is a simple and efficient algorithm for selecting representative words in order to identify and detect different topics within a whole collection of documents.
• Despite its strength, however, TF-IDF has some limitations. The algorithm does not take into account relationships between words (e.g. synonyms or plural forms). In this experiment TF-IDF could not recognize words such as "parque" and "parques" as the same word, counting each of them separately instead of evaluating them as a single term. This can decrease the weight given to that word in the keyword set. For large document collections, this could become an escalating problem.

And for the audio segmentation task we conclude that:
• Including CHROMA coefficients significantly reduces the error for all ACs, from 12.51% to 7.96%.
• In all cases, increasing the number of Gaussians improves the results. The best results are obtained using the MFCC+CHROMA+Entropy features.
• In summary, for the best configuration of the one-step system, we have obtained a 7.96% error (using the NIST tool).
REFERENCES

[1] Fiscus, J. and Doddington, G., "Topic Detection and Tracking Evaluation Overview", chapter in Topic Detection and Tracking: Event-based Information Organization, National Institute of Standards and Technology, USA, 2002.
[2] Schultz, J.M. and Liberman, M., "Topic Detection and Tracking using idf-Weighted Cosine Coefficient", Proc. of the DARPA Broadcast News Workshop 1999, pp. 189-192, USA, 1999.
[3] Wintrode, J. and Kulp, S., "Confidence-based techniques for rapid and robust topic identification of conversational telephone speech", Proc. of Interspeech 2009, England, 2009.
[4] Ramos, J., "Using TF-IDF to determine word relevance in document queries", Department of Computer Science, Rutgers University, USA, 2003.
[5] Mahajan, M., Beeferman, D. and Huang, X.D., "Improved topic-dependent language modeling using information retrieval techniques", Proc. IEEE ICASSP 1999, vol. 1, pp. 541-544, 1999.
[6] Izumitani, T., Mukai, R. and Kashino, K., "A background music detection method based on robust feature extraction", Proc. IEEE ICASSP 2008, pp. 13-16, 2008.
[7] Gallardo, A. and Montero, J.M., "Histogram Equalization-Based Features for Speech, Music, and Song Discrimination", IEEE Signal Process. Lett., vol. 17, no. 7, July 2010.
[8] Gallardo, A. and San-Segundo, R., "UPM-UC3M system for music and speech segmentation", Proc. of the Jornadas de Tecnología del Habla FALA 2010, Spain, November 2010.
[9] Misra, H., Ikbal, S., Bourlard, H. and Hermansky, H., "Spectral entropy based feature for robust ASR", Proc. IEEE ICASSP 2004, pp. 193-198, 2004.
[10] Eyben, F., Wöllmer, M. and Schuller, B., "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor", Proc. ACM Multimedia (MM), ACM, Firenze, Italy, 2010.
[11] Huijbregts, M. and de Jong, F., "Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content", Speech Communication, vol. 53, no. 2, pp. 143-153, 2011.
[12] Tsai, W.H. and Lin, H.P., "Background Music Removal Based on Cepstrum Transformation for Popular Singer Identification", IEEE Trans. Audio, Speech and Language Processing, vol. 19, no. 5, pp. 1196-1205, 2011.
[13] Moreno, J. et al., "Some experiments in evaluating ASR systems applied to multimedia retrieval", Proc. of the 7th Intl. Conf. on Adaptive Multimedia Retrieval, pp. 12-23, Spain, 2009.
[14] Ajmera, J., McCowan, I. and Bourlard, H., "Speech/music segmentation using entropy and dynamism features in a HMM classification framework", Speech Communication, vol. 40, pp. 351-363, 2003.
[15] Lane, I., Kawahara, T. and Matsui, T., "Language model switching based on topic detection for dialog speech recognition", Proc. IEEE ICASSP 2003, pp. 616-619, 2003.