Design of a Plugin-based Approach for Word-per-Word Alignment Using Automatic Speech Recognition Joke Van den Mergele Supervisor: Prof. dr. ir. Rik Van de Walle Counsellors: Ir. Tom De Nies, Ir. Miel Vander Sande, Dr. Wesley De Neve, dhr. Kris Carron Master's dissertation submitted in order to obtain the academic degree of Master of Science in de ingenieurswetenschappen: computerwetenschappen Department of Electronics and Information Systems Chairman: Prof. dr. ir. Jan Van Campenhout Faculty of Engineering and Architecture Academic year 2013-2014 Word of Thanks Firstly, I would like to thank my supervisor, prof. dr. ir. Rik Van de Walle, for allowing the possibility of researching this dissertation. I would also like to thank my associate supervisors dr. Wesley De Neve and ir. Tom De Nies, for their feedback and inspiration during this year, as well as Kris Carron, without whom this dissertation would have never existed. I would like to thank my parents for allowing me to take up these studies, and my partner, Niels, for his support and comfort when the end did not seem within sight. Dankwoord Ik zou als eerste graag mijn promotor, prof. dr. ir. Rik Van de Walle, bedanken om het onderzoek naar deze thesis goed te keuren. Verder zou ik ook graag mijn begeleiders dr. Wesley De Neve en ir. Tom De Nies bedanken voor hun feedback en motivatie gedurende het hele jaar, alsook Kris Carron, aangezien deze thesis nooit mogelijk zou zijn geweest zonder hem. Mijn ouders wil ik ook graag bedanken omdat ze me de mogelijkheid hebben gegeven deze studies te volgen. Als laatste wil ik mijn partner Niels bedanken voor zijn steun en comfort op momenten wanneer het einde nog lang niet in zicht leek. Joke Van den Mergele, June 2014 Usage Permission “The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of the copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.” Joke Van den Mergele, June 2014 Design of a Plugin-based Approach for Word-per-Word Alignment Using Automatic Speech Recognition by Joke Van den Mergele Dissertation submitted for obtaining the degree of Master in Computer Science Engineering Academic year 2013-2014 Ghent University Faculty of Engineering Department of Electronics and Information Systems Head of the Department: prof. dr. ir. J. Van Campenhout Supervisor: prof. dr. ir. R. Van de Walle Associate supervisors: dr. W. De Neve, ir. T. De Nies, ir. M. Vander Sande Summary In this dissertation we present the design of a plugin-based system to perform automatic speech-text alignment.
Our goal is to investigate whether performing an alignment task automatically, instead of manually, is possible with current state-of-the-art open-source automatic speech recognition systems. We test our application on Dutch audiobooks and text, using the CMU Sphinx automatic speech recognition system as plugin. We were provided with test data by the Playlane company, one of the companies that manually align their audio and text. Dutch is a slightly undersourced language when it comes to the data necessary for speech recognition. However, from our test results, we conclude that it is indeed possible to automatically generate an alignment of audio and text that is accurate enough for use, using the CMU Sphinx ASR system to perform the alignment.

Samenvatting
In deze thesis presenteren we een ontwerp van een plugin-gebaseerd systeem om automatische spraak-tekst uitlijning uit te voeren. Ons doel is om te onderzoeken of het mogelijk is een uitlijningstaak automatisch uit te voeren, in plaats van manueel, met de huidige opensource automatische spraakherkenningssystemen. We testen ons ontwerp op Nederlandse audioboeken en tekst, gebruikmakend van het CMU Sphinx automatisch spraakherkenningssysteem als plugin. We werden voorzien van testdata door het Playlane bedrijf, dat één van de bedrijven is die hun audio en tekst manueel uitlijnen. Nederlands is een weinig voorkomende taal als het gaat over de nodige data voor spraakherkenning. We kunnen echter concluderen uit onze testresultaten dat het inderdaad mogelijk is om automatisch een uitlijning van audio en tekst te genereren die nauwkeurig genoeg is voor gebruik, wanneer we het CMU Sphinx ASR systeem gebruiken om de uitlijning te genereren.

Keywords: automatic speech recognition (ASR), CMU Sphinx, Dutch audio, speech-text alignment
Trefwoorden: automatische spraakherkenning (ASH), CMU Sphinx, Nederlandse audio, spraak-tekst uitlijning

Design of a Plugin-based Approach for Word-per-Word Alignment Using Automatic Speech Recognition
Joke Van den Mergele
Supervisor(s): prof. dr. ir. Rik Van de Walle, dr. Wesley De Neve, ir. Tom De Nies, ir. Miel Vander Sande

Abstract— In this paper we present the design of a plugin-based system to perform automatic speech-text alignment. Our goal is to investigate whether performing an alignment task automatically, instead of manually, is possible with current state-of-the-art open-source automatic speech recognition systems. We test our application on Dutch audiobooks and text, using the CMU Sphinx [1] automatic speech recognition system as plugin. Dutch is a slightly undersourced language when it comes to the data necessary for speech recognition. However, from our test results, we conclude that it is indeed possible to automatically generate an alignment of audio and text that is accurate enough for use, using the CMU Sphinx ASR system to perform the alignment.

Keywords— automatic speech recognition (ASR), CMU Sphinx, Dutch audio, speech-text alignment

I. INTRODUCTION
AUTOMATIC speech recognition (ASR) has been the subject of research for over 50 years, ever since Bell Labs performed their very first small-vocabulary speech recognition tests during the 1950s, trying to automatically recognise digits over the telephone [2]. As computing power grew during the 1960s, filter banks were combined with dynamic programming to produce the first practical speech recognizers. These were mostly for isolated words, to simplify the task.
In the 1970s, much progress was made in commercial small-vocabulary applications over the telephone, due to the use of custom special-purpose hardware. Linear Predictive Coding (LPC) became a dominant automatic speech recognition component, as an automatic and efficient method to represent speech. Core ASR methodology has evolved from the expert-system approaches of the 1970s, which used spectral resonance (formant) tracking, to the modern statistical method of Markov models based on a Mel-Frequency Cepstral Coefficient (MFCC) approach [2]. Since the 1980s the standard has been Hidden Markov Models (HMMs), which have the power to transform large numbers of training units into simpler probabilistic models [3, 4]. During the 1990s, commercial applications evolved from isolated-word dictation systems to general-purpose continuous-speech systems. ASR has largely been implemented in software, e.g. for medical reporting, legal dictation, and the automation of telephone services. With the recent adoption of speech recognition in Apple, Google, and Microsoft products, the ever-improving ability of devices to handle relatively unrestricted multimodal dialogues is showing clearly. Despite the remaining challenges, the fruits of several decades of research and development in the speech recognition field can now be seen. As Huang, Baker, and Reddy [5] said: “We believe the speech community is en route to pass the Turing Test1 in the next 40 years with the ultimate goal to match and exceed a human’s speech recognition capability for everyday scenarios.”
But even now, there is no such thing as a perfect speech recognition system. Each system has its own limitations and conditions that allow it to perform optimally. However, what has the biggest influence on how a speech recognition system performs is how well-trained the acoustic model is, and what specific task it is trained for [4]. For example, an acoustic model can be trained on one person specifically (or be updated to better recognize that person’s speech), it can be trained to perform well on broadcast speech, on telephone conversations, on a certain accent, or on certain words if only commands must be recognized, etc. Thus, if one has access to an automatic speech recognition system, and knows which task one needs to use it for, it is very useful to train the acoustic model according to that task.
With the recent boom in audiobooks [7], a whole new set of training data is available, as audiobooks are recorded under optimal conditions and the text that is read is obtainable for each book. But there is also another way these audiobooks might be used by a speech recognition system: why not align these audiobooks with their book content, using a speech recognition system, and so automate the process of creating digital books that contain both audio and textual content?
In the context of this project, we worked together with the Playlane company (now Cartamundi Digital2), who were kind enough to provide us with test data for our application. The Playlane company digitizes children’s books, adding games and educational content. One part of this educational content is the alignment between the book’s text and the read-aloud version of the book. Currently, this alignment is done entirely manually. Our goal is to provide companies like this, or people in need of aligning audio and text, with a way to achieve this alignment automatically, using ASR systems.
The remainder of this paper is structured as follows: first, in Section II, we provide some insight into interesting related work we came across. We then give a general overview of automatic speech recognition systems in Section III: we briefly present how they are generally built and the building blocks they consist of, and point out the blocks that are most important when working with our own application. In Section IV, we provide a summary of the ASR plugin system we applied in our application. A high-level view of how our application was designed, and why we made those design decisions, is discussed in Section V. We also provide guidelines for the reader to get the most accurate results when using our application. In Section VI we discuss the accuracy of the speech-text transcriptions from our application, and what we changed to improve that accuracy. The conclusions we drew from our results, and future work, are described in Section VII.
1 The phrase “The Turing Test” is most properly used to refer to a proposal made by Turing (1950) as a way of dealing with the question whether machines can think [6], and is now performed as a test to determine the ‘humanity’ of the machine.
2 We will refer to the company as “Playlane” instead of “Cartamundi Digital”, since Cartamundi Digital envelops many companies, and we only work with the products created by Playlane.

II. RELATED WORK
In [8], the authors report on the automatic alignment of audiobooks in Afrikaans. They use an already existing Afrikaans pronunciation dictionary and create an acoustic model from an Afrikaans speech corpus. They use the book “Ruiter in die Nag” by Mikro to partly train their acoustic model, and to perform their tests on. Their goal is to align large Afrikaans audio files at word level, using an automatic speech recognition system. They developed three different automatic speech recognition systems in order to compare them and discover which performs best; all three of them are built using the HTK toolkit [9]. To determine the accuracy of their automatic alignment results, they compare the difference between the final aligned starting position of each word and an estimate of the starting position obtained by phoneme recognition. They discovered that the main causes of alignment errors are:
• speaker errors, such as hesitations, missing words, repeated words, stuttering, etc.;
• rapid speech containing contractions;
• difficulty in identifying the starting position of very short (one- or two-phoneme) words; and
• a few text normalization errors (e.g. ‘eenduisend negehonderd’ for ‘neëntienhonderd’).
Their final conclusions are that the baseline acoustic model does provide a fairly good alignment for practical purposes, but that the model that was trained on the target audiobook provided the best alignment results. The reason their research is interesting to us is that, just like Afrikaans, Dutch is a slightly undersourced language (though not as undersourced as Afrikaans), despite the large efforts made by the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) [10]. For example, the acoustic models of Voxforge [11] we use, for both English and Dutch speech recognition, contain around 40 hours of speech from over a hundred speakers for English, but only 10 hours of speech for Dutch. The main causes of alignment errors they discovered are, of course, also interesting for us to know, since we can present these to the users of our system and create awareness.
However, as the books are read under professional conditions, it is unlikely that there will be many speaker errors. The third interesting fact they researched is that training the acoustic model on part of the target audiobook provides the best alignment results of their tested models. We will also try to achieve this; however, we will train the acoustic model on other audiobooks read by the same person, preferably a book with the same reading difficulty classification as the target audiobook, as these have similar pauses and word lengths.
The authors of [12] try out the alignment capabilities of their recognition system under near-ideal conditions, i.e. on audiobooks. They also created three different acoustic models: one trained on manual transcriptions, one trained on the audiobooks at syllable level, and one trained on the audiobooks at word level. They draw the same conclusions as the authors of [8], namely that training the acoustic models on the target audiobooks provides better results, and that aligning audiobooks (which are recorded under optimal conditions) is ‘easier’ than aligning real-life speech with background noises or distortions. They also performed tests using acoustic models that were completely speaker-independent, slightly adapted to and trained on a specific speaker, or completely trained on a specific speaker. It may come as no surprise that they discovered that the acoustic model that was trained on a certain person provided an almost perfect alignment of a text spoken by that person. However, the one part of this article that is extremely interesting to us is that they quantify the sensitivity of a speech recognizer to the articulation characteristics and peculiarities of the speaker. The recognition accuracy results for the individual speakers deviate considerably in both directions from the average accuracy value of about 74%. They believe this high deviation in the scores is mostly due to the sensitivity of the recognizer to the actual speaker’s voice. It would thus be a good idea to train an acoustic model for each voice actor a company such as Playlane works with, or at least to adapt the acoustic model we use to their voice actors by training it on their speech, if the results we achieve with our speech recognition application turn out to be suboptimal.

III. AUTOMATIC SPEECH RECOGNITION
The main goal of speech recognition is to find the most likely word sequence, given the observed acoustic signal. Solving the speech decoding problem then consists of finding the maximum of the probability of the word sequence w given signal x, or, equivalently, maximizing the “fundamental equation of speech recognition” Pr(w)f(x|w). Most state-of-the-art automatic speech recognition systems use statistical models. This means that speech is assumed to be generated by a language model and an acoustic model. The language model generates estimates of Pr(w) for all word strings w and depends on high-level constraints and linguistic knowledge about allowed word strings for the specific task. The acoustic model encodes the message w in the acoustic signal x, which is represented by a probability density function f(x|w). It describes the statistics of sequences of parametrized acoustic observations in the feature space, given the corresponding uttered words.
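Written out in full, the decoding criterion referred to above is the maximum a posteriori rule (restated here only for clarity, in the same notation as the text):
\[ \hat{w} = \arg\max_{w} \Pr(w)\, f(x \mid w), \]
where the language model supplies Pr(w) and the acoustic model supplies f(x|w).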
The authors of [13] divide such a speech recognition system into several components. The main knowledge sources are the speech and text corpus, which represent the training data, and the pronunciation dictionary. The training of the acoustic and language model relies on the normalisation and preprocessing of the training data, such as N-gram estimation and feature extraction. This helps to reduce lexical variability and transforms the texts to better represent the spoken language. This step is, however, language-specific: it includes rules on how to process numbers, hyphenation, abbreviations and acronyms, apostrophes, etc. After training, the resulting acoustic and language model are used for the actual speech decoding. The input speech signal is first processed by the acoustic front end, which usually performs feature extraction, and is then passed on to the decoder. With the language model, acoustic model and pronunciation dictionary at its disposal, the decoder is able to perform the actual speech recognition and returns the speech transcription to the user. According to [14], the different parts discussed above can be grouped into the so-called five basic stages of ASR:
1. Signal Processing/Feature Extraction: this stage represents the acoustic front end. The same techniques are also used on the speech corpus for feature extraction. For our application, we use Mel-frequency cepstral coefficients (MFCC) [14] to perform feature extraction.
2. Acoustic Modelling: this stage encompasses the different steps needed to build the acoustic model. Our acoustic models are trained as hidden Markov models (HMMs) [15].
3. Pronunciation Modelling: this stage creates the pronunciation dictionary, which is used by the decoder.
4. Language Modelling: in this stage, the language model is created. The last, and most important, step in its creation is the N-gram estimation (illustrated after this list).
5. Spoken Language Understanding/Dialogue Systems: this stage refers to the entire system that is built and how it interacts with the user.
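To give one concrete (and standard, not application-specific) illustration of the N-gram estimation step mentioned in the language modelling stage: an N-gram language model approximates the probability of a word given its history by relative counts over the text corpus, typically combined with smoothing for unseen sequences,
\[ \Pr(w_i \mid w_{i-N+1}, \dots, w_{i-1}) \approx \frac{C(w_{i-N+1}, \dots, w_i)}{C(w_{i-N+1}, \dots, w_{i-1})}, \]
where C(·) denotes how often a word sequence occurs in the training text.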
IV. THE ASR PLUGIN USED FOR OUR APPLICATION
CMU Sphinx is the general term used to describe a group of speech recognition systems developed at Carnegie Mellon University (CMU). They include a series of speech recognizers (Sphinx-2 through 4) and an acoustic model trainer (SphinxTrain). In 2000, the Sphinx group at Carnegie Mellon committed to open-sourcing several speech recognizer components, including Sphinx-2 and, a year later, Sphinx-3. The speech decoders come with acoustic models and sample applications. The available resources include software for acoustic model training, language model compilation and a public-domain pronunciation dictionary for English, “cmudict”.
The Sphinx-4 speech recognition system [1] is the latest addition to Carnegie Mellon University’s repository of Sphinx speech recognition systems. It has been jointly designed by Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs, and Hewlett-Packard’s Cambridge Research Lab. It differs from the earlier CMU Sphinx systems in terms of modularity, flexibility and algorithmic aspects. It uses newer search strategies, and is universal in its acceptance of various kinds of grammars, language models, types of acoustic models and feature streams. Sphinx-4 is developed entirely in the Java programming language and is thus very portable. It also enables and uses multi-threading and permits highly flexible user interfacing. We make use of the latest Sphinx addition, Sphinx-4, in our system, but our Sphinx configuration uses a Sphinx-3 loader to load the acoustic model in the decoder module. The high-level architecture of CMU Sphinx-4 is fairly straightforward. The three main blocks are the front end, the decoder, and the knowledge base, which are all controllable by an external application that provides the input speech and transforms the output to the desired format, if needed. The Sphinx-4 architecture is designed with a high degree of modularity. All blocks are independently replaceable software modules, except for the blocks within the knowledge base, and are written in Java. For more information about the several Sphinx-4 modules, we refer to [16], and the Sphinx-4 source code and documentation [1, 17].

V. OUR APPROACH
A. High-Level View
Our application is written entirely in the Java language, and consists of three separate components: the Main component, the Plugin Knowledge component, and the ASR Plugin component; see Figure 1.
Fig. 1. High-level view of our application
• The first component is the Main component. This is where the main functionality of our application is located: it parses the command line, chooses the plugin, and loads it into the application (as the plugin is located in a different component, see below). It also contains the testing framework.
• The second component is the Plugin Knowledge component, which contains all the functionality one needs to implement the actual plugin. It provides the user with two possible output formats, namely a standard subtitle file (.srt file format) and an EPUB file. This component receives the audio and text input from the main component, and passes it along to the ASR plugin component.
• The third component is where the ASR plugin is actually located. We refer to this component as the ‘plugin component’, since it contains the ASR system.
We decided to split our application into these three components to keep the addition of a new plugin to the application as easy as possible. If a person wants to change the ASR system that is used by our application, they only need to provide a link from the second component to the plugin component. By splitting up the first and second component, they do not need to work out all the extra functionality and modules that are used by the first component, which have no impact on the ASR plugin whatsoever, and can start work even when only being provided with the second component.
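To make the boundary between the Plugin Knowledge component and the plugin component concrete, the sketch below shows one possible minimal Java interface for an ASR plugin. The names used here (WordTiming, AsrAlignerPlugin, align) are illustrative only and are not taken from our actual source code.

import java.io.File;
import java.io.IOException;
import java.util.List;

/** One aligned word: the token from the input text and its timing in milliseconds. */
final class WordTiming {
    final String word;
    final long startMs;
    final long stopMs;

    WordTiming(String word, long startMs, long stopMs) {
        this.word = word;
        this.startMs = startMs;
        this.stopMs = stopMs;
    }
}

/** Contract an ASR plugin component could implement for the Plugin Knowledge component. */
interface AsrAlignerPlugin {
    /** Aligns the given text with the given audio file and returns one timing per word, in reading order. */
    List<WordTiming> align(File audio, String text) throws IOException;
}

With such a boundary, the Plugin Knowledge component only has to convert the returned list into an .srt or EPUB file, regardless of which ASR system produced it.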
B. Guidelines to Increase Alignment Accuracy
We describe a number of characteristics the audio and text data must have for our application and, mainly, the Sphinx ASR plugin to work as accurately as possible.
• Firstly, the input audio file needs to conform to a number of characteristics: it needs to be monophonic, have a sampling rate of 16 kHz, and each sample must be encoded in 16 bits, little endian. We use a small tool called SoX [18] to achieve this:
$> sox "inputfile" -c 1 -r 16000 -b 16 --endian little "outputfile.wav"
This tool is also useful to cut long audio files into smaller chunks (an audio file length of around 30 minutes is preferable to create a good alignment).
• The input text file that contains the text that needs to be aligned with the audio file should preferably be in a simple text format, such as .txt. It needs, however, to be encoded in UTF-8. This is usually already the case, but it can easily be verified and applied in source code editors such as Notepad++ [19]. This is needed to correctly interpret the special characters that might be present in the text, such as quotes, accented letters, etc.

VI. RESULTS
We were provided with 10 Dutch books by the PlayLane company, which we were able to use to verify our application. The subtitle files that were provided by PlayLane were made manually by its employees: they listen to the audio track and manually set the timings for each word. To verify the accuracy of our application we needed books that already had a word-per-word transcription, so we could compare that transcription with the one we generated using the ASR plugin. The books we used are listed below:
• Avontuur in de woestijn
• De jongen die wolf riep
• De muzikanten van Bremen
• De luie stoel
• Een hut in het bos
• Het voetbaltoneel
• Luna gaat op paardenkamp
• Pier op het feest
• Ridder muis
• Spik en Spek: Een lek in de boot
All these books are read by S. and V., who are both female and have Dutch as their native language. Only two books, namely “De luie stoel” and “Het voetbaltoneel”, are read by S.; the others are read by V. In Figure 2, the number of words for each book is shown, as well as the length of the audio files, for both the slow and the normal pace version of each book.
Fig. 2. Chart containing the size of the input text file, and the length of the input audio files for both normal and slow pace
The difference between both transcriptions is measured in milliseconds, word per word. For each word, the difference between both transcriptions’ start times and stop times is calculated separately, and we take the mean over all the words that appear in both files. We decided to calculate the average of start and stop times for each word separately when we discovered, after careful manual inspection of the very first results, that Sphinx-4 has the tendency to allow more pause at the front of a word than at the end. In other words, it has the tendency to start highlighting a word in the pause before it is spoken, but stops the highlighting of the word more neatly after it is said.
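As a simplified sketch of this comparison (reusing the illustrative WordTiming class from the earlier sketch, and matching words by position for brevity, whereas the actual comparison only considers words appearing in both files), the mean start and stop time differences could be computed as follows; the class and method names are again purely illustrative:

import java.util.List;

final class AlignmentComparison {
    /**
     * Returns {meanStartDiff, meanStopDiff} in milliseconds between a manual
     * reference alignment and an automatically generated alignment.
     */
    static double[] meanDifferences(List<WordTiming> reference, List<WordTiming> generated) {
        int n = Math.min(reference.size(), generated.size());
        if (n == 0) {
            return new double[] { 0.0, 0.0 };
        }
        long startSum = 0;
        long stopSum = 0;
        for (int i = 0; i < n; i++) {
            startSum += Math.abs(reference.get(i).startMs - generated.get(i).startMs);
            stopSum += Math.abs(reference.get(i).stopMs - generated.get(i).stopMs);
        }
        return new double[] { (double) startSum / n, (double) stopSum / n };
    }
}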
Figure 3 shows the average difference of the start and stop times for each word, between the files provided by PlayLane and the automatically generated transcription provided by our application.
Fig. 3. The mean start and stop time differences between the automatically generated alignment and the PlayLane timings
Six out of the eight books read by V. have timings that are synchronized with, on average, less than one second of difference between the output from our system and the one provided by PlayLane. The two books read by S. have the highest average start and stop time differences, which is why we decided to train the acoustic model further on her voice; see Figure 5.
We also thought it might be interesting to know how well our ASR system performs when there are words missing from the input text. We therefore decided to remove the word “muis” from the “Ridder Muis” input text. The results can be found in Figure 4, and clearly show that the most accurate synchronisation results are achieved when the input text file represents the actually spoken text as closely as possible.
Fig. 4. The mean start and stop, and maximum time differences between the automatically generated alignment and the PlayLane timings for the normal pace “Ridder Muis” book, with the word “muis” missing from the input text
As mentioned before, we wanted to train our acoustic model on S.’s voice, as it seemed to get the worst results on the alignment task. We trained the acoustic model on a book called “Wolf heeft jeuk”, which is also read by S. but is not part of the test data. We trained it on the last chapter of the book, which contained 13 sentences with a total of 66 words, covering 24 seconds of audio. We also trained the original acoustic model on a part of the book “De luie stoel”; from this book we used 23 sentences with a total of 95 words, covering 29 seconds of audio. The alignment results we achieved when using the trained acoustic models can be seen in Figure 5.
Fig. 5. Mean start time difference of each normal pace book, using the original acoustic model, the acoustic model trained on “Wolf heeft jeuk”, or the acoustic model trained on “De luie stoel”
Figure 5 shows definite improvements for both books read by S., as we expected, though the mean time difference is still almost 20 seconds for “De luie stoel” and over 40 seconds for “Het voetbaltoneel” for the alignment results we achieved using the “Wolf heeft jeuk” acoustic model. The average time difference for the book “De luie stoel” is only around 300 milliseconds when we perform the alignment task using the acoustic model trained on “De luie stoel”; the alignment for “Het voetbaltoneel” has also improved in comparison to when we use the model trained on “Wolf heeft jeuk”, with only around 30 seconds of average time difference instead of 40. This means the alignment results with the acoustic model trained on “Wolf heeft jeuk” are still not acceptable, but they are promising if that model were to be trained further on that book. The alignment results achieved by using the acoustic model trained on the book “De luie stoel” are near perfect for that book, and also provide an improvement for the book “Het voetbaltoneel”. This leads us to believe that further training an acoustic model on S.’s voice will yield much improved alignment results for the books read by S. Of the books that are read by V., some have around the same accuracy with the newly trained acoustic models as with the old one, and others have a better accuracy. But, as four out of eight books have a worse accuracy, we conclude that, in general, the acoustic model trained on S.’s voice has a negative influence on the accuracy for books read by V.

VII. CONCLUSION AND FUTURE WORK
The goal of this dissertation was to investigate whether performing an alignment task automatically, instead of manually, lies within the realm of the possible.
Therefore, we created a software application that provides its user with the option to simply switch between different ASR systems, via the use of plugins. We provide extra flexibility by offering two different output formats (a general subtitle file, and an EPUB file), and by making the creation of a new output format as simple as possible. From the results in the previous section, using the CMU Sphinx ASR plugin, we conclude that it is indeed possible to automatically generate an alignment of audio and text that is accurate enough for use (e.g., our test results show, on average, less than one second of difference between the automatic alignment results and a pre-existing baseline). However, there is still work to be done, especially for undersourced languages such as Dutch. We achieved positive results when training the acoustic model on (less than 60 seconds of) audio data that corresponded with the person or type of book we wanted to increase alignment accuracy for. Our first suggestion for future work is therefore to further train the acoustic model for Dutch, especially when one has a clearly defined type of alignment task to perform. Considering that it can take days to manually align an audiobook, this small effort to train an acoustic model appears highly beneficial, keeping in mind the time gained by automatically generating an accurate alignment. The trained model could also achieve accurate results on multiple books, meaning that it is not necessary to train an acoustic model for every new alignment task. We also note that the accuracy of the input text and the coverage of the pronunciation dictionary strongly influence the accuracy of the alignment output. From our tests, we conclude that it is best not to have words missing from the input text or the pronunciation dictionary. There is a clear need for a more robust system, with fewer unexplained outlying results. We propose to increase the robustness of our application by comparing the alignment results created by two or more different ASR plugins; the overlapping results, within a certain error range, can then be considered ‘correct’. This approach is based on the approach followed in [20]. It is our belief that the system we designed provides a flexible approach to speech-text alignment and, as it can be adapted to the user’s preferred ASR system, may be of benefit to users who previously performed the alignment task manually.

REFERENCES
[1] Carnegie Mellon University, “CMU Sphinx Wiki,” http://cmusphinx.sourceforge.net/wiki/.
[2] D. O’Shaughnessy, “Invited Paper: Automatic Speech Recognition: History, Methods and Challenges,” Pattern Recognition, vol. 41, no. 10, pp. 2965–2979, 2008.
[3] L. R. Rabiner and B. H. Juang, “An Introduction to Hidden Markov Models,” IEEE ASSP Magazine, 1986.
[4] X. Huang, Y. Ariki, and M. Jack, Hidden Markov Models for Speech Recognition, Columbia University Press, New York, NY, USA, 1990.
[5] X. Huang, J. Baker, and R. Reddy, “A Historical Perspective of Speech Recognition,” Communications of the ACM, vol. 57, no. 1, pp. 94–103, 2014.
[6] G. Oppy and D. Dowe, “The Turing Test,” http://plato.stanford.edu/entries/turing-test/.
[7] D. Eldridge, “Have You Heard? Audiobooks Are Booming,” Book Business, vol. 17, no. 2, pp. 20–25, April 2014.
[8] C.J. Van Heerden, F. De Wet, and M.H.
Davel, “Automatic alignment of audiobooks in afrikaans,” in PRASA 2012, CSIR International Convention Centre, Pretoria. November 2012, PRASA. S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Cambridge University Engineering Department, 2006. CGN, “Corpus gesproken nederlands,” http://lands.let.ru.nl/ cgn/ehome.htm. Shmyryov NV, “Free speech database voxforge.org,” http: //translate.google.ca/translate?js=y&prev=_t&hl= en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww. dialog-21.ru%2Fdialog2008%2Fmaterials%2Fhtml% 2F90.htm&sl=ru&tl=en. L. T´oth, B. Tarj´an, G. S´arosi, and P. Mihajlik, “Speech recognition experiments with audiobooks,” Acta Cybernetica, vol. 19, no. 4, pp. 695–713, Jan. 2010. L. Lamel and J.-L. Gauvain, “Speech Processing for Audio Indexing,” in Advances in Natural Language Processing, B. Nordstrm and A. Ranta, Eds., vol. 5221 of Lecture Notes in Computer Science, pp. 4–15. Springer Berlin Heidelberg, 2008. J. Bilmes, “Lecture 2: Automatic Speech Recognition,” http://melodi.ee.washington.edu/˜bilmes/ee516/ lecs/lec2_scribe.pdf, 2005. E. Trentin and M. Gori, “A Survey of Hybrid ANN/HMM Models for Automatic Speech Recognition,” Neural Computing, vol. 37, no. 14, pp. 91 – 126, 2001. P. Lamere, P. Kwok, W. Walker, E. Gouvˆea, R. Singh, B. Raj, and P. Wolf, “Design of the CMU Sphinx-4 Decoder,” in 8th European Conference on Speech Communication and Technology (Eurospeech), 2003. Carnegie Mellon University, “CMU Sphinx Forum,” http:// cmusphinx.sourceforge.net/wiki/communicate/. SoX, “SoX Sound eXchange,” http://sox.sourceforge.net/. D. Ho, “Notepad++ Editor,” http://notepad-plus-plus.org/. B. De Meester, R. Verborgh, P. Pauwels, W. De Neve, E. Mannens, and R. Van de Walle, “Improving Multimedia Analysis Through Semantic Integration of Services,” in 4th FTRA International Conference on Advanced IT, Engineering and Management, Abstracts. 2014, p. 2, Future Technology Research Association (FTRA). Ontwerp van een Plugin-gebaseerde Aanpak voor Woord-per-Woord Uitlijning Gebruikmakend van Automatische Spraakherkenning Joke Van den Mergele Promotor en begeleiders: prof. dr. ir. Rik Van de Walle, dr. Wesley De Neve, ir. Tom De Nies, ir. Miel Vander Sande Abstract— In dit artikel presenteren we het ontwerp van een plugingebaseerd systeem om spraak-tekst uitlijning automatisch uit te voeren. Ons doel is te onderzoeken of het mogelijk is om de uitlijning automatisch uit te voeren, in plaats van manueel, met de huidige open-source automatische spraakherkenningssystemen. We testen onze applicatie op Nederlandse audioboeken en tekst, gebruikmakend van het CMU Sphinx [1] automatisch spraakherkenningssysteem als plugin. Nederlands is een minder voorkomende taal op vlak van data nodig voor spraakherkenning. Uit onze testresultaten kunnen we echter concluderen dat het inderdaad mogelijk is om automatisch een uitlijning van audio en tekst te genereren, die accuraat genoeg is voor gebruik, met behulp van het CMU Sphinx automatisch spraakherkenningssysteem. Trefwoorden— automatische spraakherkenning (ASR, automatic speech recognition), CMU Sphinx, Nederlandse audio, spraak-tekst uitlijning I. I NTRODUCTIE UTOMATISCHE spraakherkenning (ASR, automatic speech recognition) is reeds onderworpen aan onderzoek gedurende meer dan 50 jaar, al sinds Bell Labs hun allereerste kleine-woordenschat spraakherkenning tests uitvoerden in de jaren ’50, met het doel om automatisch cijfers te herkennen via de telefoon [2]. 
Wanneer computers groeiden aan kracht tijdens de 1960’s werden filterbanken gecombineerd met dynamisch programmeren om de eerste praktische spraakherkenners te produceren. Deze werkten vooral voor ge¨ısoleerde woorden, om de taak te vergemakkelijken. In de jaren ’70 ontstond er grote vooruitgang in commerci¨ele kleine-woordenschat applicaties via de telefoon, door het gebruik van speciaal aangepaste hardware voor een bepaald doel. Lineair voorspellende codering (LPC, linear predictive coding) werd een dominant onderdeel van automatische spraakherkenning, als een automatische en effectieve manier om spraak voor te stellen. Kern ASR methodologie is ge¨evolueerd van de expertsysteem aanpak uit de jaren ’70, die gebruik maakte van het nasporen van spectrale resonantie (formanten), naar de moderne statistische methode met Markov modellen, gebaseerd op een Mel-frequentie cepstrale co¨effici¨ent (MFCC) aanpak [2]. Sinds de 1980’s is de standaard het gebruik van verborgen Markov modellen (HMMs, hidden Markov models), deze hebben de kracht om grote aantallen van trainingseenheden om te vormen in simpele probabilistische modellen [3, 4]. Tijdens de jaren ’90 evolueerden commerci¨ele applicaties van ge¨ısoleerde-woorden dictee-systemen naar algemeen bruikbare gecontinueerde-spraak-systemen. ASR is grotendeels ge¨ımplementeerd in software, bijvoorbeeld voor het gebruik bij medische rapportering, gerechtelijk dictee en automatisering A van telefonische diensten. Met de recente opneming van spraakherkenning in Apple-, Google-, en Microsoftproducten vertoont het steeds verbeterend vermogen van toestellen om met relatief onbegrensde multimodale dialogen om te gaan, zich duidelijk. Ondanks de resterende uitdagingen kunnen de vruchten van meerdere decennia onderzoek en ontwikkeling in het veld van spraakherkenning nu worden geplukt. Zoals Huang, Baker, en Reddy [5] vermelden: “We believe the speech community is en route to pass the Turing Test1 in the next 40 years with the ultimate goal to match and exceed a human’s speech recognition capability for everyday scenarios.” Maar zelfs nu is er niet zoiets als een perfect spraakherkenningssysteem. Elk systeem heeft zijn eigen beperkingen en voorwaarden waaronder het optimaal opereert. Hetgeen echter de grootste invloed blijkt op hoe goed een spraakherkenningssysteem werkt, is hoe goed getraind het akoestisch model is en voor welke bepaalde taak het is getraind [4]. Bijvoorbeeld, een akoestisch model kan getraind zijn op e´ e´ n welbepaalde persoon (of zijn aangepast om de spraak van deze persoon beter te herkennen), het kan getraind zijn om goed te werken op omroepspraak, op telefoongesprekken, op een bepaald accent, op bepaalde woorden indien het systeem enkel commando’s moet herkennen, etc. Dus, als men gebruik maakt van een automatische spraakherkenningssysteem en weet op welk soort taak het moet worden uitgevoerd, dan kan het heel nuttig zijn om het akoestisch model te trainen voor die taak. Met de recente groei aan audioboeken [7] bestaat er een hele nieuwe collectie van trainingdata, aangezien audioboeken worden opgenomen onder optimale condities, en de uitgesproken tekst opvraagbaar is voor elk boek. Maar er is ook een andere manier om deze audioboeken te gebruiken met een spraakherkenningssysteem. Waarom niet deze boeken uitlijnen met hun tekstuele inhoud, gebruikmakend van een spraakherkenningssysteem, en zo het proces voor de aanmaak van digitale boeken die zowel audio en tekstuele inhoud bevatten, te automatiseren? 
In de context van dit project werkten we samen met het Playlane bedrijf (nu Cartamundi Digital2 ), die zo vriendelijk waren 1 De zegswijze “The Turing Test” wordt gebruikt om te verwijzen naar een voorstel van Turing (1950) als een manier om om te gaan met de vraag of machines kunnen denken [6] en wordt nu uitgevoerd als een test om de ‘menselijkheid’ van een machine te bepalen. 2 We refereren verder naar het bedrijf als “Playlane” in plaats van “Cartamundi Digital”, aangezien Cartamundi Digital vele bedrijven omvat en we enkel werken met de producten gecre¨eerd door Playlane. om ons van voldoende testdata te voorzien voor onze applicatie. Het Playlane bedrijf digitaliseert kinderboeken en voegt spelletjes en educatieve inhoud toe. E´en onderdeel van deze educatieve inhoud is de uitlijning van de boektekst en een voorgelezen versie van het boek. Momenteel wordt deze uitlijning volledig manueel gedaan. Ons doel is om bedrijven zoals Playlane, of personen die nood hebben aan de uitlijning van audio en tekst, te voorzien van manieren om deze uitlijning automatisch te cre¨eren, met behulp van ASR systemen. Het vervolg van dit artikel is als volgt gestructureerd: eerst bezorgen we, in Sectie II, inzicht in interessant gerelateerd werk dat we tegenkwamen tijdens het onderzoek voor onze applicatie. Daarna voorzien we een algemeen overzicht van automatische spraakherkenningssystemen in Sectie III. We presenteren kort hoe ze in het algemeen worden opgebouwd en uit welke componenten ze bestaan. We duiden ook de belangrijkste componenten aan wanneer men werkt met onze applicatie. In Sectie IV vatten we de ASR plugin die we voor het testen van onze applicatie hebben gebruikt samen. Vervolgens bediscussi¨eren we, in Sectie V, een hoog-niveau overzicht van het ontwerp van onze applicatie en waarom we voor deze ontwerp beslissingen hebben gekozen. We geven de lezer ook een aantal richtlijnen om de meest accurate uitlijning te bekomen wanneer ze de applicatie gebruiken. In Sectie VI stellen we de nauwkeurigheid van de verkregen spraak-tekst uitlijning van onze applicatie voor en wat we kunnen aanpassen om deze nauwkeurigheid te verbeteren. Tenslotte bespreken we, in Sectie VII, de conclusies die we trokken van deze resultaten en een paar idee¨en voor de toekomst. II. G ERELATEERD W ERK In [8] rapporteren de auteurs de automatische uitlijning van audioboeken in Afrikaans. Ze gebruiken een reeds bestaand Afrikaans uitspraakwoordenboek en cre¨eren een akoestisch model van een Afrikaans spraakcorpus. Ze gebruiken het boek “Ruiter in die Nag” van Mikro om hun akoestisch model deels te trainen en om hun testen op uit te voeren. Hun doel is om grote Afrikaanse audiobestanden op woord-niveau uit te lijnen, gebruikmakend van een automatisch spraakherkenningssysteem. Ze ontwikkelden drie verschillende automatische spraakherkenningssystemen om deze te vergelijken en te ontdekken welk het best presteert. Alle drie systemen werden gebouwd gebruikmakend van de HTK toolkit [9]. Om de nauwkeurigheid van hun automatische uitlijningsresultaten te bepalen, vergeleken ze de verschillen tussen de uiteindelijk uitgelijnde startposities van elk woord met een schatting van de startposities die ze verkregen door foneemherkenning. 
Ze ontdekten wat de voornaamste oorzaken van uitlijningsfouten waren: • fouten gemaakt door de spreker, zoals aarzelingen, ontbrekende woorden, herhaalde woorden, stotteren, etc.; • samentrekkingen veroorzaakt door snelle spraak; • moeilijkheid in het identificeren van de startposities van heel korte woorden (´ee´ n of twee fonemen lang); en • een paar tekst-normalisatie fouten (bijvoorbeeld, ‘eenduisend negehonderd’ in plaats van ‘neeentienhonderd’). Hun uiteindelijke conclusies zijn dat het basis akoestisch model een vrij accurate uitlijning genereert voor praktische doelen, maar dat het model dat werd getraind op het “Ruiter in die Nag” audioboek de beste uitlijningsresultaten gaf. De reden waarom hun onderzoek interessant is voor ons is omdat Nederlands, net als Afrikaans, een minder voorkomende taal is, ondanks de grote inspanningen gemaakt door het Corpus Gesproken Nederlands (CGN) [10]. Bijvoorbeeld, de akoestische modellen van Voxforge [11] die we gebruiken, zowel voor Engelse als Nederlandse spraakherkenning, bevatten ongeveer 40 uur spraak voor meer dan honderd sprekers voor het Engels, maar slechts 10 uur spraak voor het Nederlands. De voornaamste oorzaken van uitlijningsfouten die ze ontdekten zijn natuurlijk ook interessant voor ons onderzoek, aangezien we deze kunnen voorleggen aan de gebruikers van onze applicatie en zo hun aandacht erop kunnen richten. De meeste audioboeken worden echter onder professionele condities opgenomen en het is dus onwaarschijnlijk dat er zich veel fouten door de sprekers in de audio bevinden. De derde interessante ontdekking van hun onderzoek was dat het trainen van het akoestisch model op een deel van het bedoelde audioboek, de meest accurate uitlijning genereert van al hun uitgeteste modellen. We zullen proberen gelijkaardige resultaten te bereiken, hoewel we ons akoestisch model ook op andere audioboeken, gelezen door dezelfde persoon, zullen trainen en dit het liefst met een boek van hetzelfde leesniveau als het doel audioboek, aangezien deze gelijkaardige pauzes en woordlengtes bevatten. De auteurs van [12] testten de uitlijningsmogelijkheden van hun spraakherkenningssysteem onder bijna ideale condities, dit wil zeggen op audioboeken. Ze ontwikkelden ook drie verschillende akoestische modellen, e´ e´ n getraind op manuele transcripties, e´ e´ n getraind op het doel audioboek op lettergreep-niveau, en e´ e´ n getraind op het doel audioboek op woord-niveau. Ze trokken dezelfde conclusies als de auteurs van [8], namelijk dat het akoestisch model getraind op het doel audioboek betere uitlijningsresultaten genereert. Ze ontdekten ook dat het uitlijnen van audioboeken die onder optimale condities zijn opgenomen, ‘makkelijker’ is dan het uitlijnen van re¨ele spraak met achtergrondgeluiden en ruis. Ze voerden ook tests uit met een akoestisch model dat ofwel volledig spreker-onafhankelijk, ofwel aangepast aan en getraind op een bepaalde spreker, ofwel volledig getraind op een bepaalde spreker was. Het is niet verwonderlijk dat ze ontdekten dat het akoestisch model dat op een bepaalde persoon was getraind een bijna perfecte uitlijning van een tekst door die persoon gesproken, genereerde. Het voor ons heel erg interessante deel van dit artikel is dat ze de gevoeligheid van een spraakherkenner voor de articulatiekenmerken en eigenaardigheden van de spreker, quantificeren. 
Vergeleken met de gemiddelde nauwkeurigheidswaarde van ongeveer 74% hebben de nauwkeurigheidsresultaten van de herkenning van elke spreker een redelijk grote afwijking in beide richtingen. Ze wijtten deze hoge afwijking in de accuraatheid aan de gevoeligheid van het spraakherkenningssysteem voor de sprekerstem. Het zou dus een goed idee zijn om het akoestisch model te trainen voor elke stemacteur waarmee bedrijven zoals Playlane werken, of minstens het akoestisch model aan te passen aan de stemmen van de stemacteurs als blijkt dat de herkenningsresultaten suboptimaal zijn. III. AUTOMATISCHE S PRAAKHERKENNING Het voornaamste doel van spraakherkenning is om de meest geschikte woordenreeks te vinden, gegeven het geobserveerde akoestische signaal. Het spraakdecoderingsprobleem bestaat dan uit het vinden van het maximum van de waarschijnlijkheid van de woordenreeks w, gegeven signaal x, of, equivalent, het maximaliseren van de “fundamentele vergelijking van de spraakherkenning” P r(w)f (x|w). De meeste huidige automatische spraakherkenningssystemen gebruiken statistische modellen. Dit betekent dat van spraak wordt aangenomen dat het kan worden gegenereerd door een taalmodel en akoestisch model. Het taalmodel genereert schattingen van P r(w) voor alle woorden w en hangt af van de hoogniveau beperkingen en taalkundige kennis over toegelaten woorden voor de welbepaalde taak. Het akoestisch model codeert de boodschap w in het akoestische signaal x, dat wordt voorgesteld door de waarschijnlijkheidsdichtheidsfunctie (probability density function) f (x|w). Het beschrijft de statistieken van de reeksen van geparametriseerde akoestische observaties in de feature ruimte, gegeven de corresponderende geuite woorden. De auteurs van [13] verdelen zo een spraakherkenningssysteem in verschillende componenten. De voornaamste kennisbronnen zijn het spraak- en tekstcorpus, deze representeren de trainingdata, en het uitspraakwoordenboek. De training van het akoestisch en taalmodel vertrouwt op de normalisatie en voorbewerking, zoals N -gram schatting en feature extractie, van de trainingdata. Dit helpt om de lexicale veranderlijkheid te verminderen en transformeert de tekst om de gesproken taal beter te representeren. Deze stap is echter taal-specifiek. Het omvat regels over hoe met nummers om te gaan, woordafbrekingen, afkortingen en acroniemen, weglatingstekens, enz. Na de training worden het resulterende akoestisch en taalmodel gebruikt voor de eigenlijke spraakdecodering. Het input spraaksignaal is eerst verwerkt door het akoestisch front-end, dat normaal gezien de feature extractie uitvoert, en daarna doorgegeven aan de decoder. Met het taalmodel, akoestisch model en uitspraakwoordenboek ter beschikking kan de decoder de eigenlijke spraakherkenning uitvoeren en geeft de spraaktranscriptie weer aan de gebruiker. Volgens [14] kunnen de verschillende componenten die hierboven staan beschreven, worden ingedeeld in de zogenaamde vijf basisstappen van ASR: 1. Signaalverwerking/ Feature Extractie: Deze stap stelt het akoestisch front-end voor. Dezelfde technieken worden ook toegepast op het spraakcorpus, voor de feature extractie. In onze applicatie gebruiken we Mel-frequentie cepstrale co¨effici¨enten (MFCC) [14] om feature extractie uit te voeren. 2. Akoestisch Modellering: Deze stap omvat de verschillende acties nodig om een akoestisch model te bouwen. Verborgen Markov modellen (HMMs) [15] vormen de manier waarop onze akoestische modellen zijn getraind. 3. 
Uitspraak Modellering: Tijdens deze stap wordt het uitspraakmodel gecre¨eerd dat wordt gebruikt door de decoder. 4. taalmodellering: In deze stap wordt het taalmodel gecre¨eerd. De laatste, maar meest belangrijke stap in zijn creatie, is de N -gram schatting. 5. Gesproken Taal-Begrip/Dialoogsystemen: Deze stap refereert naar het volledige systeem dat is gebouwd en hoe het reageert op de gebruiker. IV. D E ASR P LUGIN G EBRUIKT IN ONZE A PPLICATIE CMU Sphinx is de algemene term om een groep spraakherkenningssystemen ontwikkeld aan de Carnegie Mellon Universiteit (CMU), te beschrijven. Ze behelzen een reeks spraakherkenners (Sphinx-2 tot en met 4) en een akoestisch modeltrainer (SphinxTrain). In 2000 maakte de Sphinx groep aan de Carnegie Mellon Universiteit enkele spraakherkennercomponenten open-source, onder andere Sphinx-2 en, een jaar later, Sphinx-3. De spraak decoders komen met akoestische modellen en voorbeeldapplicaties. De beschikbare middelen bevatten software voor het trainen van akoestische modellen, taalmodelcompilatie en een publiek-domein uitspraakwoordenboek voor Engels, “cmudict” genaamd. Het Sphinx-4 spraakherkenningssysteem [1] is de laatste versie toegevoegd aan de verschillende spraakherkenningssystemen van de Carnegie Mellon University. Het is gezamelijk ontworpen door de Carnegie Mellon Universiteit, Sun Microsystems laboratories, Mitsubishi Electric Research Labs en Hewlett-Packard’s Cambridge Research Lab. Het verschilt van de vroegere CMU Sphinx systemen in termen van modulariteit, flexibiliteit en algoritmische aspecten. Het gebruikt nieuwere zoekstrategie¨en en is universeel in zijn aanvaarding van verschillende soorten grammatica’s, taalmodellen, types akoestische modellen en feature stromen. Sphinx-4 is volledig ontwikkeld in de JavaTM programmeertaal en is dus zeer draagbaar. Het laat ook multi-threading toe en heeft een zeer flexibele gebruikersinterface. Wij gebruiken de laatste toevoeging aan de Sphinx groep, namelijk Sphinx-4, in onze applicatie, maar onze Sphinx configuratie gebruikt een Sphinx-3 lader om het akoestisch model in de decoder module in te laden. Een hoog-niveau overzicht van de architectuur van CMU Sphinx-4 is redelijk voor de hand liggend. De drie voornaamste componenten zijn de front-end, de decoder en de kennisbasis. Deze zijn alle drie controleerbaar door een externe applicatie, die de inputspraak meegeeft en de output aanpast naar het gewenste formaat. De Sphinx-4 architectuur is ontworpen met een hoge graad van modulariteit. Alle componenten zijn onafhankelijke vervangbare softwaremodules, met uitzondering van de componenten in de kennisbasis, en zijn geschreven in Java. Voor meer informatie over de verschillende Sphinx-4 componenten verwijzen we naar [16] en de Sphinx-4 source code en documentatie [1, 17]. V. O NZE A ANPAK A. Hoog-Niveau Overzicht Onze applicatie is volledig ontwikkeld in de JavaTM programmeertaal en bestaat uit drie aparte componenten, zie Figuur 1. De eerste component is de Main component. Dit is waar de meeste functionaliteit van de applicatie is gesitueerd. Het bevat de ontleding van de opdracht prompt en selecteert de plugin, en laadt die ook in de applicatie (aangezien de plugin zich in een andere component bevindt, zie hieronder). Het bevat ook het hele testraamwerk. • Main Component Plugin Knowledge Component in een simpel tekst formaat, zoals bijvoorbeeld .txt. Het moet echter gecodeerd zijn in UTF-8 formaat. 
Dit is vaak al zo, maar het kan gemakkelijk worden geverifieerd en toegepast in source code bewerkers, zoals notepad++ [19]. Dit is nodig om de eventuele speciale karakters, zoals aanhalingstekens, geaccentueerde letters, enz., in te lezen en voor te stellen. VI. R ESULTATEN De tweede component is de Plugin Knowledge(Kennis) component, die alle functionaliteit nodig om de eigenlijke plugin te implementeren, bevat. Het geeft de gebruiker de optie tussen twee mogelijke outputformaten, namelijk een standaard ondertitelingsbestand (.srt bestandsformaat) en een EPUB bestand. Deze component ontvangt de audio- en tekstinput van de main component en geeft dit door naar de ASR plugin component. • De derde component is waar de ASR plugin feitelijk is gesitueerd. We refereren naar deze component als de ‘plugin component’, aangezien deze het ASR systeem bevat. We hebben ervoor gekozen om onze applicatie in deze drie componenten te splitsen om de toevoeging van een nieuwe plugin aan de applicatie zo gemakkelijk mogelijk te houden. Indien iemand het gebruikte ASR systeem wil veranderen, hoeven ze enkel een link te voorzien van de tweede component naar de plugin component. Door het splitsen van de eerste en tweede component, hoeven ze niet uit te zoeken waarvoor al de extra functionaliteit en modules in de eerste component dienen, aangezien deze geen impact hebben op de ASR plugin. Zo kunnen ze ook de nieuwe ASR plugin voor onze applicatie implementeren met enkel kennis van de tweede component. • B. Richtlijnen om de Nauwkeurigheid van de Uitlijning te Vergroten We beschrijven hieronder een aantal kenmerken waaraan de audio- en tekstdata moeten voldoen opdat onze applicatie en, vooral, de Sphinx ASR plugin zo nauwkeurig mogelijk kunnen werken. • Ten eerste hoort het input audiobestand te voldoen aan een aantal kenmerken: het moet monofoon zijn, een bemonstering graad van 16kHz hebben en elk monster moet gecodeerd zijn in 16 bits, volgens little endian formaat. We gebruiken een kleine tool genaamd SoX [18] om dit te bereiken. totaal #woorden input tekstbestand 3000 audioduur normale snelheid audioduur trage snelheid 0:43:12 2520 2500 2000 1853 1500 961 1000 500 0 449 243 565 1131 1254 1286 0:36:00 0:28:48 0:21:36 audioduur Fig. 1. Hoog-niveau overzicht van onze applicatie We werden door het Playlane bedrijf voorzien van 10 Nederlandse boeken die we konden gebruiken om onze applicatie te testen. De ondertitelingsbestanden die werden voorzien door het Playlane bedrijf, werden manueel uitgelijnd door de werknemers van Playlane. Zij luisterden naar het audiobestand en selecteerden manueel de tijden van elk woord. Om de nauwkeurigheid van onze applicatie te verifi¨eren hadden we nood aan boeken die reeds een woord-per-woord uitlijning bevatten, zodat we deze konden vergelijken met de uitlijning die gegenereerd werd door onze applicatie en de ASR plugin. De boeken die we gebruikten, worden opgesomd in de volgende lijst: • Avontuur in de woestijn • De jongen die wolf riep • De muzikanten van Bremen • De luie stoel • Een hut in het bos • Het voetbaltoneel • Luna gaat op paardenkamp • Pier op het feest • Ridder muis • Spik en Spek: Een lek in de boot Al deze boeken worden voorgelezen door ofwel S. of V., beiden vrouwen met Nederlands als moedertaal. Twee boeken worden door S. gelezen, namelijk “De luie stoel” en “Het voetbaltoneel”, de andere acht worden door V. voorgelezen. 
VI. RESULTS

The Playlane company provided us with 10 Dutch books that we could use to test our application. The subtitle files provided by Playlane were aligned manually by Playlane's employees: they listened to the audio file and manually selected the timings of each word. To verify the accuracy of our application, we needed books that already contained a word-per-word alignment, so that we could compare it with the alignment generated by our application and the ASR plugin. The books we used are listed below:
• Avontuur in de woestijn
• De jongen die wolf riep
• De muzikanten van Bremen
• De luie stoel
• Een hut in het bos
• Het voetbaltoneel
• Luna gaat op paardenkamp
• Pier op het feest
• Ridder muis
• Spik en Spek: Een lek in de boot
All of these books are read aloud by either S. or V., both women who are native speakers of Dutch. Two books are read by S., namely "De luie stoel" and "Het voetbaltoneel"; the other eight are read by V.

Figure 2 shows the number of words in each book, as well as the duration of the audio file, for both the normal-pace reading and the slow-pace reading of the book.

Fig. 2. Graph showing the size of the input text and the duration of the audio files

The difference between both alignments is measured in milliseconds, word per word. For each word, the difference between the start times and the stop times of both alignments is computed separately, and the average is then taken over all words that occur in both files. We decided to compute the average of the start and stop times of each word separately when we noticed, after careful manual inspection of the first test results, that Sphinx-4 tends to add more pause at the beginning of a word than at the end of a word. In other words, it tends to start marking a word in the pause before the word is spoken, but stops marking the word fairly accurately after the pronunciation of the word ends.
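Assuming both alignments are available as lists of per-word timings, the metric just described amounts to the following sketch. The class and method names are illustrative and are not taken from our actual test framework.

import java.util.List;

class WordTiming {
    final String word;
    final long startMs;
    final long stopMs;
    WordTiming(String word, long startMs, long stopMs) {
        this.word = word;
        this.startMs = startMs;
        this.stopMs = stopMs;
    }
}

class AlignmentMetric {
    /** Returns {average start-time difference, average stop-time difference} in milliseconds,
     *  assuming the two lists are paired word by word. */
    static long[] averageDifferences(List<WordTiming> manual, List<WordTiming> generated) {
        int n = Math.min(manual.size(), generated.size());
        if (n == 0) return new long[] {0, 0};
        long startSum = 0, stopSum = 0;
        for (int i = 0; i < n; i++) {
            startSum += Math.abs(manual.get(i).startMs - generated.get(i).startMs);
            stopSum  += Math.abs(manual.get(i).stopMs  - generated.get(i).stopMs);
        }
        return new long[] { startSum / n, stopSum / n };
    }
}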
Figure 3 shows the average difference of the start and stop times for each word, between the files provided by Playlane and the automatically generated alignment of our application.

Fig. 3. The average start and stop time differences between the automatically generated alignment and the Playlane alignment

Six of the eight books read by V. have timings that are aligned with, on average, less than one second of difference between the output of our application and the alignment provided by Playlane. The two books read by S. have the largest average start and stop time differences, which is why we decided to train the acoustic model further on her voice, see Figure 5.

From the test results obtained, we also suspected it would be interesting to know how well our ASR system performs when words are missing from the input text. These results are shown in Figure 4 and clearly demonstrate that the most accurate alignment results are obtained when the input text file reflects the spoken text as closely as possible.

Fig. 4. The average start, stop and maximum time differences between the automatically generated alignment and the Playlane alignment for the book "Ridder Muis", with the word "muis" missing from the input text

As mentioned before, it seemed interesting to train the acoustic model on S.'s voice, since her books yielded the worst results for the alignment task. We trained the acoustic model on a book called "Wolf heeft jeuk", which is also read by S. but was not part of the test data. We trained the acoustic model on the last chapter of that book, which contains 13 sentences with a total of 66 words, covering 24 seconds of audio. We also trained the original acoustic model on a part of the book "De luie stoel"; of that book we used 23 sentences, with a total of 95 words, covering 29 seconds of audio. The alignment results we achieved when using the trained acoustic models for the alignment task can be found in Figure 5.

Fig. 5. The average start time difference of each book, using either the original acoustic model, the acoustic model trained on "Wolf heeft jeuk", or the acoustic model trained on "De luie stoel"

As can be seen in Figure 5, there is a clear improvement for both books read by S., as we had expected, although the average time difference is still about 20 seconds for the book "De luie stoel" and more than 40 seconds for "Het voetbaltoneel" in the alignment results we obtained with the acoustic model trained on "Wolf heeft jeuk". When we perform the alignment with the acoustic model trained on the book "De luie stoel", the average time difference for the book "De luie stoel" is only 300 milliseconds. The alignment for the book "Het voetbaltoneel" also improves compared to the result we obtained with the acoustic model trained on "Wolf heeft jeuk", with only about 30 seconds of average time difference instead of 40 seconds. This means that the alignment results obtained with the acoustic model trained on "Wolf heeft jeuk" are still not usable, but they predict good results if the acoustic model is trained further on this book. The alignment results generated with the acoustic model trained on "De luie stoel" are nearly perfect for that book, and also yield an improvement for the book "Het voetbaltoneel". This leads us to believe that further training of the acoustic model on books read by S. will generate much improved alignment results for books read by S.

Of the books read by V., some have about the same accuracy with the trained acoustic models as with the original model, and others have better accuracy. However, since four of the eight books have worse accuracy, we can conclude that, in general, the acoustic models trained on S.'s voice have a negative influence on the books read by V.

VII. CONCLUSION AND FUTURE WORK

The goal of this article was to investigate the possibility of performing an alignment task automatically, instead of manually.
To this end, we developed a software application that offers users the possibility to choose, in a simple way, which ASR system they want to use, through the use of plugins. We offer additional flexibility in our application by letting the user choose between two output formats (a common subtitle format and an EPUB file format), and by keeping the creation of a new output format as easy as possible.

From the results in Section VI, using the CMU Sphinx ASR plugin, we conclude that it is indeed possible to automatically generate an alignment of audio and text that is accurate enough to be usable (for example, our test results show on average less than one second of difference between the automatically generated alignment and the manual baseline). There is, however, still work to be done, especially for less-resourced languages such as Dutch. We obtained positive results when training the acoustic model on (less than 60 seconds of) audio data that corresponds to the speaker or the type of book for which we want to improve the alignment accuracy. Our first suggestion for future work is therefore the further training of the acoustic model for Dutch, especially when a clearly defined type of alignment task has to be performed. Since manually aligning an audiobook can take several days, the small effort of training the acoustic model is justified, certainly when taking into account the time that can be saved by aligning a book automatically and accurately. The trained model can also be used on multiple books and still give an accurate result, which means that an acoustic model does not have to be trained for every new alignment task.

We further note that the accuracy of the input text and the coverage of the pronunciation dictionary have a large influence on the accuracy of the alignment result. From our tests we can conclude that it is best to have no missing words in the input text or the pronunciation dictionary.

There is a clear need for a more robust system, with fewer unexplained outliers. We propose an approach to increase the robustness of our application by comparing the alignment results of two or more different ASR plugins. The overlapping results can, within a certain error margin, be considered 'correct'. This approach is based on the approach followed in [20].

It is our conviction that the system we have designed constitutes a flexible approach to speech-text alignment and, since it can be adapted to the user's preferred ASR system, it is beneficial to users who previously performed alignment tasks manually.

REFERENCES
[1] Carnegie Mellon University, "CMU Sphinx Wiki," http://cmusphinx.sourceforge.net/wiki/.
[2] D. O'Shaughnessy, "Invited Paper: Automatic Speech Recognition: History, Methods and Challenges," Pattern Recognition, vol. 41, no. 10, pp. 2965-2979, 2008.
[3] L. R. Rabiner and B. H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, 1986.
[4] X. Huang, Y. Ariki, and M. Jack, Hidden Markov Models for Speech Recognition, Columbia University Press, New York, NY, USA, 1990.
[5] X. Huang, J. Baker, and R. Reddy, "A Historical Perspective of Speech Recognition," Communications of the ACM, vol. 57, no. 1, pp. 94-103, 2014.
[6] G. Oppy and D. Dowe, "The Turing Test," http://plato.stanford.edu/entries/turing-test/.
[7] D. Eldridge, "Have You Heard? Audiobooks Are Booming," Book Business: Your Source for Publishing Intelligence, vol. 17, no. 2, pp. 20-25, April 2014.
[8] C. J. Van Heerden, F. De Wet, and M. H. Davel, "Automatic Alignment of Audiobooks in Afrikaans," in PRASA 2012, CSIR International Convention Centre, Pretoria, November 2012.
[9] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Cambridge University Engineering Department, 2006.
[10] CGN, "Corpus Gesproken Nederlands," http://lands.let.ru.nl/cgn/ehome.htm.
[11] N. V. Shmyryov, "Free Speech Database VoxForge.org," http://translate.google.ca/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.dialog-21.ru%2Fdialog2008%2Fmaterials%2Fhtml%2F90.htm&sl=ru&tl=en.
[12] L. Tóth, B. Tarján, G. Sárosi, and P. Mihajlik, "Speech Recognition Experiments with Audiobooks," Acta Cybernetica, vol. 19, no. 4, pp. 695-713, Jan. 2010.
[13] L. Lamel and J.-L. Gauvain, "Speech Processing for Audio Indexing," in Advances in Natural Language Processing, B. Nordström and A. Ranta, Eds., vol. 5221 of Lecture Notes in Computer Science, pp. 4-15, Springer Berlin Heidelberg, 2008.
[14] J. Bilmes, "Lecture 2: Automatic Speech Recognition," http://melodi.ee.washington.edu/~bilmes/ee516/lecs/lec2_scribe.pdf, 2005.
[15] E. Trentin and M. Gori, "A Survey of Hybrid ANN/HMM Models for Automatic Speech Recognition," Neurocomputing, vol. 37, no. 1-4, pp. 91-126, 2001.
[16] P. Lamere, P. Kwok, W. Walker, E. Gouvêa, R. Singh, B. Raj, and P. Wolf, "Design of the CMU Sphinx-4 Decoder," in 8th European Conference on Speech Communication and Technology (Eurospeech), 2003.
[17] Carnegie Mellon University, "CMU Sphinx Forum," http://cmusphinx.sourceforge.net/wiki/communicate/.
[18] SoX, "SoX Sound eXchange," http://sox.sourceforge.net/.
[19] D. Ho, "Notepad++ Editor," http://notepad-plus-plus.org/.
[20] B. De Meester, R. Verborgh, P. Pauwels, W. De Neve, E. Mannens, and R. Van de Walle, "Improving Multimedia Analysis Through Semantic Integration of Services," in 4th FTRA International Conference on Advanced IT, Engineering and Management, Abstracts, 2014, p. 2, Future Technology Research Association (FTRA).

Contents

1 Introduction
  1.1 Problem Statement
  1.2 The Playlane Company
  1.3 Related Work
  1.4 Outline
2 Automatic Speech Recognition
  2.1 History of ASR
  2.2 General Design of ASR Systems
    2.2.1 Stage 1: Signal Processing/Feature Extraction
    2.2.2 Stage 2: Acoustic Modelling
    2.2.3 Stage 3: Pronunciation Modelling
    2.2.4 Stage 4: Language Modelling
    2.2.5 Stage 5: Spoken Language Understanding/Dialogue Systems
3 Decomposition of CMU Sphinx-4
  3.1 History of CMU Sphinx
  3.2 Architecture of CMU Sphinx
    3.2.1 Front End Module
    3.2.2 Decoder Module
    3.2.3 Knowledge Base Module
    3.2.4 Work Flow of a Sphinx-4 Run
  3.3 Our CMU Sphinx Configuration
    3.3.1 Global Properties
    3.3.2 Recognizer and Decoder Components
    3.3.3 Grammar Component
    3.3.4 Acoustic Model Component
    3.3.5 Front End Component
    3.3.6 Monitors
4 Our Application
  4.1 High-Level View
  4.2 Components 2 & 3: Projects pluginsystem.plugin & pluginsystem.plugins.sphinxlongaudioaligner
  4.3 Component 1: Project pluginsystem
  4.4 Best Practices for Automatic Alignment
    4.4.1 How to add a new Plugin to our System
    4.4.2 How to run the Application
  4.5 Output Formats
    4.5.1 .srt File Format
    4.5.2 EPUB Format
5 Results and Evaluation
  5.1 Test Files
  5.2 Results
    5.2.1 Evaluation Metrics and Formulas
    5.2.2 Memory Usage and Processing Time
    5.2.3 First Results
  5.3 Meddling with the Pronunciation Dictionaries
  5.4 Accuracy of the Input Text
  5.5 A Detailed Word-per-Word Time Analysis
  5.6 Training the Acoustic Model
    5.6.1 How to Train an Acoustic Model
    5.6.2 Results with Different Acoustic Models
  5.7 The Sphinx-4.5 ASR Plugin
    5.7.1 Sphinx-4.5 with Different Acoustic Models
  5.8 Alignment Results for English Text and Audio
6 Conclusions and Future Work
A Configuration File Used for Recognizing Dutch
B Slow Pace Audio Alignment Results
C Configuration File Used by Sphinx-4.5 for Dutch Audio
D Specifications for the .wav Files Used for Training the Acoustic Model
  D.1 SoX Tool
  D.2 Audacity Software
Bibliography

Abbreviations

ANN    Artificial Neural Network
AM     Acoustic Model
ASCII  American Standard Code for Information Interchange
ASR    Automatic Speech Recognition
BP     Backpropagation
BSD    Berkeley Software Distribution
CDHMM  Continuous Density HMM
CD-GMM-HMM  Context-Dependent Gaussian Mixture Model HMM
CI     Context Independent
CMN    Cepstral Mean Normalisation
CMU    Carnegie Mellon University
DCT    Discrete Cosine Transformation
DFT    Discrete Fourier Transform
DNN    Deep Neural Network
DP     Dynamic Programming
DTW    Dynamic Time Warping
FT     Fourier Transform
FFT    Fast Fourier Transform
GMM    Gaussian Mixture Model
HMM    Hidden Markov Model
HTK    Hidden Markov Toolkit
IDFT   Inverse Discrete Fourier Transform
ISIP   Institute for Signal and Information Processing
IVR    Interactive Voice Response
LM     Language Model
LP     Linear Prediction
LPC    Linear Predictive Coding
MFCC   Mel Frequency Cepstral Coefficient
ML     Maximum-Likelihood
MLLR   Maximum-Likelihood Linear Regression
MLP    Maximum Likelihood Predictors
MP     Multilayer Perceptron
NIST   National Institute of Standards and Technology
NLP    Nonlinear Programming
OOV    Out-of-Vocabulary
PD     Pronunciation Dictionary
PDF    Probability Density Function
PLP    Perceptual Linear Prediction
PNCC   Power Normalized Cepstral Coefficients
RASTA-PLP  Relative Spectral-PLP
RNN    Recurrent Neural Network
SD     Speaker-dependent
SI     Speaker-independent
SLU    Spoken Language Understanding
STT    Speech-to-Text
SVM    Support Vector Machine
TDNN   Time-Delay Neural Network
TTS    Text-to-Speech
URL    Uniform Resource Locator
WER    Word Error Rate
XML    eXtensible Markup Language

Chapter 1 Introduction

When Bell Labs performed their very first small-vocabulary speech recognition tests during the 1950's, they had every reason to believe automatic speech recognition research would attract a great deal of interest. Ever since, research into automatic speech recognition has been ongoing. Due to the increase in computational power of computer systems, and the discovery of new mathematical techniques in the 1970's and 1980's, major improvements were achieved for the then-common approaches to automatic speech recognition systems. New ideas, new techniques, or new takes on old techniques are still being discovered that have a positive impact on speech recognition systems. But even now, sixty years later, there is no such thing as a perfect speech recognition system. Each system has its own limitations and conditions that allow it to perform optimally. However, what has the biggest influence on how a speech recognition system performs is how well trained the acoustic model is, and which specific task it is trained for [31]. For example, an acoustic model can be trained on one person specifically (or be updated to better recognize that person's speech), it can be trained to perform well on broadcast speech, on telephone conversations, on a certain accent, on certain words if only commands must be recognized, etc. Thus, if a person has access to an automatic speech recognition system, and knows which task it needs to be used for, it would be very useful to train the acoustic model according to that task.

With the recent boom in audiobooks [23], a whole new set of training data is available, as audiobooks are recorded under optimal conditions and the text that is read is obtainable for each book. But there is also another way these audiobooks might be used by a speech recognition system.
Why not align these audiobooks with their book content using a speech recognition system, and thereby automate the process of creating digital books that contain both audio and textual content? This is our main goal in this dissertation.

1.1 Problem Statement

To the best of our knowledge, when in need of speech-text alignment, people and companies perform this task manually, which is highly time- and work-intensive. One of the companies that uses this approach is the company Cartamundi Digital, previously Playlane (we will from now on refer to the company by its old name "Playlane", as Cartamundi Digital is a big company that comprises many smaller companies, each adding their own product to the whole), see Section 1.2.

In this dissertation, an application is designed that can automatically synchronise an audio file with a text file, by using a pre-existing ASR system. The goal is, apart from creating said synchronisation, to keep the application as generic as possible, so as to be able:
• to switch out different automatic speech recognition systems;
• to quickly create a new output format and add it to the application.
We believe this approach might greatly decrease the amount of time spent on manually aligning text and audio, and thus would provide great benefit for companies such as Playlane.

1.2 The Playlane Company

The Playlane company creates digital picture books and games for children. They gained international fame with their "Fundels", which are digital picture books for iPad and PC that are extended with games and educational activities. They are currently used by hundreds of schools. One of the educational activities that is part of "Fundels" is the ability to listen to an audio recording of the book while watching the pictures that are relevant to the spoken text, or even following the book's text, which is highlighted as it is being said. Whether the "Fundel" only shows pictures accompanying the spoken text, or highlights at a sentence or word level, depends on the reading difficulty classification of the book. Since these books should also be usable to help children learn to read, there is also the possibility of having the book read aloud at a slow pace (how slow also depends on the book's difficulty classification). The Playlane company has already created "Fundels" for several books, ranging over multiple difficulty classifications, and has recently started incorporating books for higher elementary school readers.

To create this part of the "Fundels", the Playlane company hires one of the voice actors they work with to read the entire text, both at a normal pace and at a slow pace, under studio conditions. Then, after the audio recordings are done, one of the employees manually aligns the audio file with the text of the book. They listen to the audio carefully and, word by word, note down which word is said, the time when the word starts in the audio file, and the time when the word stops. This takes them days, rather than hours, depending on the length of the book. Being able to speed up this process would provide a great business opportunity for the company.

1.3 Related Work

In this section we point out some interesting articles we came upon when researching this dissertation, focussing mainly on parts that relate to our work.

In [66], the authors report on the automatic alignment of audiobooks in Afrikaans. They use an already existing Afrikaans pronunciation dictionary and create an acoustic model from an Afrikaans speech corpus.
They use the book "Ruiter in die Nag" by Mikro to partly train their acoustic model, and to perform their tests on. Their goal is to align large Afrikaans audio files at word level, using an automatic speech recognition system. They developed three different automatic speech recognition systems to be able to compare these and discover which performs best; all three of them are built using the HTK toolkit [70]. The difference between the three systems lies in the acoustic model: the first acoustic model is the baseline model, which is trained on around 90 hours of broadband speech, the second uses a Maximum a Posteriori adaptation on the baseline model, and the third is trained on the audiobook. To determine the accuracy of their automatic alignment results, the authors compare the difference in the final aligned starting position of each word with an estimate of the starting position they obtained by using phoneme recognition. They discovered that the main causes of alignment errors are:
• speaker errors, such as hesitations, missing words, repeated words, stuttering, etc.;
• rapid speech containing contractions;
• difficulty in identifying the starting position of very short (one- or two-phoneme) words; and,
• a few text normalization errors (e.g. 'eenduisend negehonderd' for 'neeentienhonderd').
Their final conclusions are that the baseline acoustic model does provide a fairly good alignment for practical purposes, but that the model that was trained on the target audiobook provided the best alignment results.

The reason their research is interesting to us is that, just like Afrikaans, Dutch is a slightly undersourced language (though not as undersourced as Afrikaans), despite the large efforts made by the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) [16]. For example, the acoustic models of Voxforge [48] we use, both for English and Dutch speech recognition, contain around 40 hours of speech from over a hundred speakers for English, while only 10 hours of speech are available for Dutch. The main causes of alignment errors they discovered are, of course, also interesting for us to know, since we can present these to the users of our system and create awareness. However, as the books are read under professional conditions, there are unlikely to be many speaker errors, and as they mainly work with children's books, the spoken text is also unlikely to contain many contractions. The third interesting fact they researched was that training the acoustic model on part of the target audiobook provides the best alignment results of their tested models. We will also try to achieve this; however, we will train the acoustic model on other audiobooks read by the same person, preferably a book with the same reading difficulty classification as the target audiobook, as these have similar pauses and word lengths, see Section 5.6.2.

The authors of [63] try out the alignment capabilities of their recognition system under near-ideal conditions, i.e. on audiobooks. They also created three different acoustic models: one trained on manual transcriptions, one trained on the audiobooks at syllable level, and one trained on the audiobooks at word level. They draw the same conclusions as the authors of [66], namely, that training the acoustic models on the target audiobooks provides better results, as well as that aligning audiobooks (which are recorded under optimal conditions) is 'easier' than aligning real-life speech with background noises or distortions.
They also performed tests using acoustic models that were completely speaker-independent, slightly adapted and trained on a specific speaker, and completely trained on a specific speaker. It may come as no surprise that they discovered that the acoustic model that was trained on a certain person provided almost perfect alignment of a text spoken by that person. However, the one part of this article that is extremely interesting to us is that they quantify the sensitivity of a speech recognizer to the articulation characteristics and peculiarities of the speaker. Figure 1.1 shows a histogram of the phone recognition results obtained using the MTBA Hungarian Telephone Speech Corpus, which contains recordings made by 500 people. As can be seen, the results show quite a large deviation in both directions, compared to the average value of about 74%. They believe the reason for this high deviation in the scores can mostly be blamed on the sensitivity of the recognizer to the actual speaker's voice. It would thus be a good idea to train an acoustic model for each voice actor, or at least adapt the acoustic model we use to the voice actors by training it on their speech, if it appears the results we achieve with our speech recognition application are suboptimal.

In articles [56] and [14], the authors describe their efforts to automatically generate digital talking books, how these provide interesting research possibilities, and the framework they created to build such talking books. These books are used by visually impaired people, and therefore need to conform to certain standards, the most widely used being the DAISY standard [17]. They created their own speech recognition system and verified that, with proper recording procedures, the alignment task can be fully automated in a very fast single-step procedure, even for a two-hour-long recording. Their main goal is to provide an application that can easily convert existing audio tapes and OCR-based digitalisations of text books into full-featured, multi-synchronised, multimodal digital books. Although we will not be implementing the DAISY standard to conform to the accessibility restrictions with our application, we will implement the EPUB3 standard as one of the output formats, and believe that adhering to the DAISY standard's accessibility restrictions would provide a good opportunity for future work.

The authors of [35] present their own speech recognition system, called SailAlign, which is an open-source software toolkit for robust long speech-text alignment. The authors explain that conventional automatic speech recognition systems that use Viterbi forced alignment can often be inadequate due to mismatched audio and text and/or noisy audio. They wish to circumvent these restrictions with SailAlign. They demonstrate the potential use of the SailAlign system for the exploitation of audiobooks to study read speech. The basic idea behind their system is the assumption that the long speech-text alignment problem can be posed as a long text-text alignment problem, given a well-performing speech recognition engine. They provide the reader with pseudocode for the algorithm they used to construct their speech recognition system, which is very interesting even though we do not build our own speech recognition system.
They conclude that their experiments with SailAlign on the TIMIT database demonstrate the increased robustness of the algorithm, compared to the standard Viterbi-based forced alignment algorithm, even with imperfect transcriptions and noisy audio. SailAlign also shows potential for the exploitation of rich spoken language resources such as collections of audiobooks. The main difference with our approach is, first and foremost, that we do not build our own speech recognition system, and that we test our application using audiobooks instead of the TIMIT database. Our main contributions are the plugin-based system to perform automatic alignment, as well as performing certain tests to verify how CMU Sphinx handles a number of specific issues, such as words missing from the pronunciation dictionary, or an inaccurate input text.

1.4 Outline

First, in Chapter 2, we give an overview of how an automatic speech recognition system generally works, starting off with a brief history, and discussing the different techniques that might be used to construct an automatic speech recognition system. We then discuss the inner workings of the automatic speech recognition system we decided to use for our application, namely CMU Sphinx-4, in Chapter 3. Aside from providing the reader with a general knowledge about CMU Sphinx itself, we also present the configuration we used, discussing each part separately. Chapter 4 explains how our application is put together, and what needs to be done to get it working on someone's computer system. Then, in Chapter 5, we show the results we obtained with our application and, in Chapter 6, we discuss these results, draw some conclusions and explain what further work is still needed.

Chapter 2 Automatic Speech Recognition

This chapter provides the reader with a basic knowledge of automatic speech recognition (ASR). The main goal of ASR is to, given an input audio file, return a textual transcription of what is said in the audio file, or to align a given text input with the given audio input and, thus, provide a time stamp for each spoken syllable/word/sentence/... We start off, in Section 2.1, with a brief history of the accomplishments in the field of ASR so far. We mention the advantages of new techniques and why there was a need for improvement. Then we go deeper into the aforementioned techniques and discuss the different steps that are required to build an ASR system in Section 2.2, offering several options that must be taken into account when designing each step and giving a detailed explanation of those techniques. This section is particularly important to understand the configuration we used for the implementation of our speech recognition system, which is explained in Chapter 3.

2.1 History of ASR

As early as the 1950s, there was an interest in speech recognition. Bell Labs performed small-vocabulary recognition of digits spoken over the telephone, using analogue circuitry. As computing power grew during the 1960s, filter banks were combined with dynamic programming to produce the first practical speech recognizers. These were mostly for isolated words, to simplify the task. In the 1970s, much progress arose in commercial small-vocabulary applications over the telephone, due to the use of custom special-purpose hardware. Linear Predictive Coding (LPC) became a dominant automatic speech recognition (ASR) component, as an automatic and efficient method to represent speech (see Section 2.2.1).
ASR focuses on simulating the human auditory and vocal processes as closely as possible. But the difficulty of handling the immense amount of variability in speech production (and transmission channels) led to the failure of simple if-then decision-tree approaches to ASR for larger vocabularies [51]. Core ASR methodology has evolved from expert-system approaches in the 1970s, using spectral resonance (formant) tracking, to the modern statistical method of Markov models based on a Mel-Frequency Cepstral Coefficient (MFCC) approach (see Section 2.2.1). This has remained the dominant ASR methodology since the late 1980s. LPC is, however, still the standard today in mobile phone speech transmissions. This decade also saw the expansion of the internet, and with that, the creation of large, widely available databases in several languages, allowing for comparative testing and evaluation.

As [51] specifies, it became common practice to non-linearly stretch (or warp) the templates to be compared, to try to synchronize similar acoustic segments in test and reference patterns. This Dynamic Time Warping (DTW) procedure is still used today in some applications [20]. Sets of specific templates of target units, such as phonemes, would be compared to each testing unit, and eventually the one with the closest match would be selected as the estimated label for the input unit. This led to a high computational cost, as well as difficulty in determining which and how many templates to use in the test search. Since then the standard has been Hidden Markov Models (HMMs) (see Section 2.2.2), in which statistical models replace the templates, since they have the power to transform large numbers of training units into simpler probabilistic models. Instead of seeking the template closest to a test frame, test data is evaluated against sets of Probability Density Functions (PDFs), selecting the PDF with the highest probability.

During the 1990s, commercial applications evolved from isolated-word dictation systems to general-purpose continuous-speech systems. Experiments with wavelets, where the variable time-frequency tiling matches human perception more closely, were the next step in the research towards ASR systems. But the non-linearity of wavelets has been a major obstacle to their use [50]. Artificial Neural Networks (ANNs) (see Section 2.2.2) and Support Vector Machines (SVMs) were introduced in ASR, but are not as versatile as HMMs. SVMs maximize the distance (called the "margin") between the observed data samples and the function used to classify the data. They generalize better than ANNs, and tend to be better than most non-linear classifiers for noisy speech. But, unlike HMMs, SVMs are essentially binary classifiers, and do not provide a direct probability estimation, according to [51]. Thus, they need to be modified to handle general ASR, where input is usually not just "yes" versus "no". HMMs also do better on problems such as temporal duration normalisation and segmentation of speech, as basic SVMs expect a fixed-length input vector [57]. ANNs have not replaced HMMs for ASR, owing to their relative inflexibility to handle timing variability. Among promising new approaches was the idea to focus attention on specific patterns of both time and frequency, and not simplistically force the ASR analysis into a frame-by-frame approach [55]. Progress occurred in the use of finite state networks, statistical learning algorithms, discriminative training, and kernel-based methods [34].
Since the mid-90s, ASR has been largely implemented in software, e.g. for medical reporting, legal dictation, and automation of telephone services. With the recent adoption of speech recognition in Apple, Google, and Microsoft products, the ever-improving ability of devices to handle relatively unrestricted multimodal dialogues (i.e., consisting of a mixture of text, audio and video, to create extra meaning) is showing clearly. Despite the remaining challenges, the fruits of several decades of research and development into the speech recognition field can now be seen. As Huang, Baker, and Reddy [32] said: "We believe the speech community is en route to pass the Turing Test in the next 40 years with the ultimate goal to match and exceed a human's speech recognition capability for everyday scenarios." (The phrase "the Turing Test" is most properly used to refer to a proposal made by Turing (1950) as a way of dealing with the question whether machines can think [49], and is now performed as a test to determine the 'humanity' of the machine.)

Figure 2.1: An overview of historical progress on machine speech recognition performance; figure taken from [46]

Figure 2.1 shows the progressive word error rate (WER) reduction achieved by increasingly better speaker-independent (SI) systems from 1988 to 2009 [2, 1]. Increasingly difficult speech data was used for the evaluation, often after the error rate for the preceding easier data had been brought down to a satisfactorily low level. This figure illustrates that, on average, a relative error-reduction rate of about 10% annually has been maintained through most of these years. The authors of [46] point out that there are two other noticeable and significant trends that can be identified from the figure. First, dramatic performance differences exist for noisy (due to acoustic environment distortion) and clean speech data in an otherwise identical task (as is illustrated by the evaluation for 1995). Such differences have also been observed by nearly all speech recognizers used in industrial laboratories, and intensive research continues in order to reduce these differences. Second, speech of a conversational and casual style incurs much higher error rates than other types of speech. Acoustic environment distortion and the casual nature of conversational speech form the basis for two principal technical challenges in current speech recognition technology.

2.2 General Design of ASR Systems

The main goal of speech recognition is to find the most likely word sequence, given the observed acoustic signal. Solving the speech decoding problem, then, consists of finding the maximum of the probability of the word sequence $w$ given the signal $x$, or, equivalently, maximizing the "fundamental equation of speech recognition" $\Pr(w)\,f(x \mid w)$.

Most state-of-the-art automatic speech recognition systems use statistical models. This means that speech is assumed to be generated by a language model and an acoustic model. The language model generates estimates of $\Pr(w)$ for all word strings $w$ and depends on high-level constraints and linguistic knowledge about the allowed word strings for the specific task. The acoustic model encodes the message $w$ in the acoustic signal $x$, which is represented by a probability density function $f(x \mid w)$. It describes the statistics of sequences of parametrized acoustic observations in the feature space, given the corresponding uttered words.
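Written out explicitly, and only as a restatement of the decoding problem above (not an additional result), the most likely word sequence follows from Bayes' rule:

\[
\hat{w} \;=\; \arg\max_{w} \Pr(w \mid x)
        \;=\; \arg\max_{w} \frac{\Pr(w)\, f(x \mid w)}{f(x)}
        \;=\; \arg\max_{w} \Pr(w)\, f(x \mid w),
\]

where the evidence $f(x)$ can be dropped because it does not depend on the word sequence $w$.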
Figure 2.2 shows the main components of such a speech recognition system. The main knowledge sources are the speech and text corpus, which represent the training data, and the pronunciation dictionary. The training of the acoustic and language model relies on the normalisation and preprocessing, such as N-gram estimation and feature extraction, of the training data. This helps to reduce lexical variability and transforms the texts to better represent the spoken language. However, this step is language specific. It includes rules on how to process numbers, hyphenation, abbreviations and acronyms, apostrophes, etc.

Figure 2.2: System diagram of a speech recognizer based on statistical models, including training and decoding processes; figure adapted from [40]

After training, the resulting acoustic and language models are used for the actual speech decoding. The input speech signal is first processed by the acoustic front end, which usually performs feature extraction, and then passed on to the decoder. With the language model, acoustic model and pronunciation dictionary at its disposal, the decoder is able to perform the actual speech recognition and returns the speech transcription to the user.

According to [6], the different parts in Figure 2.2 can be grouped into the so-called five basic stages of ASR:
1. Signal Processing/Feature Extraction (see Section 2.2.1): this stage corresponds to the "Acoustic Front end" in Figure 2.2. The same techniques are also used on the "Speech Corpus", in the "Feature Extraction" step.
2. Acoustic Modelling (see Section 2.2.2): this stage encompasses the different steps needed to build the acoustic model, such as the "HMM Training" in the figure above.
3. Pronunciation Modelling (see Section 2.2.3): this stage creates the pronunciation dictionary, which is used by the decoder.
4. Language Modelling (see Section 2.2.4): in this stage, the language model is created. The last, and most important, step for its creation is the "N-gram estimation", as can be seen in Figure 2.2.
5. Spoken Language Understanding/Dialogue Systems (see Section 2.2.5): this stage refers to the entire system that is built and how it interacts with the user.

Table 2.1 shows the dates when some of the techniques discussed below were accepted for use in ASR.

Advance                   | Date  | Impact
Linear Predictive Coding  | 1969  | Automatic, simple speech compression
Dynamic Time Warping      | 1970  | Reduces search while allowing temporal flexibility
Hidden Markov Model       | 1975  | Treat both temporal and spectral variation statistically
Mel-frequency Cepstrum    | 1980  | Improved auditory-based speech compression
Language Models           | 1980s | Including language redundancy improves ASR quality
Neural Networks           | 1980s | Excellent static nonlinear classifier
Kernel-based classifiers  | 1998  | Better discriminative training
Dynamic Bayesian Networks | 1999  | More general statistical networks

Table 2.1: Major advances in ASR methodology; table taken from [51]

In some cases, the dates are approximate, as they reflect a gradual acceptance of new technology, rather than a specific breakthrough event.

2.2.1 Stage 1: Signal Processing/Feature Extraction

The first necessary step for speech recognition is the extraction of useful acoustic features from the speech waveform. This is done by using signal processing techniques.
Although it is theoretically possible to recognize speech directly from a digitized waveform, almost all ASR systems perform some spectral transformation on the speech signal. This is because numerous experiments on the human auditory system and its characteristics show that the inner ear acts as a spectral analyser. It has also been concluded, via analysis of human speech production, that humans tend to control the spectral content of their speech much more than the phase (time) domain details of their speech waveforms, as explained in [51].

The most widely used acoustic feature sets for ASR are Mel-Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Prediction (PLP). There are, however, plenty of other options, such as Linear Predictive Coding (LPC), Relative Spectral-PLP (RASTA-PLP), Power Normalized Cepstrum Coefficients (PNCCs), etc.

Fourier Transform (FT)

The simplest spectral mapping is the Fourier transform, in the digital domain realised as the Discrete Fourier Transform (DFT), or, in practice, the Fast Fourier Transform (FFT). In the case of a periodic function over time (for example, a speech signal), the Fourier transform can be simplified to the calculation of a discrete set of complex amplitudes, called Fourier series coefficients. They represent the frequency spectrum of the original time-domain signal. When a time-domain function is sampled to facilitate storage or computer processing, it is still possible to recreate a version of the original Fourier transform according to the Poisson summation formula, also known as the discrete-time Fourier transform.

While very useful for transforming speech into a representation more easily exploited by an ASR system, the DFT accomplishes little data reduction, as is explained in [51]. ASR thus needs to compress the speech information further. The simplest way to do this is via band-pass filtering, or "sub-bands". ASR has typically employed fixed bandwidths in sub-band analysis. The objective is to approximate the spectral envelope of the DFT, while greatly reducing the number of parameters.
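For reference, the DFT of a length-$N$ frame $x[n]$ is the standard transform

\[
X[k] \;=\; \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1,
\]

which the FFT computes in $O(N \log N)$ operations. This is a textbook definition added here for convenience; it is not specific to any particular ASR front end.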
Linear Predictive Coding (LPC)

The Linear Predictive Coding (LPC) model compresses by about two orders of magnitude, effectively smoothing the DFT [50]. As described in [15], the LPC coefficients make up a model of the vocal tract shape that produced the original speech signal. A spectrum generated from these coefficients shows the properties of the vocal tract shape without the interference of the source spectrum. One can take the spectrum of the filter in various ways, for example by passing an impulse through the filter and taking its DFT.

LPC is based on the simplified speech production model in Figure 2.3. It starts with the assumption that the speech signal is produced by an excitation at the end of a tube, as the authors of [22] specify. The glottis (the space between the vocal cords) produces the excitation, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances, called formants. LPC analyses the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining excitation. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue. The speech signal is synthesized by reversing the process: use the residue to create a source signal, use the formants to create a filter (which represents the tube), and run the source through the filter, resulting in speech.

In the 1970s, LPC became a dominant ASR representation, as an automatic and efficient method to represent speech. It is still the standard today in cellphone speech transmissions, but was replaced by the MFCC approach (see below) in the 1980s.

Figure 2.3: LPC speech production scheme

Mel-Frequency Cepstral Coefficients (MFCC)

Figure 2.4: MFCC feature extraction procedure; figure adapted from [6]

Front-end methods that use MFCC for feature extraction are based on a triangle-shaped frequency integration, the Mel filter bank [51]. Figure 2.4 shows how the MFCC feature extraction is performed. First, the input signal, representing the speech waveform, is passed through a Mel-scaled filter bank. Then, it is low-pass filtered and downsampled. Lastly, a discrete cosine transformation (DCT) is performed on the log-outputs from the filters, as if it were a signal. This approach needs no difficult decisions to determine the features, ASR results appear to be better than with other methods, and one may interpret the MFCCs as roughly uncorrelated.

Notwithstanding their widespread use, the authors of [51] claim MFCCs are suboptimal. They give equal weight to high and low amplitudes in the log spectrum, despite the well-known fact that high energy dominates perception. Thus, when speech is corrupted by noise, the MFCCs deteriorate. When different speakers present varied spectral patterns for the same phoneme, the lack of interpretability of the MFCCs forces one to use simple merging of distributions to handle different speakers [51]. A related approach, called Perceptual Linear Prediction (PLP), employs a nonlinearly compressed power spectrum.

Perceptual Linear Prediction (PLP)

PLP [28] is a popular feature extraction method, because it is considered to be a more noise-robust solution. It applies critical band analysis for auditory modelling. As shown in Figure 2.5, it first transforms the spectrum to the Bark scale, using a trapezoid-like filter shape (instead of the triangle-shaped Mel filter). Then, it performs the equal loudness pre-emphasis, to estimate the frequency-dependent volume sensitivity of hearing. After the Inverse Discrete Fourier Transform (IDFT), the cepstral coefficients are computed from the linear prediction (LP) coefficients.

Figure 2.5: PLP feature extraction procedure; figure adapted from [6]

Power Normalized Cepstral Coefficients (PNCC)

PNCC [37] is a recently introduced front-end technique, similar to MFCC, but where the Mel-scale transformation is replaced by Gammatone filters, simulating the behaviour of the cochlea (the auditory portion of the inner ear, famous for its snail shell shape). PNCC also includes a step called medium time power bias removal to increase robustness. This bias vector is calculated using the arithmetic-to-geometric mean ratio, to estimate the quality reduction of speech caused by noise.
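As a compact reference for the Mel-based front end above, the Mel mapping and the final log-DCT step are usually written as

\[
\operatorname{mel}(f) \;=\; 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right), \qquad
c_n \;=\; \sum_{k=1}^{K} \log(E_k)\,\cos\!\left[\frac{\pi n}{K}\left(k - \tfrac{1}{2}\right)\right],
\]

where $f$ is the frequency in Hz, $E_k$ is the output energy of the $k$-th of $K$ Mel filters, and $c_n$ is the $n$-th cepstral coefficient. These are standard textbook formulations; the exact constants and the number of filters and coefficients vary per implementation and are not prescribed by this chapter.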
2.2.2 Stage 2: Acoustic Modelling

The feature extraction performed in the previous stage is important to choose and optimize the acoustic features of the signal in order to, ultimately, reduce the complexity of the acoustic model, while still maintaining the relevant linguistic information for speech recognition. Acoustic modelling must take into account different sources of variability that are present in the speech signal, namely those arising from the linguistic context, and those from the non-linguistic context such as the speaker, the acoustic environment and the recording channel. Acoustic models contain a statistical representation of the phonemes (distinct sound units) that make up a word in the dictionary. Basically, after the feature extraction, the acoustic model decides how to model the distribution of the feature vectors. Hidden Markov Models (HMMs) [53, 31] are the most popular (parametric) model at the acoustic level, but there are multiple ways to model that distribution:

Gaussian Mixture Models (GMMs)

If one has chosen the feature space well, class Probability Density Functions (PDFs) should be both smooth and unimodal, for example, Gaussian. However, as explained in [51], actual PDFs in most ASR are not smooth, which has led to the widespread use of "mixtures" of PDFs to model speech units. Such Gaussian Mixture Models (GMMs) are typically described by a set of simple Gaussian curves (each characterized by a mean vector, i.e. $N$ values for an $N$-dimensional PDF, and an $N$-by-$N$ covariance matrix), as well as a set of weights for the contributions of each Gaussian to the overall PDF. A mixture of Gaussians can be represented by the following formula: $\Pr(x_i) = \sum_j c_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$.

ASR often makes the assumption that parameters are uncorrelated, which allows the use of simpler matrices, despite clear evidence that the dimensions are nearly always correlated. This usually leads to poorer, but faster, recognition. As diagonal covariance matrices greatly oversimplify reality, there are recent compromises that constrain the inverse covariance matrices in various ways: so-called semi-tied and subspace-constrained GMMs [4].

Hidden Markov Models (HMMs)

An HMM is a pair of stochastic processes: a hidden Markov chain and an observable process, which is a probabilistic function of the states of the chain. This means that observable events in the real world, modelled with probability distributions, are the observable part of the model, associated with individual states of a discrete-time, first-order Markovian process. The semantics of the model are usually encapsulated in the hidden part. If every phoneme is regarded as one of the hidden states, and every feature vector is regarded as a possible observation, the entire speech process can be represented as an HMM. An HMM is defined by [64]:
1. A set $S$ of $N$ states, $S = \{x_1, \dots, x_N\}$, which are the distinct values that the discrete, hidden stochastic process can take.
2. An initial state probability distribution, i.e. $\pi = \{\Pr(x_i \mid t = 0),\, x_i \in S\}$, where $t$ is a discrete time index.
3. A probability distribution that characterizes the allowed transitions between states, that is $a_{q_t, q_{t-1}} = \{\Pr(x_t = q_t \mid x_{t-1} = q_{t-1}),\, x_t \in S,\, x_{t-1} \in S\}$, where the transition probabilities $a_{q_t, q_{t-1}}$ are assumed to be independent of the time $t$.
4. An observation or feature space $F$, which is a discrete or continuous universe of all possible observable events (usually a subset of $\mathbb{R}^d$, where $d$ is the dimensionality of the observations).
5. A set of probability distributions (referred to as emission or output probabilities) that describes the statistical properties of the observations for each state of the model: $b_k = \{b_i(k) = \Pr(k \mid x_i),\, x_i \in S,\, k \in F\}$.

The likelihood of an observation sequence under the model can be represented by this formula: $\Pr(x_{0:T}) = \sum_{q_{0:T}} \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} p(x_t \mid q_t, \lambda)$.

HMMs represent a learning paradigm: their parameters are estimated from training data with dedicated learning algorithms. The most popular of such algorithms are the forward-backward (or Baum-Welch) and the Viterbi algorithms [52]. Both of these algorithms are based on the general Maximum-Likelihood (ML) criterion and, when continuous emission probabilities are considered, aim at maximizing the probability of the samples, given the model at hand. The Viterbi algorithm, specifically, concentrates solely on the most likely path through all the possible sequences of states in the model.

As specified in [64], once training has been accomplished, the HMM can be used for decoding or recognition. Whenever $N$ different HMMs (corresponding to models of $N$ different events or classes defined in the feature space) are used, decoding (classification) means assigning each new sequence of observations to the most likely model. When a single HMM is used, decoding (recognition) means finding the most likely path of states within the model and assigning each individual observation to a given state within the model.

A major distinction has to be made between discrete HMMs and continuous density HMMs (CDHMMs). According to [64], the first type uses discrete probability distributions to model the emission probabilities. They require a quantisation of a continuous input space. CDHMMs, however, use continuous PDFs (usually referred to as likelihoods) to describe the statistics of the acoustic features within the HMM states, and are usually best suited for very difficult ASR tasks, since they exhibit better modelling accuracy. (Gaussians, or mixtures of Gaussian components, are the most popular and effective choices of PDFs for CDHMMs.)

Although HMMs are effective approaches to the problem of acoustic modelling in ASR, allowing for good recognition performance under many circumstances, the authors of [64] explain, they also suffer from some limitations. Standard CDHMMs present poor discriminative power among different models, since they are based on the maximum-likelihood (ML) criterion, which is in itself non-discriminative. Classical HMMs rely strongly on assumptions about the statistical properties of the problem. Also, generalizing the basic HMM to allow for Markov models of order higher than one does raise ASR accuracy, but the computational complexity of such models limits their implementability in hardware. All of the above limitations have driven researchers towards a hybrid solution, using neural networks and HMMs. Artificial Neural Networks (ANN), with their discriminative training, capability to perform non-parametric estimation over whole sequences of patterns, and limited number of parameters, definitely appeared promising.
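Before moving on to neural networks, the Viterbi decoding step mentioned above can be made concrete with a minimal sketch. The code below is a plain log-domain Viterbi for a discrete-observation HMM with initial probabilities pi, transition matrix a and emission table b; it is illustrative only and is not the decoder used by Sphinx, which scores continuous Gaussian mixtures instead of a discrete emission table.

class Viterbi {
    /** Returns the most likely state sequence for the observation sequence obs. */
    static int[] decode(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length, t = obs.length;
        if (t == 0) return new int[0];
        double[][] delta = new double[t][n];   // best log-score ending in state j at each step
        int[][] psi = new int[t][n];           // back-pointers to recover the best path
        for (int j = 0; j < n; j++)
            delta[0][j] = Math.log(pi[j]) + Math.log(b[j][obs[0]]);
        for (int step = 1; step < t; step++) {
            for (int j = 0; j < n; j++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < n; i++) {
                    double score = delta[step - 1][i] + Math.log(a[i][j]);
                    if (score > best) { best = score; arg = i; }
                }
                delta[step][j] = best + Math.log(b[j][obs[step]]);
                psi[step][j] = arg;
            }
        }
        // Backtrack from the best final state.
        int bestLast = 0;
        for (int j = 1; j < n; j++)
            if (delta[t - 1][j] > delta[t - 1][bestLast]) bestLast = j;
        int[] path = new int[t];
        path[t - 1] = bestLast;
        for (int step = t - 1; step > 0; step--) path[step - 1] = psi[step][path[step]];
        return path;
    }
}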
Time-delay neural networks [68], also known as tapped delay lines, represent an effective attempt to train a static multilayer perceptron (MLP) [7] for time-sequence processing, by converting the temporal sequence into a spatial sequence over corresponding units. The idea was applied in a variety of ASR applications, mostly for phoneme recognition [67]. Recurrent neural networks (RNNs) provide a powerful extension of feed-forward connectionist models by allowing connections between arbitrary pairs of units, independently of their position within the topology of the network.

In spite of their ability to classify short-time acoustic-phonetic units, ANNs failed as a general framework for ASR, the authors of [64] explain, especially with long sequences of acoustic observations, like those required to represent words from a dictionary, or whole sentences. The authors of [64] claim this is mainly due to the limited ability of ANNs to model long-term dependencies. In the early 1990s, this led to the idea of combining HMMs and ANNs in a single model, the hybrid ANN/HMM. This hybrid model relies on an underlying HMM structure, capable of modelling long-term dependencies, and integrates ANNs to provide non-parametric universal approximation, probability estimation, discriminative training algorithms, fewer parameters to estimate than usually required for HMMs, efficient computation of outputs at recognition time, and efficient hardware implementability. Different hybrid architectures and training/decoding algorithms have been researched, dependent on the nature of the ASR task, the type of HMM used, or the specific role of the ANN in the hybrid system [24, 42, 45, 47, 27, 61, 5]. The hybrid approach often allowed for significant improvements in performance with respect to standard approaches to difficult ASR tasks. Unlike standard HMMs, which have a consolidated and homogeneous theoretical framework, the hybrid ANN/HMM systems are a fairly recent research field, with no unified formulation. The proposed ANN/HMM hybrid architectures can therefore be divided into five categories, according to [64]:
1. Early attempts;
2. ANNs to estimate the HMM state-posterior probabilities;
3. Global optimisation;
4. Networks as vector quantizers for discrete HMMs;
5. Other approaches.

In spite of the promise that ANN/HMM showed for large-vocabulary speech recognition, it has not been adopted by commercial speech recognition solutions. After the invention of discriminative training, which refines the model and improves accuracy, the conventional, context-dependent Gaussian mixture model HMMs (CD-GMM-HMMs) outperformed the ANN/HMM models when it came to large-vocabulary speech recognition. Recently, however, there has been a renewed interest in neural networks, namely Deep Neural Networks (DNNs). They offer learned feature representations, overcome the inefficiency in data representation of the GMM, and can thus replace the GMM directly. Deep learning can also be used to learn powerful discriminative features for a traditional HMM speech recognition system. The advantage of this hybrid system is that decades of speech recognition technologies, developed by speech recognition researchers, can be used directly. A combination of DNN and HMM produced significant error reductions [29, 38, 69, 18, 71] in comparison to some of the ANN/HMM efforts discussed above.
In this new hybrid system, the speech classes for the DNN are typically represented by tied HMM states, which is a technique directly inherited from earlier speech systems [33].

2.2.3 Stage 3: Pronunciation Modelling

The pronunciation lexicon is the link between the representation at the acoustic level (see Section 2.2.2) and at the word level (see Section 2.2.4). Generally, there are multiple possible pronunciations for a word in continuous speech. At the lexical and pronunciation level, the two main sources of variability are the dialect and the individual preferences of the speaker; see, for example, Figure 2.6, which shows the possible pronunciations of the word 'and'.

Figure 2.6: Possible pronunciations of the word 'and'; figure adapted from [6]

Figure 2.7: Steps in pronunciation modelling (word sequence, pronunciation lexicon and pronunciation model, phone sequence, phonetic decision tree, state/model sequence); figure adapted from [54]

The purpose of text collection is to learn how a language is written, so that a language model (see Section 2.2.4) may be constructed. The purpose of audio collection is to learn how a language is spoken, so that acoustic models (see Section 2.2.2) may be constructed. The point of pronunciation modelling is to connect these two realms, as shown in Figure 2.7. According to [54], the pronunciation model consists of these components:
1. A definition of the elemental sounds in a language. Many possible pronunciation units may be used to model all existing ways to pronounce a word:
• Phonemes and their realisations (phones): this unit is most commonly used;
• Syllables: this is an intermediate unit, between phones and word level;
• Individual articulatory gestures.
2. A dictionary that describes how words in a language are pronounced. The pronunciation dictionary (PD) is the link between the acoustic model and the language model, see Figure 2.8. It is basically a lexicon and provides the way to connect acoustic models (HMMs) to words. For example, in a simple architecture, each HMM state represents one phoneme in the language, a word is represented as a sequence of phonemes, and dictionaries come with different phone sets (for the English language this is usually between 38 and 45 different phonemes). CMUdict defines 39 separate phones (38 plus one for silence (SIL)) for American English [8]. If the pronunciations in the dictionary deviate sharply from the spoken language, one creates a model with great variance, and similar data, with essentially the same phonemes, is distributed over different models.
3. Post-lexical rules for altering the pronunciation of words spoken in context.

Figure 2.8: Links between the pronunciation dictionary, audio and text; figure adapted from [54]

Lexical coverage has a large impact on the recognition performance, and the accuracy of the acoustic models is linked to the consistency of the pronunciations in the lexicon. This is why the pronunciation vocabulary is usually selected to maximize the lexical coverage for a given lexicon size. Each out-of-vocabulary (OOV) word causes more than a single error (usually between 1.5 and 2 errors) on average, so word list selection is an important design step.

2.2.4 Stage 4: Language Modelling

Prior to the 1980s, ASR only used acoustic information to evaluate text hypotheses.
It was then noted that incorporating knowledge about the text being spoken would significantly raise ASR accuracy, and language models (LMs) were included in ASR design. As mentioned before, the language model generates estimates of Pr(w) for all word strings w and depends on high-level constraints and linguistic knowledge about the allowed word strings for a specific task. Given a history of prior words in an utterance, the number P of words that one must consider as possibly coming next is much smaller than the vocabulary size V. P is called the perplexity of a language model. LMs are stochastic descriptions of text: likelihoods of local sequences containing N consecutive words in training texts (typically N = 1, 2, 3, 4). Integrating an LM with the normal acoustic HMMs is now common practice in ASR.

As described in [51], N-gram models typically estimate the likelihood of each word, given the context of the preceding N − 1 words. These probabilities are obtained through the analysis of large amounts of text, and capture both syntactic and semantic redundancies in the text. As the vocabulary size V increases for practical ASR, the size of an LM (up to V^N parameters) grows dramatically. Large lexicons lead to seriously under-trained LMs, a lack of appropriate training texts, increased memory needs, and a lack of computation power to search all textual possibilities. As a result, most ASR systems do not employ N-grams with N higher than four. While 3- and 4-gram LMs are most widely used, class-based N-grams and adapted LMs are recent research areas trying to improve LM accuracy. Some techniques have been adopted to solve the data sparsity problem, such as smoothing and back-off. Back-off methods fall back on lower-order statistics when higher-order N-grams do not occur in the training texts [36]. Grammar constraints are often imposed on LMs, and LMs may be refined in terms of parts-of-speech classes. They can also be designed for specific tasks.

HMMs, which combine the acoustic model, the pronunciation model and the language model, have been widely used in ASR:
• Different utterances will have a different length, e.g. stop consonants ('k', 'g', 'p', ...) are always short, whereas vowels will generally be longer.
• Ways of comparing variable-length features:
– Earlier solution: Dynamic Time Warping (DTW)
– Modern solution: Hidden Markov Models

The Hidden Markov Model is illustrated in Figure 2.9. The parameters a_{ij} are the transition probabilities from state i to state j. The observation probability b_j(o_t) represents the output probability of observation o_t given the state j. Since the HMM states j are not observed, they are called "hidden" states.

Figure 2.9: Hidden Markov Model; figure adapted from [70]

Modelling with HMMs is a multi-level classification problem. According to [6], from the highest level to the lower levels, it can be described as W (word) → A (acoustic unit) → Q (HMM state) → X (acoustic observation), with

p(x \mid w) = \sum_a \sum_q p(x, q, a \mid w) = \sum_a \sum_q p(x \mid q) \, p(q \mid a) \, p(a \mid w)

where a is the phone, q is the HMM state, and w is the word. Speech recognition requires a large search space to search for the best sequences (a, q, w), as in:

(a, q, w)^* = \arg\max_{a, q, w} p(x, q, a, w)

Therefore one wants to maximize the joint probability, using Viterbi decoding or stack decoding techniques. An important sub-field of ASR concerns determining whether speakers have said something beyond an acceptable vocabulary, i.e. out-of-vocabulary (OOV) detection [43].
ASR generally searches a dictionary to estimate, for each section of an audio signal, which word forms the best match. One does not wish to output (incorrect) words from an official dictionary when a speaker has coughed or said something beyond the accepted range [65]. It is practically important to detect such OOV conditions.

2.2.5 Stage 5: Spoken Language Understanding/Dialogue Systems

The operational definition of "language understanding" is that the spoken human-computer interface reacts, or provides output, in such a way that the user of the speech system is satisfied with it and has achieved the desired goal. In the last decade, according to [26], a variety of practical goal-oriented spoken language understanding (SLU) systems have been built for limited domains. One characterisation of these systems is the way they allow humans to interact with them. On one end, there are machine-initiative systems, commonly known as interactive voice response (IVR) systems [62]. In IVR systems, the interaction is controlled by the machine. Machine-initiative systems ask a user specific questions and expect the user to input one of the predetermined keywords or phrases. For example, a Global Positioning System may prompt the user to say "target address", "calculate route", "change route", etc. In such a system, SLU is reduced to detecting one of the allowed keywords or phrases in the user's utterances. On the other extreme, there are user-initiative systems in which a user controls the flow and a machine simply executes the user's commands. A compromise in any realistic application is to develop a mixed-initiative system [25], where both the user and the system can assume control of the flow of the dialogue. Mixed-initiative systems provide users the flexibility to ask questions and provide information in any sequence they choose. Although such systems are known to be more complex, they have proven to provide more natural human/machine interaction. If the output is speech, e.g. for Text-to-Speech (TTS) systems, the spoken language dialogue system needs to respond naturally. It needs to perform discourse modelling and generate appropriate textual responses, or produce natural, pleasant sounding synthetic speech.

The most important techniques to remember from this chapter are Mel-frequency cepstral coefficients, which are used for the feature extraction and spectral analysis in our ASR configuration (see Section 3.3.5), and hidden Markov models, which are used for the acoustic model (see Section 3.3.4).

Chapter 3
Decomposition of CMU Sphinx-4

There are many open-source automatic speech recognition systems available online, such as HTK (developed by Cambridge University) [70], Julius (Kyoto University) [39] and the ISIP Production System (Mississippi State University) [44]. We have chosen to work with the CMU Sphinx open-source ASR system, since it is often referenced in scientific papers, and has a large wiki [11] and an active forum [10]. It also has a less restrictive license compared to the HTK system. The first two sections in this chapter provide a general overview of CMU Sphinx, which is a group of speech recognition systems developed at Carnegie Mellon University (CMU). In Section 3.1, a short history of the different Sphinx versions is given. We discuss the advantages of each version, and in which environment they should ideally be used.
As there is no general, extensive explanation available for all of Sphinx-4's components, we based ourselves on [41] for one component, and on the source code and documentation [11, 10] for the others. Our findings can be found in Sections 3.2 and 3.3. Section 3.2 describes the high-level architecture of Sphinx-4, the version of Sphinx our system implements. A more detailed description of the configuration used by our Sphinx implementation can be found in Section 3.3. This is where the techniques and methods for speech recognition explained in Chapter 2 are applied.

3.1 History of CMU Sphinx

CMU Sphinx is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University (CMU). They include a series of speech recognizers (Sphinx-2 through 4) and an acoustic model trainer (SphinxTrain). In 2000, the Sphinx group at Carnegie Mellon committed to open-sourcing several speech recognizer components, including Sphinx-2 and, a year later, Sphinx-3. The speech decoders come with acoustic models and sample applications. The available resources include software for acoustic model training, language model compilation and a public-domain pronunciation dictionary for English, "cmudict".
• CMU Sphinx is a continuous-speech, speaker-independent recognition system that uses hidden Markov acoustic models (HMMs) and an N-gram statistical language model. It was developed by Kai-Fu Lee in 1986. Sphinx demonstrated the feasibility of continuous-speech, speaker-independent, large-vocabulary recognition, the possibility of which was in dispute at the time. CMU Sphinx is of historical interest only; it has been superseded in performance by subsequent versions.
• CMU Sphinx-2 is a fast, performance-oriented recognizer, originally developed by Xuedong Huang at Carnegie Mellon and released as open source with a BSD-style license. Sphinx-2 focuses on real-time recognition suitable for spoken language applications. It incorporates functionality such as end-pointing, partial hypothesis generation, dynamic language model switching, and so on. It is used in dialogue systems and language learning systems. The Sphinx-2 code has also been incorporated into a number of commercial products, but is no longer under active development (other than for routine maintenance). Sphinx-2 uses a semi-continuous representation for acoustic modelling (a single set of Gaussians is used for all models, with individual models represented as a weight vector over these Gaussians).
• Sphinx-3 adopted the prevalent continuous HMM representation and has been used primarily for high-accuracy, non-real-time recognition. Recent developments (in algorithms and in hardware) have made Sphinx-3 "near" real-time, although it is not yet suitable for use in critical interactive applications. It is currently under active development and, in conjunction with SphinxTrain, it provides access to a number of modern modelling techniques that improve recognition accuracy.
• The Sphinx-4 speech recognition system [11] is the latest addition to Carnegie Mellon University's repository of Sphinx speech recognition systems. It has been jointly designed by Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs, and Hewlett-Packard's Cambridge Research Lab. It is different from the earlier CMU Sphinx systems in terms of modularity, flexibility and algorithmic aspects.
It uses newer search strategies, and is universal in its acceptance of various kinds of grammars, language models, types of acoustic models and feature streams. Algorithmic innovations included in the system design enable it to incorporate multiple information sources in a more elegant manner compared to the other systems in the Sphinx family. Sphinx-4 is developed entirely in the Java programming language and is thus very portable. It also enables and uses multi-threading and permits highly flexible user interfacing.
• PocketSphinx is a version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor, such as most portable devices) [11]. It is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation.

3.2 Architecture of CMU Sphinx

Figure 3.1: High-level architecture of CMU Sphinx-4 (speech application; front end with feature computation; decoder with search and graph construction; knowledge base with dictionary, language model and acoustic model); figure adapted from [41]

The high-level architecture of CMU Sphinx-4, as seen in Figure 3.1, is fairly straightforward. The three main blocks are the front end, the decoder, and the knowledge base, which are all controllable by an external application, which provides the input speech and transforms the output to the desired format, if needed. The Sphinx-4 architecture is designed with a high degree of modularity. All blocks are independently replaceable software modules, except for the blocks within the knowledge base, and are written in Java. (Stacked blocks in the figure indicate multiple types which can be used simultaneously.) Even within each module shown in Figure 3.1, the code is very modular, with functions that are easy to replace.

3.2.1 Front End Module

The front end is responsible for gathering, annotating, and processing the input data (speech signal). The annotations provided by the front end include, amongst others, the beginning and ending of a data segment. It also extracts features from the input data, to be read by the decoder. To do this, it can be based on Mel Frequency Cepstral Coefficients (MFCCs) for audio signal representation (see Section 2.2.1), as are other modern general-purpose speech recognition systems, by altering the Sphinx configuration file (see Section 3.3).

The front end provides a set of high-level classes and interfaces that are used to perform digital signal processing for speech recognition [9]. It is modelled as a series of data processors, each of which performs a specific signal processing function on the incoming data, as shown in Figure 3.2. For example, one processor performs the Fast Fourier Transform (FFT, see Section 2.2.1) on the input data, while another processor performs high-pass filtering. Thus, the incoming data is transformed as it passes through each data processor.

Figure 3.2: High-level design of the CMU Sphinx front end (a chain of data processors through which the data flows); figure adapted from the Sphinx documentation [9]

Data enters and exits the front end, and passes between the implemented data processors. The input data for the front end is typically audio data, but any input type is allowed, such as spectra, cepstra, etc. Similarly, the output data is typically features, but any output type is possible.
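To make this chained-processor idea concrete, the following simplified Java sketch shows how such a pipeline of data processors could be expressed. It is a conceptual illustration only: the interface and class names are hypothetical and do not correspond to the actual Sphinx-4 front end API.

// Conceptual sketch of a chain of data processors, in the spirit of the
// Sphinx-4 front end design. Names are hypothetical, not the Sphinx API.
import java.util.List;

interface DataProcessor {
    double[] process(double[] input);   // transform one block of data
}

class HighPassFilter implements DataProcessor {
    public double[] process(double[] input) {
        double[] out = new double[input.length];
        out[0] = input[0];
        for (int i = 1; i < input.length; i++) {
            out[i] = input[i] - 0.97 * input[i - 1];  // simple pre-emphasis-style filter
        }
        return out;
    }
}

class Pipeline {
    private final List<DataProcessor> processors;

    Pipeline(List<DataProcessor> processors) {
        this.processors = processors;
    }

    double[] run(double[] data) {
        // Each processor transforms the output of the previous one,
        // just as the data flows through the front end chain.
        for (DataProcessor p : processors) {
            data = p.process(data);
        }
        return data;
    }
}

In the real front end, the elements of such a chain and their order are not hard-coded but read from the configuration file, as described in Section 3.3.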
Sphinx-4 also allows for the specification of multiple front end pipelines, and of multiple instances of the same data processor in the same pipeline. In order to obtain a front end, it must be specified in the Sphinx configuration file of the application, see Section 3.3.

3.2.2 Decoder Module

The decoder performs the main component of speech recognition, namely the actual recognition. It reads the features received from the front end, couples them with data from the knowledge base provided by the application (such as the language model, information from the pronunciation dictionary, and the structural information from the acoustic model, or sets of parallel acoustic models), and constructs the language HMM in the graph construction module. Then it performs a graph search to determine which phoneme sequence would be the most likely to represent the series of features. In Sphinx-4, the graph construction module is also called the "linguist" [41]. The term "search space" is used to describe the possible most likely sequences of phonemes, and is dynamically updated by the decoder during the search.

Many different versions of Sphinx decoders exist. The decision about which version to use depends on how familiar one is with C/Python (for the use of PocketSphinx) or Java (for the use of Sphinx-4), and how easy it is to integrate these into the system under development. Currently there are the following choices:
• PocketSphinx is CMU's fastest speech recognition system so far. It is a library written in pure C code, which is optimal for the development of C applications as well as for the creation of language bindings. It is the most accurate engine at real-time speed, and is therefore a good choice for live applications. It is also a good choice for desktop applications, command and control, and dictation applications where fast response and low resource consumption are the main constraints.
• Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java programming language. It works best for implementations of complex servers or cloud-based systems with deep interaction with natural language processing (NLP) modules, web services and cloud computing. The system permits the use of any level of context in the definition of the basic sound units. One by-product of the system's modular design is that it becomes easy to implement it in hardware. Some new design aspects [41], compared to Sphinx-3's decoder, include graph construction for multilevel parallel decoding with independent simultaneous feature streams without the use of compound HMMs, the incorporation of a generalized search algorithm that subsumes Viterbi and full-forward decoding as special cases, and the design of generalized language HMM graphs from grammars and language models of multiple standard formats, which toggles trivially from a flat search structure to a tree search structure, etc.
• Sphinx-3 is CMU's large-vocabulary speech recognition system. It is an older C-based decoder that continues to be maintained, and is still the most accurate decoder for large-vocabulary tasks. It is now also used as a baseline to check recognizer accuracies.
• Sphinx-2 is a fast speech recognition system, the predecessor of PocketSphinx. It is not being actively developed at this time, but is still widely used in interactive applications. It uses HMMs with semi-continuous output probability density functions (PDFs).
Even though it is not as accurate as Sphinx-3 or Sphinx-4, it runs at real-time speeds, and is thus a good choice for live applications. These older decoders are obsolete and no longer supported; their use is not recommended.

3.2.3 Knowledge Base Module

The knowledge base of CMU Sphinx consists of three parts: the acoustic model, the language model, and the (pronunciation) dictionary.

Acoustic Model
The acoustic model contains a statistical representation of the distinct phonemes that make up a word in the dictionary. Each phoneme is modelled by a sequence of states and the observation probability distribution of the sounds one might observe in each state. For a more general, extensive description, see Section 2.2.2. Sphinx-4 can handle any number of states, but they must be specified during training. An acoustic model provides a phoneme-level speech structure. Acoustic models can be trained for any language, task or condition, using the SphinxTrain tool developed by CMU [13].

Pronunciation Dictionary
The decoder needs to know the pronunciation of words to perform the graph search, which is why a pronunciation dictionary is needed. Basically, this is a list of words and their possible pronunciations. For more information, see Section 2.2.3.

Language Model
The language model contains all words and their probability of appearing in a certain sequence. It provides a word-level language structure. Language models generally fall into two categories: either graph-driven models, which are similar to first-order Markov models and base the word probability solely on the previous word; or N-gram models, which are similar to (N − 1)-th order Markov models and base the word probability on the N − 1 previous words. For a more extensive explanation of language models, see Section 2.2.4. Sphinx-4 defaults to using trigram models.

3.2.4 Work Flow of a Sphinx-4 Run

Figure 3.3 shows the interfaces used by Sphinx-4 to perform the speech recognition task. We now give a short description of the work flow through these interfaces; the detailed choices for these interfaces, and more information, can be found in Section 3.3.

Figure 3.3: Basic flow chart of how the components of Sphinx-4 fit together (Application, Recognizer, FrontEnd, Decoder, SearchManager, Linguist, AcousticModel, Dictionary, LanguageModel, Scorer, Pruner, ActiveList, SearchGraph, ConfigurationManager); figure adapted from [21]

The starting point is the audio Input, either live from a microphone or pre-recorded in an audio file. The configuration file is used to set all the variables; see Section 3.3.1 for our configuration file. Most of the components the system uses are configurable Java interfaces. The Configuration Manager loads all these options and variables, as the first step for the Application. The FrontEnd is then constructed and generates Feature vectors from the Input, preferably using the same process as was used during training, see Section 3.2.1. The Linguist generates the SearchGraph, for which it uses the AcousticModel, LanguageModel and pronunciation Dictionary that are specified in the configuration file. The Decoder then constructs the SearchManager, which, in turn, initializes the Scorer, Pruner and ActiveList, see Section 3.2.2 for more information about the decoder.
The SearchManager can then use the Features and the SearchGraph to find the best-fitting path, which represents the best transcription, and, finally, the Result is passed back to the Application as a series of recognized words. It is worth knowing that, once the initial configuration is complete, the recognition process can be repeated without having to reinitialize everything.

3.3 Our CMU Sphinx Configuration

The Sphinx-4 configuration manager system has two primary purposes [9]:
• To determine which components are to be used in the system. The Sphinx-4 system is designed to be extremely flexible. At runtime, just about any component can be replaced with another. For example, in Sphinx-4 the "FrontEnd" component provides acoustic features that are scored against the acoustic model. Typically, Sphinx-4 is configured with a front end that produces Mel-frequency cepstral coefficients (MFCCs, see Section 2.2.1); however, it is possible to reconfigure Sphinx-4 to use a different front end that, for instance, produces Perceptual Linear Prediction coefficients (PLP, see Section 2.2.1).
• To determine the detailed configuration of each of these components. The Sphinx-4 system has a large number of parameters that control how the system functions. For instance, a beam width is sometimes used to control the number of active search paths maintained during the speech decoding. A larger value for this beam width can sometimes yield higher recognition accuracy at the expense of longer decoding times.

The Sphinx configuration manager can be used to flexibly design the system like this, with the use of a configuration file. This configuration file defines the names and types of all of the components of the system, the connectivity of these components (that is, which components talk to each other), and the detailed configuration of each of these components. The format of this file is XML. The most important configuration decisions are listed below; a more complete explanation of our configuration choices can be found in the sections that follow.
• property name="logLevel" value="WARNING"
We use the WARNING level, which only provides information when something went wrong but the system is able to continue its task, as well as severe errors when the system cannot continue its task. This setting does not overwhelm us with information.
• component name="recognizer"
We are able to specify a number of monitors in the decoder module. These monitors allow us to keep track of certain characteristics of Sphinx during the alignment task, e.g. the memoryTracker and speedTracker monitors, which are used by our application and track, respectively, memory usage and processing speed.
• property name="relativeBeamWidth" value="1E-300"
The relative beam width specifies a threshold for which active search paths to keep, based on their acoustic likelihood computation. The more negative the exponent, the more search paths are kept and the more accurate the alignment becomes, at the cost of more computational power and an increase in processing time. If processing speed and computational power are not a major concern, making the exponent more negative is recommended.
• component name="dictionary"
When working with our application, the first thing that needs to be verified is whether the dictionary and filler path refer to the correct locations, i.e. the location of the pronunciation dictionary and the noise dictionary.
• property name="acousticModel"
For this component and its properties, the location must also be verified and, if necessary, changed to refer to the location of the acoustic model.
• component name="frontEnd"
The front end we use performs feature extraction using Mel-frequency cepstral coefficients. For more information about this technique, see Section 2.2.1.

In the sections below, we explain the configuration used by our system to recognize English, with the help of excerpts from the configuration file. The configuration file used to recognize Dutch can be found in Appendix A.

3.3.1 Global Properties

<property name="logLevel" value="WARNING"/>
<property name="absoluteBeamWidth" value="-1"/>
<property name="relativeBeamWidth" value="1E-300"/>
<property name="wordInsertionProbability" value="1.0"/>
<property name="languageWeight" value="10"/>
<property name="addOOVBranch" value="true"/>
<property name="showCreations" value="false"/>
<property name="outOfGrammarProbability" value="1E-26"/>
<property name="phoneInsertionProbability" value="1E-140"/>
<property name="frontend" value="epFrontEnd"/>
<property name="recognizer" value="recognizer"/>

Figure 3.4: Global properties

Figure 3.4 shows how the global properties of the Sphinx system are defined. All Sphinx components that need to output informational messages use the Sphinx-4 logger. Each message has an importance level, which can be assigned in the "logLevel" property.
• The SEVERE level means an error occurred that makes continuing the operation difficult or impossible; this is the highest importance level.
• WARNING means that something went wrong but the system is still able to continue, e.g. when a word is missing from the pronunciation dictionary.
• INFO provides general information.
• The CONFIG level means that information about a component's configuration will be output.
• FINE, FINER, and FINEST (which is the lowest logging level) provide fine-grained tracing messages.

Our system uses the WARNING level, which does not overwhelm us with information, but still allows us to know what is happening during the execution of the application, and, most importantly, flags a warning when a word is missing from the dictionary, so it can be added. The other global properties defined here are used to specify values throughout the Sphinx configuration, and are explained below, where the configuration makes use of them.

3.3.2 Recognizer and Decoder Components

<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
    <property name="decoder" value="decoder"/>
    <propertylist name="monitors">
        <item>memoryTracker</item>
        <item>speedTracker</item>
    </propertylist>
</component>
<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
    <property name="searchManager" value="searchManager"/>
</component>
<component name="searchManager" type="edu.cmu.sphinx.decoder.search.AlignerSearchManager">
    <property name="logMath" value="logMath"/>
    <property name="linguist" value="aflatLinguist"/>
    <property name="pruner" value="trivialPruner"/>
    <property name="scorer" value="threadedScorer"/>
    <property name="activeListFactory" value="activeList"/>
</component>

Figure 3.5: Recognizer and Decoder components

Figure 3.5 shows the general properties for the decoder. The recognizer [9] contains the main components of Sphinx-4 (front end, linguist, and decoder). Most interaction from the application with the internal Sphinx-4 system happens through the recognizer.
This is also where monitors can be specified, to keep track of speed, accuracy, memory use, etc.

The decoder [9] contains the search manager, which performs the graph search using a certain algorithm, e.g. breadth-first search, best-first search, depth-first search, etc. It also contains the feature scorer and the pruner. The specific details of its components are described below.

Active List Component

<component name="activeList" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="logMath" value="logMath"/>
    <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
    <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>

Figure 3.6: ActiveList component

The active list component configuration is shown in Figure 3.6. The active list is a list of tokens representing all the states in the search graph that are active in the current feature frame. Our configuration consists of a "PartitionActiveListFactory" [9], which produces a "PartitionActiveList" object. This partitions the list of tokens according to the absolute beam width. The absolute beam width limits the number of elements in the active list. It controls the number of active search paths maintained during the pruning stage of the speech decoding. At each frame, if there are more paths than the specified absolute beam width value, then only the best ones are kept and the rest are discarded. A larger value for this beam width can sometimes yield higher recognition accuracy at the expense of longer decoding times. We set a value of -1, which means an unbounded list, as is the norm, because the relative beam width does a good job at pruning the list. The relative beam width is used to create a threshold for which tokens to keep, based on their acoustic likelihood computation. Anything scoring less than the relative beam width value multiplied by the best score is pruned. The relative beam width uses a negative exponent to represent a very small fraction: the more negative the exponent, the fewer search paths you discard and the more accurate the recognition, but, again, at the trade-off of increased search time.

Pruner and Scorer Components

<component name="trivialPruner" type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>
<component name="threadedScorer" type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
    <property name="frontend" value="${frontend}"/>
</component>

Figure 3.7: Pruner and Scorer configurations

The pruner is responsible for the pruning of the active list, according to certain strategies. The "SimplePruner" [9] that is used in our configuration performs the default pruning behaviour and invokes the purge on the active list. The scorer scores the current feature frame against all active states in the active list, which is why it has access to the front end. (See Section 3.3.2 for more information about the values used for scoring.) The "ThreadedAcousticScorer" [9] is an acoustic scorer that breaks the scoring up into a configurable number of separate threads. All scores are maintained in the "LogMath" log base.
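To make the interplay between the absolute and relative beam widths concrete, the following minimal Java sketch applies both prunings to a list of scored tokens. It is an illustration of the pruning idea only, with hypothetical class names; it is not the Sphinx-4 implementation, which, for instance, keeps its scores in the "LogMath" log base rather than in natural logarithms.

// Minimal, hypothetical sketch of beam pruning over scored search tokens.
// Not the Sphinx-4 code; for illustration of the concept only.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Token {
    final double logScore;  // acoustic likelihood of this search path, in the log domain
    Token(double logScore) { this.logScore = logScore; }
}

class BeamPruner {
    List<Token> prune(List<Token> activeList, int absoluteBeamWidth, double relativeBeamWidth) {
        // Sort tokens from best (highest log score) to worst.
        List<Token> sorted = new ArrayList<>(activeList);
        sorted.sort(Comparator.comparingDouble((Token t) -> t.logScore).reversed());

        double bestScore = sorted.isEmpty() ? Double.NEGATIVE_INFINITY : sorted.get(0).logScore;
        // "score >= bestScore * relativeBeamWidth" becomes an additive threshold in log space.
        double threshold = bestScore + Math.log(relativeBeamWidth);

        List<Token> kept = new ArrayList<>();
        for (Token t : sorted) {
            boolean withinAbsoluteBeam = absoluteBeamWidth < 0 || kept.size() < absoluteBeamWidth;
            if (withinAbsoluteBeam && t.logScore >= threshold) {
                kept.add(t);
            }
        }
        return kept;
    }
}

With the values from our configuration (absoluteBeamWidth = -1, i.e. an unbounded list, and relativeBeamWidth = 1E-300), only the relative threshold is effectively applied.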
Linguist Component

<component name="aflatLinguist" type="edu.cmu.sphinx.linguist.aflat.AFlatLinguist">
    <property name="logMath" value="logMath"/>
    <property name="grammar" value="AlignerGrammar"/>
    <property name="acousticModel" value="wsj"/>
    <property name="addOutOfGrammarBranch" value="${addOOVBranch}"/>
    <property name="outOfGrammarProbability" value="${outOfGrammarProbability}"/>
    <property name="unitManager" value="unitManager"/>
    <property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
    <property name="phoneInsertionProbability" value="${phoneInsertionProbability}"/>
    <property name="languageWeight" value="${languageWeight}"/>
    <property name="phoneLoopAcousticModel" value="WSJ"/>
    <property name="dumpGStates" value="true"/>
</component>

Figure 3.8: Linguist component

Figure 3.8 shows our system's configuration for the linguist. The linguist embodies the linguistic knowledge of the system, which consists of the acoustic model, the dictionary, and the language model. It produces the search graph structure on which the search manager performs the graph search, using different algorithms. The "AFlatLinguist" [9] is a simple form of linguist. It makes the following simplifying assumptions:
• One or no words per grammar node.
• No fan-in allowed.
• No composites.
• The graph only includes unit, HMM, and pronunciation states (and the initial/final grammar state); no word, alternative or grammar states are included.
• Only valid transitions (matching contexts) are allowed.
• No tree organization of units.
• Branching grammar states are allowed.

It is a dynamic version of the flat linguist that is more efficient in terms of start-up time and overall footprint. All probabilities are maintained in the log math domain. There are a number of property values that can be specified in order to make the linguist work properly. The acoustic model property is used to define which acoustic model to use when building the search graph. The grammar property defines which grammar must be used when building the search graph, see Section 3.3.3. The "addOutOfGrammarBranch" property allows one to specify whether to add a branch for detecting out-of-grammar utterances, and the out-of-grammar probability defines the chance of entering the out-of-grammar branch. The unit manager property is used to define which unit manager to use when building the search graph, see Section 3.3.4. The phone insertion probability property specifies the probability of inserting a context-independent (CI) phone in the out-of-grammar CI phone loop, and the phone loop acoustic model defines which acoustic model to use to build the phone loop that detects out-of-grammar utterances. This acoustic model does not need to be the same as the model used for the search graph, see Section 3.3.4.

The language weight, also called the language model scaling factor, decides how much relative importance is given to the actual acoustic probabilities of the words in the search path. A low language weight gives more leeway for words with high acoustic probabilities to be chosen, at the risk of choosing non-existent words. One can decode several times with different language weights, without re-training the acoustic models, to decide what is best for the system. A value between 6 and 13 is standard, and by default the language weight is not applied.

The word insertion penalty is an important heuristic parameter in any dynamic programming algorithm.
It is the number that decides how much penalty to apply to a new word during the search. If new words were not penalized, the decoder would tend to choose the smallest words possible, since every new word inserted leads to an additional increase in the score of any path, as a result of the inclusion of the inserted word's language probability from the language model. The word insertion probability controls the word-break recognition. If the value is near 1, the recognizer is more likely to break the perceived text into words, e.g. the sentence "A. D. 6" is preferred with a high word insertion probability, while "eighty six" is preferred if the word insertion probability is low. This value is related to the word insertion penalty. We use a value of 1, since the proposed test texts mainly consist of small, elementary-school level words.

3.3.3 Grammar Component

<component name="AlignerGrammar" type="edu.cmu.sphinx.linguist.language.grammar.AlignerGrammar">
    <property name="dictionary" value="dictionary"/>
    <property name="logMath" value="logMath"/>
    <property name="addSilenceWords" value="true"/>
    <property name="allowLoopsAndBackwardJumps" value="allowLoopsAndBackwardJumps"/>
    <property name="selfLoopProbability" value="selfLoopProbability"/>
    <property name="backwardTransitionProbability" value="backwardTransitionProbability"/>
</component>

Figure 3.9: Grammar component

The "AlignerGrammar" [9] component was created to provide a customizable grammar able to incorporate speech disfluencies such as deletions, substitutions, and repetitions in the audio input. A grammar is represented internally as a graph, in which we allow the inclusion of silence words by setting the "addSilenceWords" value to true. The "allowLoopsAndBackwardJumps", "selfLoopProbability", and "backwardTransitionProbability" property values are all defined inside the "AlignerGrammar" class. This will likely be changed to accept the values specified in the configuration file in a later version of this Sphinx branch. We have kept the predefined values in the class for use in our system. All grammar probabilities are maintained in the "LogMath" log domain. The dictionary to use with this grammar is referenced by the dictionary property, and is defined in the section below.

Dictionary Component

<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.AllWordDictionary">
    <property name="dictionaryPath" value="resource:/en/dict/cmudict.0.7a"/>
    <property name="fillerPath" value="resource:/en/noisedict"/>
    <property name="dictionaryLanguage" value="EN"/>
    <property name="addSilEndingPronunciation" value="true"/>
    <property name="wordReplacement" value="&lt;sil&gt;"/>
    <property name="unitManager" value="unitManager"/>
</component>

Figure 3.10: Dictionary configuration

This is the most important part of the configuration file in terms of what to alter when running the program on your own installation. The "dictionaryPath" and "fillerPath" must be changed to point to the dictionary file and the filler dictionary file, respectively. The "dictionaryLanguage" must be changed as well, to the language the system must support. The "AllWordDictionary" [9] creates a dictionary by quickly reading in an ASCII-based, Sphinx-3 format pronunciation dictionary. It loads each line of the dictionary into a hash table, assuming that most words are not going to be used. Only when a word is actually used are its pronunciations copied into an array of pronunciations.
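As an illustration of this lazy-loading behaviour, the following sketch reads a Sphinx-3 style dictionary (the format is described in the next paragraph and shown in Table 3.1) into a map of raw lines, and only splits a line into its phone sequence when the word is first requested. The class and method names are ours; this is not the actual "AllWordDictionary" implementation.

// Illustrative lazy-loading pronunciation dictionary.
// Hypothetical sketch, not the Sphinx-4 AllWordDictionary code.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

class LazyDictionary {
    // word -> raw pronunciation string, e.g. "ONE" -> "HH W AH N"
    private final Map<String, String> rawEntries = new HashMap<>();
    // word -> parsed phone array, filled in only when the word is actually used
    private final Map<String, String[]> parsed = new HashMap<>();

    LazyDictionary(String path) throws IOException {
        for (String line : Files.readAllLines(Paths.get(path))) {
            if (line.trim().isEmpty()) continue;
            String[] parts = line.trim().split("\\s+", 2);  // word, then pronunciation
            rawEntries.put(parts[0], parts.length > 1 ? parts[1] : "");
        }
    }

    String[] getPronunciation(String word) {
        return parsed.computeIfAbsent(word, w -> {
            String raw = rawEntries.get(w);
            return raw == null ? null : raw.split("\\s+");
        });
    }
}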
The expected format of the ASCII dictionary is the word, followed by spaces or tabs, followed by the pronunciation(s). For example, an English digits dictionary looks like the first column of Table 3.1. The second column shows the pronunciation dictionary entries for Dutch digits. In this example, the words "one", "zero", and "een" have two pronunciations each. One can clearly see that the way a pronunciation dictionary is built depends on which language it represents. Capitalization is important, and each language has its own way to represent a certain phoneme or sound unit.

ONE      HH W AH N        een      @ n
ONE(2)   W AH N           een(2)   e n
TWO      T UW             twee     t w e
THREE    TH R IY          drie     d r i
FOUR     F AO R           vier     v i r
FIVE     F AY V           vijf     v ei f
SIX      S IH K S         zes      z ee s
SEVEN    S EH V AH N      zeven    z e v @ n
EIGHT    EY T             acht     aa x t
NINE     N AY N           negen    n e gg @ n
ZERO     Z IH R OW        nul      n yy l
ZERO(2)  Z IY R OW
OH       OW

Table 3.1: Example of an English and a Dutch pronunciation dictionary

3.3.4 Acoustic Model Component

<component name="wsj" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
    <property name="loader" value="wsjLoader"/>
    <property name="unitManager" value="unitManager"/>
</component>
<component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="unitManager"/>
    <property name="location" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz"/>
</component>
<component name="unitManager" type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>

Figure 3.11: Acoustic model configuration

Figure 3.11 shows the configuration for the acoustic model. The "TiedStateAcousticModel" [9] loads a tied-state acoustic model generated by the Sphinx-3 trainer. The acoustic model is stored as a directory specified by a URL. The files in that directory are tables, or pools, of means, variances, mixture weights, and transition probabilities. This directory is specified in the location property of the "Sphinx3Loader" [9]. Here it refers to a container file, but a regular directory is also possible. The dictionary and language model files are not required to be in the package; their locations can be specified separately.

An HMM models a process using a sequence of states. Associated with each state there is a probability density function. A popular choice for this function is a Gaussian mixture. As you may recall from Section 2.2.2, a single Gaussian is defined by a mean and a variance, or, in the case of a multidimensional Gaussian, by a mean vector and a covariance matrix, or, under some simplifying assumptions, a variance vector. The means and variances files in the directory contain exactly that: a table in which each line contains a mean vector or a variance vector, respectively. The dimension of these vectors is the same as that of the incoming data, namely the encoded speech signal. The Gaussian mixture is a weighted summation of Gaussians. Each line in the mixture weights file contains the weights for a combination of Gaussians. The transitions between HMM states have an associated probability. These probabilities make up the transition matrices stored in the transition matrices file. The model definition (mdef) file in a way ties everything together. If the recognition system models phonemes, there is an HMM for each phoneme. The model definition file has one line for each phoneme. The phoneme can be context-dependent or independent.
Each line, therefore, identifies a unique HMM. This line has the phoneme identification, the optional left or right context, the index of a transition matrix, and, for each state, the index of a mean vector, a variance vector, and a set of mixture weights. If the model has a layout that is different from the default generated by SphinxTrain, you may specify additional properties, like "dataLocation" to set the path to the binary files, and "mdef" to set the path to the model definition file. The additions in Figure 3.12 are used to define the phone loop acoustic model in the linguist (see Section 3.3.2). They are very similar to the specifications for the acoustic model, but do not necessarily need to point to the same location.

<component name="WSJ" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
    <property name="loader" value="WSJLOADER"/>
    <property name="unitManager" value="UNITMANAGER"/>
</component>
<component name="WSJLOADER" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="UNITMANAGER"/>
    <property name="location" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz"/>
</component>
<component name="UNITMANAGER" type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>

Figure 3.12: Additions for the phone loop acoustic model

3.3.5 Front End Component

Figure 3.13 shows the different components in the front end pipeline (see Section 3.2.1). The components themselves are defined in Figure 3.17. We first list the data processors used in the front end pipeline; more information about each component can be found below.
• audioFileDataSource
• dataBlocker
• preemphasizer
• windower
• fft
• melFilterBank
• dct
• liveCMN
• featureExtraction

<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>audioFileDataSource</item>
        <item>dataBlocker</item>
        <item>preemphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>melFilterBank</item>
        <item>dct</item>
        <item>liveCMN</item>
        <item>featureExtraction</item>
    </propertylist>
</component>

Figure 3.13: Front end configuration

audioFileDataSource
The "AudioFileDataSource" [9] is responsible for generating a stream of audio data from a given audio file. All required information is read directly from the file. This component uses 'JavaSound' as a backend, and is able to handle all audio files supported by it, such as .wav, .au, and .aiff. Besides these, with the use of plugins, it can support .ogg, .mp3, .speex files and more.

dataBlocker
The "DataBlocker" [9] wraps the separate data fragments in blocks of equal, predefined length.

preemphasizer
The "Preemphasizer" [9] component takes a "DataObject" it received from the "DataBlocker" and passes along the same object with pre-emphasis applied to the high frequencies. High-frequency components usually contain much less energy than lower-frequency components, but they are still important for speech recognition. The "Preemphasizer" thus acts as a high-pass filter, because it allows the high-frequency components to "pass through", while weakening or filtering out the low-frequency components.

windower
The windower component slices a "DataObject" up into a number of overlapping windows, also called frames. In order to minimize the signal discontinuities at the boundaries of each frame, the "RaisedCosineWindower" [9] multiplies each frame with a raised cosine windowing function.
The system uses overlapping windows to capture information that may occur at the window boundaries. These events would not be well represented if the windows were juxtaposed. The number of resulting windows depends on the window size and the window shift. Figure 3.14 shows the relationship between the data stream, the window size, the window shift, and the returned windows.

Figure 3.14: The relation between the original data size, the window size and the window shift; figure adapted from the Sphinx documentation [9]

fft
The next component computes the Discrete Fourier Transform (DFT) of an input sequence, using the Fast Fourier Transform (FFT). The Fourier transform is the process of analysing a signal into its frequency components; see Section 2.2.1 for more information about the Fourier transform.

melFilterBank
The "MelFrequencyFilterBank" [9] component filters an input power spectrum through a bank of mel-filters. The output is an array of filtered values, typically called the mel-spectrum, each corresponding to the result of filtering the input spectrum through an individual filter. A visual representation of the mel-filter bank is given in Figure 3.15; the number of triangles equals the number of mel-filters, which equals the length of the mel-spectrum.

Figure 3.15: A mel-filter bank; figure adapted from the Sphinx documentation [9]

The distance at the base from the centre to the left edge is different from that from the centre to the right edge. Since the centre frequencies follow the mel-frequency scale, which is a nonlinear scale that models the nonlinear behaviour of human hearing, the mel-filter bank corresponds to a warping of the frequency axis. Filtering with the mel scale emphasizes the lower frequencies.

dct
The "DiscreteCosineTransform" [9] first applies a logarithm, and then a Discrete Cosine Transform (DCT), to the input data, which is the mel-spectrum received from the previous component in the pipeline. When the input is a mel-spectrum, the vector returned is the MFCC (Mel-Frequency Cepstral Coefficient) vector, where the 0-th element is the energy value. For more information, see Section 2.2.1. This component of the pipeline corresponds to the last stage of converting a signal to cepstra, which are defined as the inverse Fourier transform of the logarithm of the Fourier transform of a signal.

liveCMN
The "LiveCMN" [9] component applies cepstral mean normalization (CMN) to the incoming cepstral data. Its goal is to reduce the distortion caused by the transmission channel. The output is mean-normalized cepstral data. The component does not read the entire stream of Data objects before it calculates the mean; it estimates the mean from already seen data and subtracts it from the Data objects on the fly. Therefore, "LiveCMN" introduces no delay, and it is thus the best choice for live applications.

featureExtraction
The last component in the front end pipeline is the "DeltasFeatureExtractor" [9]. It computes the delta and double delta of the input cepstrum (or PLP features, etc.). The delta is the first-order derivative and the double delta (a.k.a. delta delta) is the second-order derivative of the original cepstrum. They help model the dynamics of the speech signal. The output data is a "FloatData" object with an array formed by the concatenation of the cepstra, delta cepstra, and double delta cepstra, see Figure 3.16. This output is the feature vector that will be used by the decoder.
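As a small illustration of the delta computation described above, the sketch below derives delta coefficients from a sequence of cepstral frames with a simple symmetric difference, and concatenates cepstrum, delta and double delta into one feature vector. The regression window used by the real "DeltasFeatureExtractor" may differ; this is a hypothetical, simplified version.

// Illustrative delta / double-delta computation over cepstral frames.
// Hypothetical sketch, not the actual Sphinx-4 DeltasFeatureExtractor.
class DeltaFeatures {

    // Simple symmetric difference: delta[t] = (x[t+1] - x[t-1]) / 2, clamped at the edges.
    static double[][] deltas(double[][] frames) {
        int n = frames.length;
        int dim = frames[0].length;
        double[][] d = new double[n][dim];
        for (int t = 0; t < n; t++) {
            double[] prev = frames[Math.max(0, t - 1)];
            double[] next = frames[Math.min(n - 1, t + 1)];
            for (int k = 0; k < dim; k++) {
                d[t][k] = (next[k] - prev[k]) / 2.0;
            }
        }
        return d;
    }

    // Final feature vector for frame t: cepstrum, then delta, then double delta (cf. Figure 3.16).
    static double[] feature(double[][] cep, double[][] delta, double[][] deltaDelta, int t) {
        int dim = cep[t].length;
        double[] f = new double[3 * dim];
        System.arraycopy(cep[t], 0, f, 0, dim);
        System.arraycopy(delta[t], 0, f, dim, dim);
        System.arraycopy(deltaDelta[t], 0, f, 2 * dim, dim);
        return f;
    }
}

The double delta is obtained by applying deltas a second time, to the delta frames; the three parts together form the feature vector passed to the decoder.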
Figure 3.16: Layout of the returned features (cepstrum, delta, double delta); figure adapted from the Sphinx documentation [9]

<component name="audioFileDataSource" type="edu.cmu.sphinx.frontend.util.AudioFileDataSource"/>
<component name="dataBlocker" type="edu.cmu.sphinx.frontend.DataBlocker"/>
<component name="preemphasizer" type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower" type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower"/>
<component name="fft" type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>
<component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank"/>
<component name="dct" type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>
<component name="liveCMN" type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>
<component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>
<component name="logMath" type="edu.cmu.sphinx.util.LogMath">
    <property name="logBase" value="1.0001"/>
    <property name="useAddTable" value="true"/>
</component>

Figure 3.17: Front end pipeline elements

3.3.6 Monitors

Figure 3.18 shows a number of examples of possible monitors and their configuration. They can be used by adding them to the decoder, as seen in Figure 3.5. The accuracy tracker is turned off in our configuration, but would track and report the recognition accuracy based on the highest scoring search path of the result, since it uses the "BestPathAccuracyTracker" [9] class. The "MemoryTracker" [9] class monitors the memory usage of the recognition task, while the "SpeedTracker" [9] reports on the speed of the recognition. Apart from giving the actual time lapse, it also reports the amount of time used in relation to the length of the audio input file.

<component name="accuracyTracker" type="edu.cmu.sphinx.instrumentation.BestPathAccuracyTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="showAlignedResults" value="false"/>
    <property name="showRawResults" value="false"/>
</component>
<component name="memoryTracker" type="edu.cmu.sphinx.instrumentation.MemoryTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="showSummary" value="true"/>
    <property name="showDetails" value="false"/>
</component>
<component name="speedTracker" type="edu.cmu.sphinx.instrumentation.SpeedTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="frontend" value="${frontend}"/>
    <property name="showSummary" value="true"/>
    <property name="showDetails" value="false"/>
</component>

Figure 3.18: Example of monitors

This chapter gave a detailed description of the configuration file used to specify the ASR techniques that the Sphinx system in our application uses. These techniques represent the inner workings of the alignment application we designed, which is described in Chapter 4.

Chapter 4
Our Application

This chapter describes the plugin system we created in detail. We first briefly discuss a high-level view of our application and explain the reasoning behind our high-level choices. We then explain each component more elaborately, and provide the reader with a number of best practices, i.e. how to get our system up and running. We end the chapter with an explanation of the possible output formats of our application, namely the .srt file format or EPUB.

4.1 High-Level View

Our application is written entirely in the Java programming language, and consists of three separate components, see Figure 4.1.
• The first component is the Main component. This is where the main functionality of our application is located, and it contains the parsing of the command line and chooses the plugin, as well as loads it into the application (as the plugin is located in a different component, see below). It also contains the testing framework. • The second component is the Plugin Knowledge component, which contains all the functionality one needs to implement the actual plugin. It provides the user with two possible output formats, namely a standard subtitle file (.srt file format) and an EPUB file. This component receives the audio and text input from the main component, and passes it along to the ASR plugin component. • The third component is where the ASR plugin is actually located. We refer to this component as the (ASR) Plugin component, since it contains the ASR system. In our application we use the Sphinx-4 plugin, see Chapter 3 for more information. We decided to split our application in these three components for two reasons. The first reason is that due to this splitting in several components, the only part that needs any knowledge of the ASR plugin that is used, is the third component. Both other components 49 Main Component Plugin Knowledge Component ASR Plugin Component Figure 4.1: High-level view of our application have no knowledge at all about the inner workings of the ASR system, and can thus easily be altered by someone who just wants some small tweaks to the main output, or the tests output, and has no idea how the ASR component works. The second reason is to keep the addition of a new plugin to the application as easy as possible. If someone wants to change the ASR system that is used by our application, they only need to provide a link from the second component to the plugin component (see Section 4.2). By splitting up the first and second components, they do not need to work out all the extra classes that are used by the first component, which have no impact on the plugin ASR whatsoever, and can start work even if only being provided with the second component. The arrows in Figure 4.1 represent which component has access to which other component. The first and third component only need access to the second one. This component is added to the libraries of the first and third component, to provide said access. We will now discuss these components in more detail, in the following sections. As our application is developed in Java, the components correspond to separate projects, which is why, in the following sections, we will call them ‘projects’. The first component is called the ‘pluginsystem’ project, the second component is called the ‘pluginsystem.plugin’ project, and the third component is called the ‘pluginsystem.plugins.sphinxlongaudioaligner’ project in our java application. 50 4.2 Components 2 & 3: Projects pluginsystem.plugin & pluginsystem.plugins.sphinxlongaudioaligner Figure 4.2 contains all the classes in the second project, and the one class from the plugin project that links these projects together, namely the SphinxLongSpeechRecognizer class. The interactions between the classes are also shown. 
<<interface>> SpeechRecognizer
  +align(audioFile:String, textFile:String): void
  +setConfiguration(configuration:HashMap<String,String>): void
  +addSpeechRecognizedListener(listener:SpeechRecognizedListener): void
  +removeSpeechRecognizedListener(listener:SpeechRecognizedListener): void

<<abstract>> AbstractSpeechRecognizer
  +align(audioFile:String, textFile:String): void

<<interface>> SpeechRecognizedListener
  +wordRecognized(word:String, start:long, duration:long): void
  +endOfSpeechReached(): void

SphinxLongSpeechRecognizer
  +align(audioFile:String, textFile:String): void

FileHelpers    SrtOutputListener    EPubOutputListener

Figure 4.2: UML scheme of the pluginsystem.plugin project, including the class that links the plugin to our application, namely SphinxLongSpeechRecognizer

The second project contains all the classes and interfaces one needs to implement their own plugin for our system. Its most important class is the AbstractSpeechRecognizer, which implements the SpeechRecognizer interface and must be extended by a class in the third project (the plugin project). The only abstract method in AbstractSpeechRecognizer is the align(String audioFile, String textFile) method. This method receives a reference to the text input file and the audio input file that need to be synchronized. How this is done is obviously highly dependent on which ASR system is used, which is why this method must be implemented by a class in the plugin project.

When the SphinxLongSpeechRecognizer class of the ASR plugin recognizes, or aligns, a word, it calls the insertMissingWords(ArrayList<String> inputText, String word, long startTime, long duration, long previousEndTime) method. Even though we provide the Sphinx ASR system with the entire spoken text, it does not necessarily recognize all words of that text. Therefore, we created the insertMissingWords method to check against the input text file and insert any missing words with their corresponding start times and durations. The start time is set equal to the end time of the previous word. The duration is calculated by taking the time between the end of the previous word and the end of the next recognized word, and dividing it by the total number of letters in the missing words and the next recognized word.1 Each missing word then gets a duration based on the length of the word. Once each word has been assigned a start time and duration, we pass it on to the wordRecognized(String word, long start, long duration) method of the specified listener. The listener will then output each word with its designated times according to the desired output format, e.g. an .srt file.

New listeners for new output formats can easily be created. They only have to implement the SpeechRecognizedListener interface, which consists of just two methods. The first method, as described above, specifies what action needs to be taken when a word is aligned; the second method specifies what needs to happen when the end of the audio file is reached, e.g. close the output file.

The FileHelpers class contains some methods that are useful when reading a text file. For example, the removePunctuation() method specifies how the ASR plugin should react to different punctuation symbols. We created this method because Sphinx-4, like most ASR systems, cannot handle punctuation and removes it from the input text. However, some symbols are better replaced by a space character, or are not recognised by Sphinx at all, and in those cases this method comes in handy.
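To make the timing interpolation described above concrete, the following simplified sketch distributes the gap between the previous end time and the end time of the next recognized word over the missing words, proportionally to their length in letters. It is an illustration only and does not reproduce the actual insertMissingWords implementation (whose signature differs); it assumes the SpeechRecognizedListener interface from Figure 4.2.

import java.util.List;

// Illustrative sketch only; the real insertMissingWords method in our plugin
// has a different signature and more bookkeeping.
final class MissingWordTimingSketch {

    static void assignTimings(List<String> missingWords, String recognizedWord,
                              long previousEndTime, long recognizedEndTime,
                              SpeechRecognizedListener listener) {
        long gap = recognizedEndTime - previousEndTime;
        int totalLetters = recognizedWord.length();
        for (String word : missingWords) {
            totalLetters += word.length();
        }
        long start = previousEndTime; // missing words start where the previous word ended
        for (String word : missingWords) {
            long duration = gap * word.length() / totalLetters; // proportional to word length
            listener.wordRecognized(word, start, duration);
            start += duration;
        }
        // the recognized word receives the remainder of the gap
        listener.wordRecognized(recognizedWord, start, recognizedEndTime - start);
    }
}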
Important note: To define which ASR plugin must be used, the user needs to alter (or add) the "pluginsystem.plugins.SpeechRecognizer" file, which can be found in the "META-INF.services" folder of the plugin source code. The file is named after the interface that is implemented. The first line in that file must refer to the class that implements this interface in the plugin project (the third project); e.g. in our application, to use the Sphinx-4 plugin, the first line reads 'pluginsystem.plugins.sphinx.SphinxLongSpeechRecognizer'. See Section 4.3 for more information.

1 As vowels tend to have a 'longer' pronunciation than consonants [51], it would be more accurate to take into account the number of vowels in each word when calculating its duration. However, even for the same vowel, pronunciation lengths can vary greatly, e.g. "I'm" versus "hiccup" for the vowel i. Then there is also the problem of double vowels: are they pronounced as one sound, as in the word "boat", or separately, as in the word "coefficients"? It is easy enough to see that how to handle vowels is highly dependent on the word and its language, and we therefore decided not to discriminate between vowels and consonants.

4.3 Component 1: Project pluginsystem

Figure 4.3 contains the classes in the first project, and their interactions. The MainProgram class is the main class of our application. It processes the command line arguments and loads the plugin, using the SpeechRecognizerService and ClassLoader classes.

Figure 4.3: UML scheme of the pluginsystem project (classes: SpeechRecognizerService, MainProgram, FolderURLClassLoader, SystemClassLoader, TextConverter, TestAccuracy)

Loading the ASR plugin consists of two steps. The first is to add all the required resources to the class path, e.g. the .jar file of the plugin, the ASR system it uses, required references, etc. Our application automatically adds all .jar files located in the "plugins" directory to the class path. This is done using the addURL method of Java's default class loader, the URLClassLoader. We use reflection to call this method, as it has the modifier protected. An alternative approach could have been to create our own ClassLoader class that extends URLClassLoader and makes the addURL method public.2 The functionality to locate files and add them to the class path can be found in the SystemClassLoader class.

2 Since we use the default ClassLoader class, all referenced resources will be added to the class path as well.

The second, and final, step is to find, and create an instance of, the class that implements the SpeechRecognizer interface. To achieve this, the concept of services and service providers is used. A service can be defined as a well-known set of interfaces, while a service provider is a specific implementation of a service. The Java SE Development Kit comes with a simple service provider loading facility located in the java.util.ServiceLoader class. The ServiceLoader class requires that all service providers are identified by placing one or more provider-configuration files in the resource directory "META-INF/services". As mentioned at the end of Section 4.2, the name of the file should correspond with the fully-qualified binary name of the service type. The content consists of one or more lines, where each line is the fully-qualified binary name of a service provider.
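A condensed sketch of these two steps is shown below. It uses only standard Java APIs (reflection on URLClassLoader.addURL and java.util.ServiceLoader); the actual SystemClassLoader and SpeechRecognizerService classes in our projects may differ in detail.

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ServiceLoader;

// Sketch of the two plugin-loading steps; not the literal code of our application.
public final class PluginLoadingSketch {

    // Step 1: add every .jar in the "plugins" folder to the class path via reflection.
    static void addPluginJars(File pluginDirectory) throws Exception {
        Method addURL = URLClassLoader.class.getDeclaredMethod("addURL", URL.class);
        addURL.setAccessible(true); // addURL is protected
        URLClassLoader systemLoader = (URLClassLoader) ClassLoader.getSystemClassLoader();
        File[] files = pluginDirectory.listFiles();
        if (files == null) {
            return; // directory does not exist or is empty
        }
        for (File jar : files) {
            if (jar.getName().endsWith(".jar")) {
                addURL.invoke(systemLoader, jar.toURI().toURL());
            }
        }
    }

    // Step 2: locate a provider of the SpeechRecognizer service via ServiceLoader.
    static SpeechRecognizer loadRecognizer() {
        for (SpeechRecognizer recognizer : ServiceLoader.load(SpeechRecognizer.class)) {
            return recognizer; // provider listed in META-INF/services
        }
        throw new IllegalStateException("No SpeechRecognizer plugin found on the class path");
    }
}

Note that casting the system class loader to URLClassLoader works on the Java versions current at the time of writing; this fragility is one reason why the alternative of a custom class loader could be preferable.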
As stated before, this means that in our case, every plugin should have a provider-configuration file called “pluginsystem.plugins.SpeechRecognizer” with, as content, the fully-qualified binary name of the class that implements this interface. In order to improve the maintainability of our code, we added an abstraction level around the ServiceLoader object in class SpeechRecognizerService. This class has a static function called getSpeechRecognizer which returns the SpeechRecognizer object that should be used by our system. Internally, this class uses the ServiceLoader class as described above. The TextConverter and TestAccuracy classes are used to perform the accuracy tests on the received output. They can be performed using the commands in Figure 4.4. The <<jarfile>> in the command, is the path to the .jar-file that contains the runnable program code. $> java -jar <<jarfile>> TEST CONVERT "narration_helper.txt" "PlayLane_manual_SRTfile" $> java -jar <<jarfile>> TEST "PlayLane_manual_SRTfile converted.srt" "output_SRTfile" Figure 4.4: The commands used for testing accuracy The PlayLane company uses two files to specify timings for each word. The first file (the “narration helper.txt” file) contains a label for each word (see the first column in Table 4.1), the second file is an .srt file, which specifies a word label for each timing (see the second column in Table 4.1). Due to this internally used file format, to specify which word needs to be highlighted at which time, we first need to convert these PlayLane files to a regular .srt file, so we can easily compare this manual transcription with our automatically generated transcription. This is done by the TEST CONVERT command, as shown in Figure 4.4. After the PlayLane .srt file contains the corresponding words, we can compare our automatically generated transcription with their manual transcription. This is done by the TEST command in Figure 4.4. The accuracy results are discussed in Section 5.2 of Chapter 5 How all the classes of our application interact, between the different projects, can be seen in Figure 4.5. 54 8 : 11 : en 151 00:01:45,100 --> 00:01:45,620 8:13 8 : 12 : een 8 : 13 : huis 152 00:01:45,620 --> 00:01:46,060 8:14 8 : 14 : vol 8 : 15 : vuur 153 00:01:46,060 --> 00:01:46,560 8:15 8 : 16 : . 8 : 17 : Bas 154 00:01:48,120 --> 00:01:48,480 8:17 8 : 18 : kijkt 8 : 19 : zijn 155 00:01:48,850 --> 00:01:49,100 8:18 Table 4.1: Example excerpt of the “narration helper.txt” file (on the left) and its corresponding .srt file (on the right) for the book “De luie stoel” 4.4 4.4.1 Best Practices for Automatic Alignment How to add a new Plugin to our System 1. If a user decides to use a different ASR system than the already provided Sphinx-4, with our application, the first thing they need to do is add the second project to the libraries of their plugin. Or, in case their plugin does not provide access to its source code, create a new java project and add the ASR plugin and the second project to that new java project’s libraries. 2. Then they need to create a new java class (or alter an already existing class) to extend the AbstractSpeechRecognizer class, and implement the align method, as explained in Section 4.2. This method will call on the ASR functions from the plugin. 3. Add the “pluginsystem.plugins.SpeechRecognizer” file (without extension!), to the “META-INF.services” folder of the plugin source code. 
Add the name of the class created in step 2 above, including the packages it is part of, to the first line of this file (see Section 4.2 for more information on this file).

4. Build the plugin project (this will create a .jar file for your project) and copy the generated content from the "dist" folder3 to the "plugins" folder of the first project.

3 The "dist" folder should now contain the .jar file for the plugin project and a folder called "lib", which contains all the libraries needed for this ASR plugin.

Figure 4.5: UML scheme of the entire application (classes: SpeechRecognizerService, MainProgram, TestAccuracy, FolderURLClassLoader, SystemClassLoader, TextConverter, SpeechRecognizer, FileHelpers, AbstractSpeechRecognizer, SpeechRecognizedListener, EPubOutputListener, SphinxLongSpeechRecognizer, SrtOutputListener)

Doing this will ensure that, when running the application, it has access to all the libraries its projects might need. Also remove the .jar files of plugins that should not be used when running the application.

4.4.2 How to run the Application

5. Firstly, the input audio file needs to conform to a number of characteristics: it needs to be monophonic, have a sampling rate of 16 kHz, and each sample must be encoded in 16 bits, little endian. We use a small tool called SoX to achieve this [58].

$> sox "inputfile" -c 1 -r 16000 -b 16 --endian little "outputfile.wav"

This tool is also useful to cut long audio files into smaller chunks (an audio file length of around 30 minutes is preferable to create a good alignment).

6. The input text file that contains the text that needs to be aligned with the input audio file should preferably be in a simple text format, such as .txt. It does, however, need to be encoded in UTF-8. This is usually already the case, but it can easily be verified and applied in source code editors, such as Notepad++ [30]. This is needed to correctly interpret the special characters that might be present in the text, such as quotes, accented letters, etc.

7. If this is done, the alignment can be started with the following command:

$> java -Xmx1000m -jar <<jarfile>> -a "audio_input.wav" -t "text_input.txt" --config configFile="config_files/configAlignerEN.xml";outputFormat="SRT"

where <<jarfile>> is the path to the .jar file that contains the runnable program code. We use the -Xmx1000m setting to ensure our system has access to about a gigabyte of memory; more detailed information about Sphinx's memory use can be found in Section 5.2.2 of Chapter 5. The "audio_input.wav" and "text_input.txt" files contain the audio and text that need to be aligned with each other. The "config_files/configAlignerEN.xml" value refers to the configuration file that Sphinx needs to align English audio and text. This can be changed to the Dutch configuration file ("configAlignerNL.xml"). Both files are already included in our system, but as they are Sphinx-specific we opted to leave their reference in the command in case the user decides to use a different ASR plugin. If, however, it is decided to stick with the Sphinx plugin, the main class can easily be altered to make the command that runs the alignment task cleaner and shorter. The output format in the command is set to produce an .srt file, but an EPUB file can be created by setting the "outputFormat" value to "EPUB".

When running the application, it will first output the parameters received from the command line as they are added to the configuration (see Figure 4.6).
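The --config argument is a list of key=value pairs separated by semicolons. As a purely hypothetical sketch (the actual parsing code in MainProgram may differ), such a value could be turned into the HashMap that is passed to setConfiguration as follows, producing output in the style of Figure 4.6 below:

import java.util.HashMap;

// Hypothetical sketch of parsing a --config value such as
//   configFile="config_files/configAlignerNL.xml";outputFormat="SRT"
// into the HashMap passed to setConfiguration(...).
final class ConfigParsingSketch {

    static HashMap<String, String> parseConfig(String configArgument) {
        HashMap<String, String> configuration = new HashMap<String, String>();
        for (String pair : configArgument.split(";")) {
            String[] keyValue = pair.split("=", 2);
            String key = keyValue[0].trim();
            String value = keyValue[1].trim().replaceAll("^\"|\"$", ""); // strip surrounding quotes
            configuration.put(key, value);
            System.out.println("Adding configuration: key=" + key + ", value=" + value);
        }
        return configuration;
    }
}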
Adding configuration: key=inputAudioFile, value=IO_files/Ridder muis/ridder muis 0.wav
Adding configuration: key=inputTextFile, value=IO_files/Ridder muis/ridder muis input.txt
Adding configuration: key=configFile, value=config_files/configAlignerNL.xml
Adding configuration: key=outputFormat, value=SRT

Figure 4.6: Example of configuration output for the book "Ridder Muis"

Then it will output all remarks and warnings made by the ASR plugin (remember the WARNING log level explained in Section 3.3.1 and the monitors in Section 3.3.6). Lastly, it will output the names and paths of the specified text and audio input files, as well as the name and path of the output file. This provides the user with the possibility to easily check which files were used if the result is not as expected, or if one runs a whole batch of transcription commands in a row. An example of such an output can be found in Figure 4.7.

Used Files:
Input Text File: IO_files/Ridder muis/ridder muis input.txt
Input Audio File: IO_files/Ridder muis/ridder muis 0.wav
Output File: IO_files/Ridder muis/ridder muis 0.srt

Figure 4.7: Example of path and file names for the book "Ridder Muis"

Note that the name of the output file, independent of which output format was chosen, will be the same as the name of the audio file, with the proper output format extension.

4.5 Output Formats

After presenting our application in the sections above, the possible output formats are briefly described in this section. Our application supports two output formats, namely the .srt file format and the EPUB file format. As mentioned above, other formats can easily be added to the application.

4.5.1 .srt File Format

The .srt file format is a standard subtitle file format, and has a fairly straightforward form, as can be seen in Figure 4.8. The 'subtitles' are numbered separately, and contain the times corresponding to the start and the stop of the subtitle, separated by an --> arrow.

37
00:00:21,705 --> 00:00:22,160
chapter

38
00:00:24,690 --> 00:00:25,400
1

39
00:00:26,420 --> 00:00:27,230
loomings

40
00:00:29,220 --> 00:00:29,480
call

41
00:00:29,480 --> 00:00:29,630
me

42
00:00:29,630 --> 00:00:30,420
ishmael

Figure 4.8: Example of .srt file content; taken from the .srt output for "Moby Dick"

4.5.2 EPUB Format

The EPUB file format can be opened and viewed by a number of devices and tools, and provides a simple, widely used standard format for reading books. It is a container format that can contain audio and the corresponding speech-text alignment. An EPUB output file consists of two folders: the META-INF folder and the OEBPS folder. The META-INF folder contains an XML file, called "container.xml", which tells the reading device where to find the metadata information. The OEBPS (Open eBook Publication Structure) folder contains all the book's contents, e.g. text, audio, and alignment specifications. There is a "content.opf" file, which specifies all the metadata of the book, such as author, language, etc. There are also an .xhtml and a .smil file. The former contains the book's textual contents; the latter contains the alignment specifications between the audio file and the textual contents in the .xhtml file. Figure 4.9 shows part of the EPUB output file for the book "Moby Dick", displayed by the Readium application for the Chrome web browser.
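Before looking at the rendered example in Figure 4.9, the following simplified .smil fragment illustrates how such a media overlay links text fragments to audio clips. The file names and fragment identifiers are hypothetical, and the timings merely echo the "Muis" example used later in Section 5.2.1; this is an illustration of the EPUB 3 media-overlay structure rather than literal output of our application.

<smil xmlns="http://www.w3.org/ns/SMIL" xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">
  <body>
    <seq id="seq1" epub:textref="content.xhtml">
      <!-- one <par> per aligned word: a text fragment plus an audio clip -->
      <par id="word1">
        <text src="content.xhtml#w1"/>
        <audio src="audio/book.mp3" clipBegin="0:00:00.440" clipEnd="0:00:01.280"/>
      </par>
      <par id="word2">
        <text src="content.xhtml#w2"/>
        <audio src="audio/book.mp3" clipBegin="0:00:01.280" clipEnd="0:00:01.730"/>
      </par>
    </seq>
  </body>
</smil>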
Figure 4.9: Example part of an EPUB file, generated by our application

Chapter 5

Results and Evaluation

In this chapter we present the test results we achieved with the application described in the previous chapter. We first provide some more information about the test files we were able to use, in Section 5.1. We then, in Section 5.2, discuss the results we achieved when running our application on those test files, comparing them to the baseline alignment provided to us by the Playlane company, as described extensively in Section 5.1. In Sections 5.3, 5.4 and 5.6, we investigate the effects that the pronunciation dictionary, the accuracy of the input text, and the acoustic model, respectively, have on the alignment results. As explained in Section 5.7, during the writing of this dissertation, CMU released a new Sphinx version; we briefly discuss the results we achieved when using this newer, though unstable, version. We conclude this chapter with Section 5.8, in which we present some impressions we gained when manually checking the alignment results for English audio and text. The conclusions we draw from these test results, and our ideas for future work, are presented in the next chapter.

5.1 Test Files

We were provided with a number of books by the Playlane company, which we were able to use to verify our application. Each book contained a complete textual version of the spoken book (which we used as our text input file), an audio file containing the book read at normal pace and one read at a slow pace, a subtitle file containing word-per-word timings using labels for each audio file, and a "narration helper" file (as discussed in Section 4.3). They are all Dutch books. The subtitle files that were provided by the Playlane company were manually made by the employees of Playlane: they listen to the audio track and manually set the timings for each word.

To verify the accuracy of our application we needed books that already had a word-per-word transcription, so we could compare that transcription with the one we generated using the ASR plugin. We are aware that, due to human errors, the alignments provided by the Playlane company might not be perfect. However, they do provide a decent baseline to compare our achieved accuracy to, and are therefore regarded as the ground truth for our alignment. These books are listed below:

• Avontuur in de woestijn
• De jongen die wolf riep
• De muzikanten van Bremen
• De luie stoel
• Een hut in het bos
• Het voetbaltoneel
• Luna gaat op paardenkamp
• Pier op het feest
• Ridder muis
• Spik en Spek: Een lek in de boot

All these books are read by S. and V., who are both female and have Dutch as their native language. Only two books, namely "De luie stoel" and "Het voetbaltoneel", are read by S.; the others are read by V.

Figure 5.1: Chart containing the size of the input text file, and the length of the input audio files for both normal and slow pace

In Figure 5.1, the number of words for each book is shown, as well as the length of the audio files, both slow and normal pace versions, of each book.
Figure 5.2 shows the percentage of extra audio length for the slow pace audio files, compared to the normal pace files. For example, the audio file of the slow pace version of the book "De jongen die wolf riep" has more than tripled in size compared to the length of the normal pace version of the book. The value for "De jongen die wolf riep" in Figure 5.2 is over 300%; it is, coincidentally, also the book with the biggest audio file length difference between slow and normal pace.

Figure 5.2: Percentage of extra audio length for the books read at slow pace, compared to the normal pace audio file length

5.2 Results

5.2.1 Evaluation Metrics and Formulas

Note that when we, from here on, refer to the 'mean start and/or stop time difference', we mean the average time difference for each word's start and/or stop time between the manually transcribed .srt file provided by the Playlane company and the automatically generated output created by our application. Figure 5.3 shows an example of a word with its start and stop times. This means that the word "Muis" will start being highlighted when the audio file that is playing reaches 440 milliseconds, and it will stop being highlighted when the audio file reaches one second and 280 milliseconds. Thus its start time is 440 milliseconds and its stop time is one second and 280 milliseconds.

1
00:00:00,440 --> 00:00:01,280
Muis

Figure 5.3: Example of a word and its start and stop times, in .srt file format

1
00:00:00,357 --> 00:00:01,284
Muis

Figure 5.4: Example of a word and its start and stop times, in .srt file format

Say, for example, that the excerpt in Figure 5.3 is part of the alignment baseline provided by Playlane, and that the excerpt in Figure 5.4 is part of the alignment automatically generated by our application. To measure the average deviation of our automatically generated alignment compared to the Playlane alignment, we take the start time of the word "Muis" in Figure 5.3 and subtract the start time of the same word in Figure 5.4. We then take the absolute value of the result of this subtraction, which in our example would be 83 milliseconds, and sum these values over all words that appear in both the automatic alignment result and the Playlane alignment. Finally, we divide this sum by the number of words used for the summation, which gives us the average start time difference. The following formula is a mathematical representation of these calculations:

\mathit{meanStartTimeDifference} = \frac{\sum_{i=1}^{W} \left| \mathit{PlaylaneStartTime}(w_i) - \mathit{autoStartTime}(w_i) \right|}{W}

where W is the number of words that appear in both alignments. The same process is repeated to calculate the mean stop time difference.

5.2.2 Memory Usage and Processing Time

We will first show the memory usage and processing time needed by Sphinx to perform the alignment task. As can be seen in Table 5.1, the memory used to align a text can run to over half a gigabyte, which is why we provide our application with approximately a gigabyte of free memory. What is also notable is the amount of processing time Sphinx needs, as it never exceeds a tenth of the total audio length (x RT)1, see Figure 5.5. These memory usage and processing times were achieved on a computer running a 32-bit operating system (Windows 8.1 Pro) on an x64-based processor, with 4 gigabytes of RAM, of which 3.24 gigabytes were usable, and an Intel(R) Core(TM)2 Duo T8300 CPU with a clock rate of 2.40 GHz.

1 "RT" stands for real time, which refers to the total audio length. If, e.g., something took 0.05 x RT, it took five percent of the original audio length to perform the task.
book title | pace | memory usage (MB) | total audio length (ms) | processing time (ms) | speed (x RT)
Avontuur in de woestijn | normal | 629.92 | 69 008 | 3 436 | 0.05
Avontuur in de woestijn | slow | 485.32 | 121 900 | 6 560 | 0.05
De jongen die wolf riep | normal | 407.31 | 30 212 | 2 067 | 0.07
De jongen die wolf riep | slow | 652.41 | 100 858 | 5 726 | 0.06
De muzikanten van Bremen | normal | 504.74 | 29 586 | 1 899 | 0.06
De muzikanten van Bremen | slow | 443.77 | 75 898 | 5 044 | 0.07
Een hut in het bos | normal | 437.98 | 51 495 | 2 689 | 0.05
Een hut in het bos | slow | 649.06 | 126 312 | 6 914 | 0.05
Luna gaat op paardenkamp | normal | 469.10 | 80 559 | 3 867 | 0.05
Luna gaat op paardenkamp | slow | 544.66 | 113 051 | 4 972 | 0.04
Pier naar het feest | normal | 542.80 | 16 920 | 869 | 0.05
Pier naar het feest | slow | 421.15 | 25 383 | 1 478 | 0.06
Ridder Muis | normal | 689.61 | 161 045 | 7 795 | 0.05
Ridder Muis | slow | 621.18 | 225 933 | 13 438 | 0.06
Spik en Spek | normal | 569.48 | 29 479 | 1 619 | 0.05
Spik en Spek | slow | 587.93 | 42 890 | 2 353 | 0.05
De luie stoel | normal | 539.33 | 78 909 | 4 494 | 0.06
De luie stoel | slow | 519.35 | 146 694 | 7 344 | 0.05
Het voetbaltoneel | normal | 548.95 | 92 311 | 5 255 | 0.06
Het voetbaltoneel | slow | 533.85 | 164 832 | 8 318 | 0.05

Table 5.1: Memory usage, processing times and speed of Sphinx for several alignment tasks, on both normal and slow pace audio

Figure 5.5: Processing times for the automatic alignment performed on the normal pace books

5.2.3 First Results

We ran the application on both audio versions of each of the aforementioned books to create the automatically generated transcriptions, and then ran the TEST CONVERT and TEST commands to verify the transcription's similarity to the manually transcribed file; see Sections 4.3 and 4.4 for more information on the commands used. The difference between the transcriptions is measured in milliseconds, word per word. For each word, the difference between both transcriptions' start times and stop times is calculated separately, and we take the mean over all the words that appear in both files. We decided to calculate the average of start and stop times for each word separately when we discovered, after careful manual inspection of the very first results, that Sphinx-4 has the tendency to allow more pause at the front of a word than at the end. In other words, it has the tendency to start highlighting a word in the pause before it is spoken, but stops the highlighting of the word more neatly after it is said, see Section 5.5.

Figure 5.6 shows the average difference of the start and stop times for each word, for the books read at normal pace, between the files provided by Playlane and the automatically generated transcription provided by our application. The acceptable average time differences for normal pace audio are shown in Figure 5.7, together with the absolute maximum time difference. We consider all time differences of less than one second to be acceptable. The maximum time differences lie between one and ten seconds for all six books, and can be explained by the long pauses at the end and beginning of new paragraphs in the books. The books "De jongen die wolf riep", "De muzikanten van Bremen", "De luie stoel" and "Het voetbaltoneel" deviate too far from the manual transcription times to be usable for an application that generates automatic synchronization between audio files and text files. There are six out of eight books read by V. that have timings that are synchronized with, on average, less than one second of difference between the output from our system and the one provided by Playlane.
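As a concrete illustration of how these numbers are computed (cf. the formula in Section 5.2.1), the following minimal sketch assumes that the start times, in milliseconds, of the words occurring in both .srt files have already been extracted in the same order; it is not the actual TestAccuracy code.

// Minimal sketch of the evaluation metric from Section 5.2.1 (not the actual
// TestAccuracy implementation): mean absolute start time difference in ms.
static double meanStartTimeDifference(long[] playlaneStartTimes, long[] autoStartTimes) {
    long sum = 0;
    int w = playlaneStartTimes.length; // number of words present in both alignments
    for (int i = 0; i < w; i++) {
        sum += Math.abs(playlaneStartTimes[i] - autoStartTimes[i]);
    }
    return (double) sum / w;
}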
We take a closer look at each word's time deviation in Section 5.5 for one of these books, namely "Ridder Muis". There is no apparent reason why the other two books have a worse alignment accuracy; there are no general differences between those two books and the other six. We even took a look at the original audio file format, which we converted to .wav for Sphinx, in case this had any influence on the alignment. Table 5.2 shows the original audio file formats for each book.

Figure 5.6: The mean start and stop time differences between the automatically generated alignment and the Playlane timings, for the books read at normal pace

Figure 5.7: The acceptable mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at normal pace

There appears to be no link between whether the input file was originally an .mp3 or a .wav file and the accuracy of the alignment results. There also appears to be no way to determine in advance whether, when running the alignment task for a specific book, Sphinx-4 will return acceptable alignment results. However, we output the number of missing words at the end, i.e. the words at the end of the input text that Sphinx did not include in the alignment, and there appears to be a correlation between this number and the alignment accuracy, see Table 5.2. This might be useful to decide whether the alignment result will be accurate enough to be considered for use.

book title | audio file format | missing words at end | total words in text
Avontuur in de woestijn | mp3 | 0 | 1254
De jongen die wolf riep | wav | 78 | 565
De muzikanten van Bremen | wav | 216 | 594
De luie stoel | wav | 392 | 1131
Een hut in het bos | wav | 0 | 961
Het voetbaltoneel | mp3 | 312 | 1286
Luna gaat op paardenkamp | mp3 | 0 | 1853
Pier naar het feest | mp3 | 0 | 243
Ridder Muis | wav | 0 | 2520
Spik en Spek: Een lek in de boot | wav | 0 | 449

Table 5.2: Table showing properties that might help determine whether the alignment result will be accurate

The two books read by S. have the highest average start and stop time difference, which is why we decided to train the acoustic model more on her voice. More explanation on this training, and the results accomplished with it, can be found in Section 5.6.

The results we got for the slow readings of the books are disastrous, see Figure 5.8. Apart from the book "Luna gaat op paardenkamp", all have an average start and stop time deviation of at least 20 seconds. There seems to be no obvious cause why one book has a larger average time deviation than another, as can be seen in Figures B.1 and B.2 in Appendix B. It appears these bad results can only be explained by Sphinx's apparent difficulty with handling pauses in audio, especially long pauses.

Figure 5.8: The mean start and stop time differences between the automatically generated alignment and the Playlane timings, for the books read at slow pace

We considered that perhaps the reason that the book "Luna gaat op paardenkamp" performs so well in its slowly read version is that the slow version is only 140% longer than the normal pace version, as seen in Figure 5.2.
However, the book "Ridder Muis" has the same percentage of audio file length difference as "Luna gaat op paardenkamp", which would mean that "Ridder Muis" ought to give acceptable results as well, which is obviously not the case.

5.3 Meddling with the Pronunciation Dictionaries

The first idea we had that might increase the accuracy of the synchronisation timings was to add words from the books that are missing in the pronunciation dictionary. Sphinx helps us by warning us when this occurs; e.g. when running the application for the book "De jongen die wolf riep", we got the following warning as part of the command line output, see Figure 5.9.

18:44:22.254 WARNING dictionary (AllWordDictionary) Missing word: slaapschaap

Figure 5.9: Example of a warning for a missing word

Table 5.3 shows which books had missing words, which words those were, and how often they appeared in the input text of the book.

book title | #words missing | missing word(s) | #times the word appears | total #words in book
Avontuur in de woestijn | 1 | zorahs | 5 | 1254
De jongen die wolf riep | 1 | slaapschaap | 1 | 565
De muzikanten van Bremen | 1 | kuuuuuu | 1 | 594
Luna gaat op paardenkamp | 1 | controle | 1 | 1853
De luie stoel | 1 | babien | 39 | 1131
Het voetbaltoneel | 2 | fwiet, tuinhok | 2 | 1286

Table 5.3: Table containing information on words missing from the pronunciation dictionary

Figure 5.10: The mean start time differences between the automatically generated alignment using the dictionaries with words missing, and using the dictionaries with the missing words added, for books read at slow and normal pace

Figure 5.10 shows the difference, in milliseconds, between the mean start time deviation when using the dictionary with missing words (the dictionary we used for the previous tests) and when using the dictionary to which we added the missing words from Table 5.3 and their pronunciations. We show this difference for the mean start times of both the audio read at normal pace and the audio read at slow pace. The mean stop time difference is nearly the same as the mean start time difference, and thus adds no extra value to the graph.

It is clear from Figure 5.10 that when a word is missing from the pronunciation dictionary, adding this word does provide a better synchronisation, especially when that word is used often throughout the book's text. As can be seen, the mean start time difference for the book "De luie stoel", which had the highest number of appearances of its missing word, has improved by almost five seconds for the normal pace version and by over 20 seconds for the slow pace version of the read book.

As a reference, we also performed this test on the book "Ridder Muis", by deleting the word "muis" from the pronunciation dictionary and comparing the timing results. The word "muis" appears 123 times in the book contents, which contain 2501 words in total. The average start and stop time differences can be seen in Figure 5.11, as well as the maximum time difference between the automatically generated synchronisations and the PlayLane timings.

Figure 5.11: The mean start and stop, and maximum time differences between the automatically generated alignment and the PlayLane timings for the normal pace "Ridder Muis" book, using a dictionary that is missing the word "muis"
As you can see from both previous results (Figures 5.10 and 5.11), it is important to add missing words to the pronunciation dictionary, especially if the missing word has a high presence in the book's text.

5.4 Accuracy of the Input Text

After running the tests to verify the increased accuracy achieved by adding missing words to the pronunciation dictionary, we thought it might be interesting to know how well our ASR system performs when there are words missing from the input text. We therefore decided to remove the word "muis" from the "Ridder Muis" input text. The results can be found in Figure 5.12, and clearly show that the most accurate synchronisation results are obtained when the input text file represents the actually spoken text as well as possible.

Figure 5.12: The mean start and stop, and maximum time differences between the automatically generated alignment and the PlayLane timings for the normal pace "Ridder Muis" book, with the word "muis" missing from the input text

Comparing Figures 5.11 and 5.12 shows us that the correctness of the input text has a higher influence on the accuracy of the synchronisation result than the content of the pronunciation dictionary: using a lacking dictionary gives a mean start time difference of around 500 milliseconds, while using a lacking input text gives a mean start time difference of around 7000 milliseconds. It is thus more important to make sure to pass on an accurate input text than to make sure the pronunciation dictionary contains all the words in that input text.

5.5 A Detailed Word-per-Word Time Analysis

We now take a closer look at the synchronisation results for a specific book, namely "Ridder Muis". Figure 5.13 shows the start and stop time deviation for each word separately. Negative values mean that the start or stop time in our application's result comes earlier than the time for the same word in the Playlane file. Positive values indicate that the time from our application was later than the time for the same word in the Playlane synchronisation. The horizontal axis in the figure has one word as its basic unit; thus, every value on the horizontal axis represents the timing deviation for one word.

Figure 5.13: Time deviations for each word between the automatically generated alignment and the Playlane timings, for the normal pace "Ridder Muis" book

As can be seen from Figure 5.13, there is a much higher proportion of negative values than positive values. This indicates Sphinx's preference to start a word in the pause before the word is actually spoken. This preference can also be noted from the fact that the dark line (indicating the deviation of the start times) is overall more prominent on the graph, indicating generally higher values than the stop time deviations. There are a number of outliers, in the negative as well as the positive range, which can usually be ascribed to errors in the input file. The first outlier, in the positive deviation range, happens at the text "Maar dan..." in the book contents. We cannot readily explain why Sphinx struggles to align these words, except that both words are pronounced very slowly and with a high level of anticipation.
The first negative outlier happens on 73 the text “Draak slaakt een diepe zucht.”, where Sphinx quickly passes over the first four words and pauses on “zucht” until that word is said, for no apparent reason. The biggest outlier happens on the text “Het geluid komt uit haar eigen buik.”, and we believe it is caused by the long (3490 milliseconds) pause between the previous sentence and this one. But as can be seen from the figure, the alignment always rectifies itself after each outlier, and goes back to, on average, one second of time difference. 5.6 Training the Acoustic Model 5.6.1 How to Train an Acoustic Model To adapt an acoustic model by training it on certain data, all one needs is some audio data and the corresponding transcriptions. The user will want to divide the audio data in sentence segments, e.g. using the tool SoX [58] we mentioned before, or the Audacity software [3] which easily allows the user to separate an audio file in several smaller parts. Due to its visualisation of the audio, it is easy to see when sentences start or stop. For more information about the necessary characteristics of the .wav files and the mentioned tools we used, see Appendix D. These smaller audio files should have the following names: <<filename>>0xx.wav; where <<filename>> refers to, for example, the original audio or book title, to easily group and recognize which files contain which data, and the 0xx part should be different for each file. Each audio file must have a unique file name. Next, we created a <<filename>>.fileids file, which contains the name of each audio file we wanted to use for the training (without its extension), see Figure 5.14. <<filename>>_001 <<filename>>_002 <<filename>>_003 <<filename>>_004 <<filename>>_0xx Figure 5.14: Example contents of a .fileids file The final file we needed, to perform acoustic model training, is the <<filename>>.transcription file, which contains the transcription of each audio file mentioned in the .fileids file. Figure 5.15 shows an example content of such a file. The transcription of each audio fragment must be inserted between <s> and </s>, followed by the file name of the corresponding audio fragment between parentheses. 74 <s> <s> <s> <s> <s> tot het weer ruzie is </s> (<<filename>>_001) die zijn weer vrienden </s> (<<filename>>_002) die is dan weer weg </s> (<<filename>>_003) en met de kras </s> (<<filename>>_004) die is dan weer dun </s> (<<filename>>_0xx) Figure 5.15: Example contents of a .transcription file After the creation of these files, training the acoustic model can be done by following the steps below. 1. We created a new folder to insert all the files, e.g. “adapted model”. 2. We created a new folder, called “bin”, inside folder “adapted model” and inserted the binary files of Sphinxbase [12] and SphinxTrain [13] in this “bin” folder. 3. We created another folder inside the “adapted model” folder, called “original”. We inserted the acoustic model we wished to alter, i.e. the ‘original’ acoustic model. (The pronunciation dictionary need not be added here.) 4. We then created another folder inside the “adapted model” folder, called “adapted”. We inserted the original acoustic model in this folder as well. This folder will ultimately contain the adapted acoustic model. 5. 
Then we inserted the following files into the "adapted model" folder:
• <<filename>>.fileids;
• <<filename>>.transcription;
• the several .wav files; and
• the pronunciation dictionary (e.g., the "celex.dic" file, which is the pronunciation dictionary we use for our application).
Note that for the pronunciation dictionary to be usable for training the acoustic model, it only needs to contain all the words in the transcription file.

6. The internal folder structure should now correspond to Figure 5.16.

We could then start with the actual adaptation of the acoustic model. For this, we needed to open a command prompt in the "adapted model" folder.

1. We first needed to create the feature files (.mfc files) from the .wav files, using the following command:

$> bin/sphinx_fe -argfile original/feat.params -samprate 16000 -c <<filename>>.fileids -di . -do . -ei wav -eo mfc -mswav yes

Now, for each .wav file, there should be a corresponding .mfc file in the "adapted model" folder.

2. The next step is to create some statistics for the adaptation of the acoustic model. We used the tool bw and ran the following command (the options are written on separate lines for readability purposes only):

$> bin/bw
 -hmmdir original
 -moddeffn original/mdef
 -ts2cbfn .cont.
 -feat 1s_c_d_dd
 -cmn current
 -agc none
 -dictfn <<celex.dic>>
 -ctlfn <<filename>>.fileids
 -lsnfn <<filename>>.transcription
 -accumdir .

We ran this command with the "celex.dic" pronunciation dictionary, but any dictionary can be used. The dictionary that is needed depends on the language of the acoustic model and the data it will be trained on.

3. Next, we needed to perform a maximum-likelihood linear regression (MLLR) transformation. This is a small adaptation of the acoustic model, and is needed when the amount of available data is limited.

$> bin/mllr_solve
 -meanfn original/means
 -varfn original/variances
 -outmllrfn mllr_matrix
 -accumdir .

4. Then we needed to update the acoustic model, using maximum a posteriori (MAP) adaptation:

$> bin/map_adapt
 -meanfn original/means
 -varfn original/variances
 -mixwfn original/mixture_weights
 -tmatfn original/transition_matrices
 -accumdir .
 -mapmeanfn adapted/means
 -mapvarfn adapted/variances
 -mapmixwfn adapted/mixture_weights
 -maptmatfn adapted/transition_matrices

5. The "adapted" folder will now contain the adapted acoustic model. This can be verified by checking the modification date of the "means", "variances", "mixture_weights" and "transition_matrices" files.

For more information on how to adapt an acoustic model, see [60].

|--adapted_model
|  |--adapted
|  |  |--feat.params
|  |  |--mdef
|  |  |--means
|  |  |--mixture_weights
|  |  |--noisedict
|  |  |--transition_matrices
|  |  |--variances
|  |--bin
|  |  |-- .exe files
|  |  |-- .dll files
|  |--original
|  |  |--feat.params
|  |  |--mdef
|  |  |--means
|  |  |--mixture_weights
|  |  |--noisedict
|  |  |--transition_matrices
|  |  |--variances
|  |-- pronunciation file, e.g. celex.dic
|  |--<<filename>>_0xx.wav files
|  |--<<filename>>.fileids
|  |--<<filename>>.transcription

Figure 5.16: Example structure of the "adapted model" folder

5.6.2 Results with Different Acoustic Models

As mentioned before, we wanted to train our acoustic model on S.'s voice, as the books read by her seemed to get the worst results in the alignment task. We trained the acoustic model on a book called "Wolf heeft jeuk", which is also read by S. but is not part of the test data. We trained it on the last chapter of the book, which contained 13 sentences with a total of 66 words, covering 24 seconds of audio.
We also trained the original acoustic model on a part of the book "De luie stoel": we used 23 sentences with a total of 95 words, covering 29 seconds of audio from this book. The alignment results we achieved when using the trained acoustic models can be seen in Figure 5.17 for the books read at normal pace, and in Figure 5.18 for the slow pace.

Figure 5.17: Mean start time difference of each normal pace book, using the original acoustic model, the acoustic model trained on "Wolf heeft jeuk", or the acoustic model trained on "De luie stoel"

Figure 5.17 shows some definite improvements for both books read by S., as we expected, though the mean time difference is still almost 20 seconds for "De luie stoel" and over 40 seconds for "Het voetbaltoneel" for the alignment results we achieved using the "Wolf heeft jeuk" acoustic model. The average time difference for the book "De luie stoel" is only around 300 milliseconds when we perform the alignment task using the acoustic model trained on "De luie stoel"; the alignment for "Het voetbaltoneel" has also improved in comparison to the model trained on "Wolf heeft jeuk", with only around 30 seconds of average time difference instead of 40. This means the alignment results with the acoustic model trained on "Wolf heeft jeuk" are still not acceptable, but they are promising should the acoustic model be trained further on that book. The alignment results achieved by using the acoustic model trained on the book "De luie stoel" are near perfect for that book, and also provide an improvement for the book "Het voetbaltoneel". This leads us to believe that further training an acoustic model on S.'s voice will achieve much improved alignment results for books read by S.

Of the books that are read by V., some have around the same accuracy with the newly trained acoustic models as with the old one, others have a better accuracy. But, as four out of eight books have a worse accuracy, we can conclude that in general the acoustic model trained on S.'s voice has a bad influence on the accuracy of books read by V.

Figure 5.18: Mean start time difference of each slow pace book, using the original acoustic model, the acoustic model trained on "Wolf heeft jeuk", or the acoustic model trained on "De luie stoel"

As can be seen from Figure 5.18, the newly trained acoustic models give even worse alignment accuracy for the books read at slow pace. For example, the book "Ridder Muis" has a mean time difference of 450 seconds (7.5 minutes) with the trained model. Considering the results we achieved on the books read by S. at normal pace by training the acoustic model on her voice, it might be a good idea to train an acoustic model on one of these slow pace versions of the audio books, and see if Sphinx can then better recognize pauses between words.

5.7 The Sphinx-4.5 ASR Plugin

During the writing of this dissertation, CMU released a new update to their Sphinx project. This Sphinx-4.5 was released in February and is a pre-alpha release, but we decided to have a look at its alignment abilities for future reference.
The results we achieved can be found in Figures 5.19 and 5.20 for the books read at normal pace, and in Figure 5.21 for the books read at slow pace.

Figure 5.19: The mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at normal pace

Figure 5.20: The acceptable mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at normal pace

Figure 5.21: The mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at slow pace

The configuration settings can be found in Appendix C. However, as Sphinx-4.5 internally uses these values as well, the configuration file does not need to be added to the command line when running the application with this plugin. The paths to the acoustic model and the pronunciation dictionary do need to be specified in the command.

As can be seen in Figure 5.19, the pre-alpha release achieves worse accuracy than the Sphinx-4 plugin on four books, namely "De jongen die wolf riep" and "De muzikanten van Bremen" (both read by V.), and "De luie stoel" and "Het voetbaltoneel" (read by S.). The other six books, which already had a good transcription accuracy with the Sphinx-4 plugin, now have an even better accuracy, as seen in Figure 5.20. What we can conclude from Figure 5.21 is that the Sphinx-4.5 ASR plugin provides an overall better alignment accuracy than the Sphinx-4 plugin for the books that are read at a slow pace. For five books the mean time difference is even less than one second.

5.7.1 Sphinx-4.5 with Different Acoustic Models

We now also compare the results for the alignment task when we use the acoustic models we trained in the previous section. The results can be found in Figures 5.22, 5.23 and 5.24.

Figure 5.22: The mean start time differences for the Sphinx-4.5 plugin, for the books read at normal pace, using the three different acoustic models

Figure 5.23: The mean start time differences for the Sphinx-4.5 plugin, for the six books read at normal pace that usually achieve acceptable accuracy, using the three different acoustic models

As can be seen from Figures 5.22 and 5.23, Sphinx-4.5 also achieves better alignment accuracy with the trained acoustic models for books read by S. Moreover, for the books shown in Figure 5.23, Sphinx-4.5 performs nearly as well with the acoustic models trained on S.'s voice as with the original acoustic model (disregarding the outlier for "Luna gaat op paardenkamp" for the acoustic model trained on "Wolf heeft jeuk").
Figure 5.24: The mean start time differences for the Sphinx-4.5 plugin, for the books read at slow pace, using the three different acoustic models

The effects the trained acoustic models had on the books read at a slow pace can be seen in Figure 5.24, and are very irregular. For some books a trained acoustic model performs better; for other books it performs worse than the original acoustic model. In general, we can say that the acoustic model trained on "Wolf heeft jeuk" performs worst of all three models. There appear to be no similarities between the alignment accuracy results for the Sphinx-4.5 plugin and the Sphinx-4 plugin (see Figure 5.18).

5.8 Alignment Results for English Text and Audio

We have also tried our application and ASR plugin on English text and audio, as English is a more researched language, due to its higher presence in audio and text. It is difficult to find a word-per-word alignment baseline to compare our results against, since these are often created manually, which is highly labour- and time-intensive, and they are therefore not made freely available. However, we were able to perform the alignment task on a number of English books, such as "The curious case of Benjamin Button" by F. Scott Fitzgerald and "Moby Dick" by Herman Melville. Both books are in the public domain, as their copyright has expired. When taking a look at the generated EPUB files, we concluded that, for both British and American voices, the alignment results are near perfect. These generated EPUB files can be found online.2

2 http://1drv.ms/1k0f258

Chapter 6

Conclusions and Future Work

The goal of this dissertation was to investigate whether performing an alignment task automatically, instead of manually, lies within the realm of the possible. To this end, we created a software application that provides its user with the option to simply switch between different ASR systems, via the use of plugins. We provide extra flexibility for our application by offering two different output formats (a general subtitle file and an EPUB file), and by making the creation of a new output format as simple as possible. To support speech-text alignment in EPUB format, we extended the existing EPUB library with a media-overlay option.

From the results in the previous chapter, using the ASR plugin CMU Sphinx, we conclude that it is indeed possible to automatically generate an alignment of audio and text that is accurate enough for use (e.g., our test results have on average less than one second of difference between the automatic alignment results and a pre-existing baseline). However, there is still work to be done, especially for undersourced languages such as Dutch.

We achieved positive results when training the acoustic model on (less than 60 seconds of) audio data that corresponded to the person or type of book we wanted to increase alignment accuracy for. Our first suggestion for future work is therefore to further train the acoustic model for Dutch, especially when one has a clearly defined type of alignment task to perform. For example, the Playlane company has a set of around 20 voice actors they work with.
Based on the results we achieved when training the acoustic model, we believe that training an acoustic model for each of these actors would greatly increase the accuracy of an alignment task on audio spoken by one of these actors, provided the corresponding acoustic model is used. Considering that it can take days to manually align an audiobook, the small effort of training an acoustic model appears highly beneficial, given the time gained by automatically generating an accurate alignment. A trained model could also achieve accurate results on multiple books, meaning that it is not necessary to train an acoustic model for every new alignment task.

As can clearly be concluded from the alignment results for the books read at a slow pace, Sphinx has a number of definite issues with aligning pauses and silences. Sphinx might, for example, claim that a word starts in the pause before it is actually spoken, or fail to recognise small pauses between words. It can also misrecognise pauses as longer or shorter than they actually are. We therefore propose a further examination of why Sphinx experiences these problems. It is highly likely that the data originally used to build and train the acoustic model consisted mainly of adults' speech, which tends to be fast-paced. It might therefore be possible to alleviate these difficulties as well, by training the acoustic model on audio containing a large amount of silence and pauses. It is, however, also possible that the problem occurs in the front-end processing of the audio data; a closer look at how Sphinx operates might help discover why silences pose such a big issue.

We mentioned before, in Section 5.2.3, that the number of words at the end of the input file that Sphinx did not include in the alignment provides a fair indication of the alignment's accuracy. It might be interesting to investigate whether this is caused by a single word that Sphinx has difficulty aligning, which then causes the following words to be misaligned in turn, or whether words drift progressively further out of alignment until there is no more audio left to align while some input text remains unaligned. We also note that the accuracy of the input text and the coverage of the pronunciation dictionary strongly influence the accuracy of the alignment output. From our tests, we conclude that it is best not to have words missing from the input text or the pronunciation dictionary.

There is a clear need for a more robust system, with fewer unexplained outlying results. We propose to increase the robustness of our application by comparing the alignment results created by two or more different ASR plugins. Results that overlap, within a certain error range, can be considered 'correct'. This approach is based on the one followed in [19]; a minimal sketch of this idea is given at the end of this chapter.

It is our belief that the system we designed provides a flexible approach to speech-text alignment and, as it can be adapted to the user's preferred ASR system, might benefit users who previously performed the alignment task manually.
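The following listing sketches the agreement check proposed above: word start times produced by two different ASR plugins are compared, and only timings that agree within a configurable tolerance are accepted as 'correct'. It is a minimal sketch under the assumption that both plugins return one start time per word of the same input text; all class and method names are invented for this example and are not part of our application.

import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

public final class PluginAgreement {

    // Keeps only the word start times on which two ASR plugins agree within
    // the given tolerance; disagreeing words are left empty so that they can
    // be inspected manually.
    public static List<OptionalLong> agreedStartTimesMs(List<Long> pluginA,
                                                        List<Long> pluginB,
                                                        long toleranceMs) {
        List<OptionalLong> agreed = new ArrayList<>();
        for (int i = 0; i < Math.min(pluginA.size(), pluginB.size()); i++) {
            long a = pluginA.get(i);
            long b = pluginB.get(i);
            agreed.add(Math.abs(a - b) <= toleranceMs
                    ? OptionalLong.of((a + b) / 2)   // accept the averaged timing
                    : OptionalLong.empty());         // plugins disagree on this word
        }
        return agreed;
    }

    public static void main(String[] args) {
        // Hypothetical start times (in ms) from two plugins for the same four words.
        List<Long> sphinx4 = List.of(100L, 950L, 1800L, 9000L);
        List<Long> sphinx45 = List.of(120L, 930L, 1750L, 2600L);
        System.out.println(agreedStartTimesMs(sphinx4, sphinx45, 250));
    }
}

Words for which the plugins disagree could then be flagged for manual inspection, or re-aligned with a third plugin.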
Appendix A
Configuration File Used for Recognizing Dutch

<?xml version="1.0" encoding="UTF-8"?>
<config>

<!-- ************************************************** -->
<!-- Global Properties                                   -->
<!-- ************************************************** -->
<property name="logLevel" value="WARNING"/>
<property name="absoluteBeamWidth" value="-1"/>
<property name="relativeBeamWidth" value="1E-300"/>
<property name="wordInsertionProbability" value="1.0"/>
<property name="languageWeight" value="10"/>
<property name="addOOVBranch" value="true"/>
<property name="frontend" value="epFrontEnd"/>
<property name="recognizer" value="recognizer"/>
<property name="showCreations" value="false"/>
<property name="outOfGrammarProbability" value="1E-26"/>
<property name="phoneInsertionProbability" value="1E-140"/>

<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
    <property name="decoder" value="decoder"/>
    <propertylist name="monitors">
        <item>accuracyTracker</item>
        <item>speedTracker</item>
        <item>memoryTracker</item>
    </propertylist>
</component>

<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
    <property name="searchManager" value="searchManager"/>
</component>

<component name="searchManager" type="edu.cmu.sphinx.decoder.search.AlignerSearchManager">
    <property name="logMath" value="logMath"/>
    <property name="linguist" value="aflatLinguist"/>
    <property name="pruner" value="trivialPruner"/>
    <property name="scorer" value="threadedScorer"/>
    <property name="activeListFactory" value="activeList"/>
</component>

<component name="activeList" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="logMath" value="logMath"/>
    <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
    <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>

<component name="trivialPruner" type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>

<component name="threadedScorer" type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
    <property name="frontend" value="${frontend}"/>
</component>

<component name="aflatLinguist" type="edu.cmu.sphinx.linguist.aflat.AFlatLinguist">
    <property name="logMath" value="logMath"/>
    <property name="grammar" value="AlignerGrammar"/>
    <property name="acousticModel" value="wsj"/>
    <property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
    <property name="languageWeight" value="${languageWeight}"/>
    <property name="unitManager" value="unitManager"/>
    <property name="addOutOfGrammarBranch" value="${addOOVBranch}"/>
    <property name="phoneLoopAcousticModel" value="WSJ"/>
    <property name="outOfGrammarProbability" value="${outOfGrammarProbability}"/>
    <property name="phoneInsertionProbability" value="${phoneInsertionProbability}"/>
    <property name="dumpGStates" value="true"/>
</component>

<component name="AlignerGrammar" type="edu.cmu.sphinx.linguist.language.grammar.AlignerGrammar">
    <property name="dictionary" value="dictionary"/>
    <property name="logMath" value="logMath"/>
    <property name="addSilenceWords" value="true"/>
    <property name="allowLoopsAndBackwardJumps" value="allowLoopsAndBackwardJumps"/>
    <property name="selfLoopProbability" value="selfLoopProbability"/>
    <property name="backwardTransitionProbability" value="backwardTransitionProbability"/>
</component>

<!-- ******************* -->
<!-- DICTIONARY SETTINGS -->
<!-- ******************* -->
<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.AllWordDictionary">
    <property name="dictionaryPath" value="resource:/nl/dict/celex.dic"/>
    <property name="fillerPath" value="resource:/nl/noisedict"/>
    <property name="dictionaryLanguage" value="NL"/>
    <property name="addSilEndingPronunciation" value="true"/>
    <property name="wordReplacement" value="&lt;sil&gt;"/>
    <property name="unitManager" value="unitManager"/>
</component>

<component name="wsj" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
    <property name="loader" value="wsjLoader"/>
    <property name="unitManager" value="unitManager"/>
</component>

<component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="unitManager"/>
    <property name="location" value="resource:/nl"/>
</component>

<component name="unitManager" type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>

<!-- additions start -->
<component name="WSJ" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
    <property name="loader" value="WSJLOADER"/>
    <property name="unitManager" value="UNITMANAGER"/>
</component>

<component name="WSJLOADER" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="UNITMANAGER"/>
    <property name="location" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz"/>
</component>

<component name="UNITMANAGER" type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>
<!-- additions end -->

<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>audioFileDataSource</item>
        <item>dataBlocker</item>
        <item>preemphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>melFilterBank</item>
        <item>dct</item>
        <item>liveCMN</item>
        <item>featureExtraction</item>
    </propertylist>
</component>

<component name="audioFileDataSource" type="edu.cmu.sphinx.frontend.util.AudioFileDataSource"/>
<component name="dataBlocker" type="edu.cmu.sphinx.frontend.DataBlocker"/>
<component name="speechClassifier" type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier"/>
<component name="nonSpeechDataFilter" type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>
<component name="speechMarker" type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker"/>
<component name="preemphasizer" type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower" type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower"/>
<component name="fft" type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>
<component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank"/>
<component name="dct" type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>
<component name="liveCMN" type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>
<component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>

<component name="logMath" type="edu.cmu.sphinx.util.LogMath">
    <property name="logBase" value="1.0001"/>
    <property name="useAddTable" value="true"/>
</component>

<!-- ******************************************************* -->
<!-- monitors                                                 -->
<!-- ******************************************************* -->
<component name="accuracyTracker" type="edu.cmu.sphinx.instrumentation.BestPathAccuracyTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="showAlignedResults" value="false"/>
    <property name="showRawResults" value="false"/>
</component>
<component name="memoryTracker" type="edu.cmu.sphinx.instrumentation.MemoryTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="showSummary" value="true"/>
    <property name="showDetails" value="false"/>
</component>

<component name="speedTracker" type="edu.cmu.sphinx.instrumentation.SpeedTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="frontend" value="${frontend}"/>
    <property name="showSummary" value="true"/>
    <property name="showDetails" value="false"/>
</component>

</config>

Appendix B
Slow Pace Audio Alignment Results

As explained in Section 5.2.3, there appears to be no obvious cause why one book has a larger average time difference than another. To visualise this, we have sorted the alignment results according to several criteria. Figure B.1 shows the alignment accuracies for the books, sorted from smallest to largest input text size.

Figure B.1: The mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at slow pace (values in milliseconds)

Figure B.2 shows the alignment accuracies for each book, sorted from shortest to longest audio length.

Figure B.2: The mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at slow pace (values in milliseconds)

As the accuracy results show no pattern in either figure, we conclude that, due to the long pauses in the audio files, Sphinx-4 cannot provide accurate alignment results.

Appendix C
Configuration File Used by Sphinx-4.5 for Dutch Audio

<?xml version="1.0" encoding="UTF-8"?>
<config>

<property name="logLevel" value="WARNING"/>

<property name="absoluteBeamWidth" value="50000"/>
<property name="relativeBeamWidth" value="1e-80"/>
<property name="absoluteWordBeamWidth" value="1000"/>
<property name="relativeWordBeamWidth" value="1e-60"/>

<property name="wordInsertionProbability" value="0.1"/>
<property name="silenceInsertionProbability" value="0.1"/>
<property name="fillerInsertionProbability" value="1e-2"/>
<property name="languageWeight" value="12.0"/>

<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
    <property name="decoder" value="decoder"/>
</component>

<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
    <property name="searchManager" value="wordPruningSearchManager"/>
</component>

<component name="simpleSearchManager" type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
    <property name="linguist" value="flatLinguist"/>
    <property name="pruner" value="trivialPruner"/>
    <property name="scorer" value="threadedScorer"/>
    <property name="activeListFactory" value="activeList"/>
</component>

<component name="wordPruningSearchManager" type="edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager">
    <property name="linguist" value="lexTreeLinguist"/>
    <property name="pruner" value="trivialPruner"/>
    <property name="scorer" value="threadedScorer"/>
    <property name="activeListManager" value="activeListManager"/>
    <property name="growSkipInterval" value="0"/>
    <property name="buildWordLattice" value="true"/>
    <property name="keepAllTokens" value="true"/>
    <property name="acousticLookaheadFrames" value="1.7"/>
    <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>
<component name="activeList" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
    <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>

<component name="activeListManager" type="edu.cmu.sphinx.decoder.search.SimpleActiveListManager">
    <propertylist name="activeListFactories">
        <item>standardActiveListFactory</item>
        <item>wordActiveListFactory</item>
        <item>wordActiveListFactory</item>
        <item>standardActiveListFactory</item>
        <item>standardActiveListFactory</item>
        <item>standardActiveListFactory</item>
    </propertylist>
</component>

<component name="standardActiveListFactory" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
    <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>

<component name="wordActiveListFactory" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="absoluteBeamWidth" value="${absoluteWordBeamWidth}"/>
    <property name="relativeBeamWidth" value="${relativeWordBeamWidth}"/>
</component>

<component name="trivialPruner" type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>

<component name="threadedScorer" type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
    <property name="frontend" value="liveFrontEnd"/>
</component>

<component name="flatLinguist" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
    <property name="grammar" value="jsgfGrammar"/>
    <property name="acousticModel" value="acousticModel"/>
    <property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
    <property name="silenceInsertionProbability" value="${silenceInsertionProbability}"/>
    <property name="languageWeight" value="${languageWeight}"/>
    <property name="unitManager" value="unitManager"/>
</component>

<component name="lexTreeLinguist" type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
    <property name="acousticModel" value="acousticModel"/>
    <property name="languageModel" value="simpleNGramModel"/>
    <property name="dictionary" value="dictionary"/>
    <property name="addFillerWords" value="true"/>
    <property name="generateUnitStates" value="false"/>
    <property name="wantUnigramSmear" value="true"/>
    <property name="unigramSmearWeight" value="1"/>
    <property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
    <property name="silenceInsertionProbability" value="${silenceInsertionProbability}"/>
    <property name="fillerInsertionProbability" value="${fillerInsertionProbability}"/>
    <property name="languageWeight" value="${languageWeight}"/>
    <property name="unitManager" value="unitManager"/>
</component>

<component name="simpleNGramModel" type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
    <property name="location" value=""/>
    <property name="dictionary" value="dictionary"/>
    <property name="maxDepth" value="3"/>
    <property name="unigramWeight" value=".7"/>
</component>

<component name="largeTrigramModel" type="edu.cmu.sphinx.linguist.language.ngram.large.LargeTrigramModel">
    <property name="location" value=""/>
    <property name="unigramWeight" value=".5"/>
    <property name="maxDepth" value="3"/>
    <property name="dictionary" value="dictionary"/>
</component>

<component name="alignerGrammar" type="edu.cmu.sphinx.linguist.language.grammar.AlignerGrammar">
    <property name="dictionary" value="dictionary"/>
    <property name="addSilenceWords" value="true"/>
</component>
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsgf.JSGFGrammar">
    <property name="dictionary" value="dictionary"/>
    <property name="grammarLocation" value=""/>
    <property name="grammarName" value=""/>
    <property name="addSilenceWords" value="true"/>
</component>

<component name="grXmlGrammar" type="edu.cmu.sphinx.jsgf.GrXMLGrammar">
    <property name="dictionary" value="dictionary"/>
    <property name="grammarLocation" value=""/>
    <property name="grammarName" value=""/>
    <property name="addSilenceWords" value="true"/>
</component>

<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <property name="dictionaryPath" value="file:models/nl/dict/celex.dic"/>
    <property name="fillerPath" value="file:models/nl/noisedict"/>
    <property name="addSilEndingPronunciation" value="false"/>
    <property name="allowMissingWords" value="false"/>
    <property name="unitManager" value="unitManager"/>
</component>

<component name="acousticModel" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
    <property name="loader" value="acousticModelLoader"/>
    <property name="unitManager" value="unitManager"/>
</component>

<component name="acousticModelLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="unitManager" value="unitManager"/>
    <property name="location" value="file:models/nl"/>
</component>

<component name="unitManager" type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>

<component name="liveFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>dataSource</item>
        <item>dataBlocker</item>
        <item>speechClassifier</item>
        <item>speechMarker</item>
        <item>nonSpeechDataFilter</item>
        <item>preemphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>autoCepstrum</item>
        <item>liveCMN</item>
        <item>featureExtraction</item>
        <item>featureTransform</item>
    </propertylist>
</component>

<component name="batchFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>dataSource</item>
        <item>dataBlocker</item>
        <item>preemphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>autoCepstrum</item>
        <item>liveCMN</item>
        <item>featureExtraction</item>
        <item>featureTransform</item>
    </propertylist>
</component>

<component name="dataSource" type="edu.cmu.sphinx.frontend.util.StreamDataSource"/>
<component name="dataBlocker" type="edu.cmu.sphinx.frontend.DataBlocker"/>

<component name="speechClassifier" type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
    <property name="threshold" value="13"/>
</component>

<component name="nonSpeechDataFilter" type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>

<component name="speechMarker" type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker">
    <property name="speechTrailer" value="50"/>
</component>

<component name="preemphasizer" type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower" type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower"/>
<component name="fft" type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>

<component name="autoCepstrum" type="edu.cmu.sphinx.frontend.AutoCepstrum">
    <property name="loader" value="acousticModelLoader"/>
</component>

<component name="batchCMN" type="edu.cmu.sphinx.frontend.feature.BatchCMN"/>
<component name="liveCMN" type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>
<component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>
<component name="featureTransform" type="edu.cmu.sphinx.frontend.feature.FeatureTransform">
    <property name="loader" value="acousticModelLoader"/>
</component>

<component name="confidenceScorer" type="edu.cmu.sphinx.result.MAPConfidenceScorer">
    <property name="languageWeight" value="${languageWeight}"/>
</component>

</config>

Appendix D
Specifications for the .wav Files Used for Training the Acoustic Model

The most important component for training acoustic models is, of course, the audio data. When we use SphinxTrain, as explained in Section 5.6.1, the audio data needs to conform to the .wav file format. For optimal training, it is best for each audio file to contain only one sentence. It is important to note that the audio files we plan to use for training must have the same characteristics as the audio used to build the acoustic model, just as the audio files we perform an alignment task on must match these characteristics as well. The acoustic models provided by Sphinx are typically built for audio with a sampling rate of 16 kHz, 16 bits per sample, and one channel (monophonic).

D.1 SoX Tool

There are several ways to generate these audio files. For example, we can create the audio files ourselves by recording our own voice reading the texts and saving the recordings as .wav files. This can be done with the aforementioned SoX tool. SoX contains a command line tool called rec, which can record input from a microphone and directly save it to a .wav file with the required characteristics:

$> rec -r 16000 -e signed-integer -b 16 -c 1 <<filename>>_0xx.wav

In the example above, speech is recorded into a .wav file, using the default input device (e.g. the microphone), with a sampling rate of 16 kHz (-r 16000), 16 bits per sample (-e signed-integer -b 16), and one channel (-c 1). The tool also offers plenty of extra features: for example, the silence argument can be used to indicate that data should only be written to a file when audio is detected above a certain volume, and that recording may be stopped after a specified number of seconds of silence. It is also possible to automatically start recording to a new file after a specified number of seconds of silence. This is especially useful when we want to record large amounts of text and do not wish to rerun the command for each sentence. See [59] for an extended overview of the possibilities of the rec tool. An extra advantage of the tool is that it is a command line tool, which means it can easily be used in combination with scripting. For example, in the following command (found on the Sphinx wiki [11]), the first 20 lines of the "arctic20.txt" file are shown to the user, one sentence at a time, and the rec command is started to record the corresponding speech. The user can display the next sentence by stopping the rec command, e.g. with CTRL+C.

for i in `seq 1 20`; do
    fn=`printf arctic_%04d $i`;
    read sent; echo $sent;
    rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null;
done < arctic20.txt

D.2 Audacity Software

Of course, we can also use existing audio files to train an acoustic model. These files will need to be split by sentence, and saved and converted to the correct file format. A perfect transcription is also needed. To split and convert audio we can again use the SoX tool, using the silence argument if preferred. It is also possible to use a graphical tool, for example Audacity [3]. Audacity is a free, open-source, cross-platform software package for recording and manipulating sound.
Considering there are typically pauses between sentences, it is often very easy to distinguish sentences in a graphical display. In Audacity, it is possible to export parts of an audio file by simply selecting the relevant part of the audio signal; see Figures D.1 and D.2.

Figure D.1: How to select a sentence using Audacity

The Audacity software contains several functions, e.g. recording audio, noise cancelling, band filters, etc. Note that it can sometimes be useful to allow a little noise in the file, so that the resulting acoustic model will have less trouble aligning audio that was recorded in the same noisy environment.

Figure D.2: How to export a sentence using Audacity

Bibliography

[1] Defense Advanced Research Projects Agency (DARPA)'s Effective, Affordable, Reusable Speech-to-Text (EARS) Kickoff Meeting, Vienna, VA, May 9-10, 2002.
[2] Defense Advanced Research Projects Agency (DARPA)'s Effective, Affordable, Reusable Speech-to-Text (EARS) Conference, Boston, MA, May 21-22, 2003.
[3] Audacity. Audacity software tool. http://audacity.sourceforge.net/.
[4] S. Axelrod, V. Goel, R. Gopinath, P. Olsen, and K. Visweswariah. Discriminative Estimation of Subspace Constrained Gaussian Mixture Models for Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(1):172–189, January 2007.
[5] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe. Global Optimization of a Neural Network-Hidden Markov Model Hybrid. In International Joint Conference on Neural Networks, 1991. IJCNN-91-Seattle., volume 2, pages 789–794, July 1991.
[6] J. Bilmes. Lecture 2: Automatic Speech Recognition. http://melodi.ee.washington.edu/~bilmes/ee516/lecs/lec2_scribe.pdf, 2005.
[7] H. Bourlard and C.J. Wellekens. Links Between Markov Models and Multilayer Perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(12):1167–1178, December 1990.
[8] Carnegie Mellon University. CMU DICT. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
[9] Carnegie Mellon University. CMU Sphinx Documentation. http://cmusphinx.sourceforge.net/doc/sphinx4/.
[10] Carnegie Mellon University. CMU Sphinx Forum. http://cmusphinx.sourceforge.net/wiki/communicate/.
[11] Carnegie Mellon University. CMU Sphinx Wiki. http://cmusphinx.sourceforge.net/wiki/.
[12] Carnegie Mellon University. CMU SphinxBase. http://sourceforge.net/projects/cmusphinx/files/sphinxbase/0.8/.
[13] Carnegie Mellon University. CMU SphinxTrain. http://sourceforge.net/projects/cmusphinx/files/sphinxtrain/1.0.8/.
[14] L. Carriço, C. Duarte, R. Lopes, M. Rodrigues, and N. Guimarães. Building rich user interfaces for digital talking books. In Robert. J.K., Q. Limbourg, and J. Vanderdonckt, editors, Computer-Aided Design of User Interfaces IV, pages 335–348. Springer Netherlands, 2005.
[15] S. Cassidy. Chapter 9. Feature Extraction for ASR. http://web.science.mq.edu.au/~cassidy/comp449/html/ch09s02.html.
[16] CGN. Corpus Gesproken Nederlands. http://lands.let.ru.nl/cgn/ehome.htm.
[17] DAISY Consortium. Digital Accessible Information SYstem. http://www.daisy.org.
[18] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-Dependent Pre-Trained Deep Neural Networks for Large Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, January 2012.
[19] B. De Meester, R. Verborgh, P. Pauwels, W. De Neve, E. Mannens, and R. Van de Walle. Improving Multimedia Analysis Through Semantic Integration of Services. In 4th FTRA International Conference on Advanced IT, Engineering and Management, Abstracts, page 2. Future Technology Research Association (FTRA), 2014.
[20] M. De Wachter, M. Matton, K. Demuynck, P. Wambacq, R. Cools, and D. Van Compernolle. Template-Based Continuous Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1377–1390, May 2007.
[21] H. Dewey-Hagborg. Speech Recognition. http://www.deweyhagborg.com/learningBitByBit/speech.ppt.
[22] EEL6586. A simple coder to work in real-time between two PCs. http://plaza.ufl.edu/hsuzh/#report.
[23] D. Eldridge. Have you heard? Audiobooks are booming. Book Business: your source for publishing intelligence, 17(2):20–25, April 2014.
[24] M. Franzini, K.-F. Lee, and A. Waibel. Connectionist Viterbi Training: A New Hybrid Method for Continuous Speech Recognition. In International Conference on Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., volume 1, pages 425–428, April 1990.
[25] J. R. Glass. Challenges For Spoken Dialogue Systems. In Proceedings of 1999 IEEE ASRU Workshop, 1999.
[26] N. Gupta, G. Tur, D. Hakkani-Tur, S. Bangalore, G. Riccardi, and M. Gilbert. The AT&T Spoken Language Understanding System. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):213–222, January 2006.
[27] P. Haffner, M. Franzini, and A. Waibel. Integrating Time Alignment and Neural Networks for High Performance Continuous Speech Recognition. In International Conference on Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., volume 1, pages 105–108, April 1991.
[28] H. Hermansky. Perceptual Linear Predictive (PLP) Analysis of Speech. The Journal of the Acoustical Society of America, 87(4):1738–1752, May 1990.
[29] G. Hinton, L. Deng, D. Yu, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, G. Dahl, and B. Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, November 2012.
[30] D. Ho. Notepad++ Editor. http://notepad-plus-plus.org/.
[31] X. Huang, Y. Ariki, and M. Jack. Hidden Markov Models for Speech Recognition. Columbia University Press, New York, NY, USA, 1990.
[32] X. Huang, J. Baker, and R. Reddy. A Historical Perspective of Speech Recognition. Communications of the ACM, 57(1):94–103, 2014.
[33] M.-Y. Hwang and X. Huang. Shared-Distribution Hidden Markov Models for Speech Recognition. IEEE Transactions on Speech and Audio Processing, 1(4):414–420, October 1993.
[34] H. Jiang, X. Li, and C. Liu. Large Margin Hidden Markov Models for Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1584–1595, September 2006.
[35] A. Katsamanis, M.P. Black, P. G. Georgiou, L. Goldstein, and S. Narayanan. SailAlign: Robust long speech-text alignment. In Proc. of Workshop on New Tools and Methods for Very-Large Scale Phonetics Research, January 2011.
[36] S. M. Katz. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. In IEEE Transactions on Acoustics, Speech and Signal Processing, pages 400–401, 1987.
[37] C. Kim and R. M. Stern. Feature Extraction for Robust Speech Recognition using a Power-Law Nonlinearity and Power-Bias Subtraction, 2009.
[38] D. H. Klatt. Readings in Speech Recognition, chapter Review of the ARPA Speech Understanding Project, pages 554–575. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
[39] Kyoto University. Julius LVCSR. http://julius.sourceforge.jp/en_index.php.
[40] L. Lamel and J.-L. Gauvain. Speech Processing for Audio Indexing. In B. Nordström and A. Ranta, editors, Advances in Natural Language Processing, volume 5221 of Lecture Notes in Computer Science, pages 4–15. Springer Berlin Heidelberg, 2008.
[41] P. Lamere, P. Kwok, W. Walker, E. Gouvêa, R. Singh, B. Raj, and P. Wolf. Design of the CMU Sphinx-4 Decoder. In 8th European Conference on Speech Communication and Technology (Eurospeech), 2003.
[42] E. Levin. Word Recognition Using Hidden Control Neural Architecture. In International Conference on Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., volume 1, pages 433–436, April 1990.
[43] H. Lin, J. Bilmes, D. Vergyri, and K. Kirchhoff. OOV Detection by Joint Word/Phone Lattice Alignment. In IEEE Workshop on Automatic Speech Recognition Understanding, 2007. ASRU., pages 478–483, December 2007.
[44] Mississippi State University. ISIP ASR System. http://www.isip.piconepress.com/projects/speech/.
[45] N. Morgan and H. Bourlard. Continuous Speech Recognition Using Multilayer Perceptrons with Hidden Markov Models. In International Conference on Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., volume 1, pages 413–416, April 1990.
[46] National Institute of Standards and Technology (NIST). The History of Automatic Speech Recognition Evaluations at NIST. http://www.itl.nist.gov/iad/mig/publications/ASRhistory/.
[47] L.T. Niles and H.F. Silverman. Combining Hidden Markov Model and Neural Network Classifiers. In International Conference on Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., volume 1, pages 417–420, April 1990.
[48] N.V. Shmyryov. Free speech database voxforge.org. http://translate.google.ca/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.dialog-21.ru%2Fdialog2008%2Fmaterials%2Fhtml%2F90.htm&sl=ru&tl=en.
[49] G. Oppy and D. Dowe. The Turing Test. http://plato.stanford.edu/entries/turing-test/.
[50] D. O'Shaughnessy. Speech Communications: Human and Machine. Institute of Electrical and Electronics Engineers, 2000.
[51] D. O'Shaughnessy. Invited Paper: Automatic Speech Recognition: History, Methods and Challenges. Pattern Recognition, 41(10):2965–2979, 2008.
[52] L. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.
[53] L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, 1986.
[54] T. Schlippe. Pronunciation Modeling. http://csl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/MMMK-PP12-PronunciationModeling-SS2012.pdf, 2012.
[55] K. Schutte and J. Glass. Speech Recognition with Localized Time-Frequency Pattern Detectors. In IEEE Workshop on Automatic Speech Recognition Understanding, 2007. ASRU., pages 341–346, December 2007.
[56] A. Serralheiro, I. Trancoso, T. Chambel, L. Carriço, and N. Guimarães. Towards a repository of digital talking books. In Eurospeech, 2003.
[57] R. Solera-Ureña, D. Martín-Iglesias, A. Gallardo-Antolín, C. Peláez-Moreno, and F. Díaz-de-María. Robust ASR Using Support Vector Machines. Speech Communication, 49(4):253–267, April 2007.
[58] SoX. SoX Sound eXchange. http://sox.sourceforge.net/.
[59] SoX. SoX Sound eXchange Options. http://sox.sourceforge.net/sox.html.
[60] CMU Sphinx. Adapting the Default Acoustic Model. http://cmusphinx.sourceforge.net/wiki/tutorialadapt.
[61] J. Tebelskis, A. Waibel, B. Petek, and O. Schmidbauer. Continuous Speech Recognition Using Linked Predictive Neural Networks. In International Conference on Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., volume 1, pages 61–64, April 1991.
[62] The International Engineering Consortium. Speech-Enabled Interactive Voice Response Systems. http://www.uky.edu/~jclark/mas355/SPEECH.PDF.
[63] L. Tóth, B. Tarján, G. Sárosi, and P. Mihajlik. Speech recognition experiments with audiobooks. Acta Cybernetica, 19(4):695–713, January 2010.
[64] E. Trentin and M. Gori. A Survey of Hybrid ANN/HMM Models for Automatic Speech Recognition. Neural Computing, 37(14):91–126, 2001.
[65] K. P. Truong and D. A. van Leeuwen. Automatic Discrimination Between Laughter and Speech. Speech Communication, 49(2):144–158, 2007.
[66] C.J. Van Heerden, F. De Wet, and M.H. Davel. Automatic alignment of audiobooks in Afrikaans. In PRASA 2012, CSIR International Convention Centre, Pretoria. PRASA, November 2012.
[67] A. Waibel. Modular Construction of Time-Delay Neural Networks for Speech Recognition. Neural Computing, 1(1):39–46, March 1989.
[68] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K.J. Lang. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3):328–339, March 1989.
[69] Z.-Y. Yan, Q. Huo, and J. Xu. A Scalable Approach to Using DNN-Derived Features in GMM-HMM Based Acoustic Modeling for LVCSR. In International Speech Communication Association, August 2013.
[70] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book. Cambridge University Engineering Department, 2006.
[71] D. Yu, F. Seide, and G. Li. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. In ICML, June 2012.

List of Figures

1.1 The distribution of phone recognition accuracy as a function of the speaker on the MTBA corpus; figure taken from [66] (p. 5)
2.1 An overview of historical progress on machine speech recognition performance; figure taken from [46] (p. 9)
2.2 System diagram of a speech recognizer based on statistical models, including training and decoding processes; figure adapted from [40] (p. 11)
2.3 LPC speech production scheme (p. 14)
2.4 MFCC feature extraction procedure; figure adapted from [6] (p. 14)
2.5 PLP feature extraction procedure; figure adapted from [6] (p. 15)
2.6 Possible pronunciations of the word 'and'; figure adapted from [6] (p. 20)
2.7 Steps in pronunciation modelling; figure adapted from [54] (p. 20)
2.8 Links between pronunciation dictionary, audio and text; figure adapted from [54] (p. 21)
2.9 Hidden Markov Model; figure adapted from [70] (p. 23)
3.1 High-level architecture of CMU Sphinx-4; figure adapted from [41] (p. 27)
3.2 High-level design of CMU Sphinx front end; figure adapted from the Sphinx documentation [9] (p. 28)
3.3 Basic flow chart of how the components of Sphinx-4 fit together; figure adapted from [21] (p. 31)
3.4 Global properties (p. 33)
3.5 Recognizer and Decoder components (p. 34)
3.6 ActiveList component (p. 35)
3.7 Pruner and Scorer configurations (p. 36)
3.8 Linguist component (p. 36)
3.9 Grammar component (p. 38)
3.10 Dictionary configuration (p. 39)
3.11 Acoustic model configuration (p. 40)
3.12 Additions (p. 42)
3.13 Front end configuration (p. 43)
3.14 The relation between original data size, window size and window shift; figure adapted from the Sphinx documentation [9] (p. 44)
3.15 A mel-filter bank; figure adapted from the Sphinx documentation [9] (p. 44)
3.16 Layout of the returned features; figure adapted from the Sphinx documentation [9] (p. 45)
3.17 Front end pipeline elements (p. 46)
3.18 Example of monitors (p. 47)
4.1 High-level view of our application (p. 50)
4.2 UML scheme of the pluginsystem.plugin project, including the class that links the plugin to our application, namely SphinxLongSpeechRecognizer (p. 51)
4.3 UML scheme of the pluginsystem project (p. 53)
4.4 The commands used for testing accuracy (p. 54)
4.5 UML scheme of the entire application (p. 56)
4.6 Example of configuration output for book "Ridder Muis" (p. 57)
4.7 Example of path and file names for book "Ridder Muis" (p. 58)
4.8 Example of an .srt file content; taken from the .srt output for "Moby Dick" (p. 59)
4.9 Example part of an EPUB file, generated by our application (p. 60)
5.1 Chart containing the size of the input text file, and length of the input audio files for both normal and slow pace (p. 62)
5.2 Percentage of extra audio length for the books read at slow pace, compared to the normal pace audio file length (p. 63)
5.3 Example of a word and its start and stop times, in .srt file format (p. 64)
5.4 Example of a word and its start and stop times, in .srt file format (p. 64)
5.5 Processing times for the automatic alignment performed on normal pace books (p. 66)
5.6 The mean start and stop time differences between the automatically generated alignment and the Playlane timings, for the books read at normal pace (p. 67)
5.7 The acceptable mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at normal pace (p. 67)
5.8 The mean start and stop time differences between the automatically generated alignment and the Playlane timings, for the books read at slow pace (p. 69)
5.9 Example of a warning for a missing word (p. 69)
5.10 The mean start time differences between the automatically generated alignment using the dictionaries with words missing, and using the dictionaries with missing words added, for books read at slow and normal pace (p. 70)
5.11 The mean start and stop, and maximum time differences between the automatically generated alignment and the PlayLane timings for the normal pace "Ridder Muis" book, using a dictionary that is missing the word "muis" (p. 71)
5.12 The mean start and stop, and maximum time differences between the automatically generated alignment and the PlayLane timings for the normal pace "Ridder Muis" book, with missing word "muis" in the input text (p. 72)
5.13 Time deviations for each word between the automatically generated alignment and the Playlane timings, for the normal pace "Ridder Muis" book (p. 73)
5.14 Example contents of a .fileids file (p. 74)
5.15 Example contents of a .transcription file (p. 75)
5.16 Example structure of the "adapted model" folder (p. 77)
5.17 Mean start time difference of each normal pace book, using the original acoustic model, the acoustic model trained on "Wolf heeft jeuk", or the acoustic model trained on "De luie stoel" (p. 78)
5.18 Mean start time difference of each slow pace book, using the original acoustic model, the acoustic model trained on "Wolf heeft jeuk", or the acoustic model trained on "De luie stoel" (p. 79)
5.19 The mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at normal pace (p. 80)
5.20 The acceptable mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at normal pace (p. 80)
5.21 The mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at slow pace (p. 81)
5.22 The mean start time differences for the Sphinx-4.5 plugin, for the books read at normal pace, using the three different acoustic models (p. 82)
5.23 The mean start time differences for the Sphinx-4.5 plugin, for the six books read at normal pace that usually achieve acceptable accuracy, using the three different acoustic models (p. 82)
5.24 The mean start time differences for the Sphinx-4.5 plugin, for the books read at slow pace, using the three different acoustic models (p. 83)
B.1 The mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at slow pace (p. 93)
B.2 The mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at slow pace (p. 94)
D.1 How to select a sentence using Audacity (p. 102)
D.2 How to export a sentence using Audacity (p. 103)