Design of a Plugin-based Approach for Word-per-Word
Alignment Using Automatic Speech Recognition
Joke Van den Mergele
Supervisor: Prof. dr. ir. Rik Van de Walle
Counsellors: Ir. Tom De Nies, Ir. Miel Vander Sande, Dr. Wesley De Neve, dhr. Kris
Carron
Master's dissertation submitted in order to obtain the academic degree of
Master of Science in de ingenieurswetenschappen: computerwetenschappen
Department of Electronics and Information Systems
Chairman: Prof. dr. ir. Jan Van Campenhout
Faculty of Engineering and Architecture
Academic year 2013-2014
Word of Thanks
Firstly, I would like to thank my supervisor, prof. dr. ir. Rik Van de Walle, for making
the research for this dissertation possible. I would also like to thank my associate
supervisors, dr. Wesley De Neve and ir. Tom De Nies, for their feedback and inspiration
during this year, as well as Kris Carron, without whom this dissertation would never have
existed.
I would like to thank my parents for giving me the opportunity to take up these studies, and my
partner, Niels, for his support and comfort when the end did not seem in sight.
Dankwoord
Ik zou als eerste graag mijn promotor, prof. dr. ir. Rik Van de Walle, bedanken om het
onderzoek naar deze thesis goed te keuren. Verder zou ik ook graag mijn begeleiders dr.
Wesley De Neve en ir. Tom De Nies bedanken voor hun feedback en motivatie gedurende
het hele jaar, alsook Kris Carron, aangezien deze thesis nooit mogelijk zou zijn geweest
zonder hem.
Mijn ouders wil ik ook graag bedanken omdat ze me de mogelijkheid hebben gegeven
deze studies te volgen. Als laatste wil ik mijn partner Niels bedanken voor zijn steun en
comfort op momenten wanneer het einde nog lang niet in zicht leek.
Joke Van den Mergele, June 2014
Usage Permission
“The author gives permission to make this master dissertation available for consultation
and to copy parts of this master dissertation for personal use. In the case of any other
use, the limitations of the copyright have to be respected, in particular with regard to the
obligation to state expressly the source when quoting results from this master dissertation.”
Joke Van den Mergele, June 2014
Design of a Plugin-based Approach for Word-per-Word
Alignment Using Automatic Speech Recognition
by
Joke Van den Mergele
Dissertation submitted for obtaining the degree of
Master in Computer Science Engineering
Academic year 2013-2014
Ghent University
Faculty of Engineering
Department of Electronics and Information Systems
Head of the Department: prof. dr. ir. J. Van Campenhout
Supervisor: prof. dr. ir. R. Van de Walle
Associate supervisors: dr. W. De Neve, ir. T. De Nies, ir. M. Vander Sande
Summary
In this dissertation we present the design of a plugin-based system to perform automatic speech-text alignment. Our goal is to investigate whether performing an alignment
task automatically, instead of manually, is possible with current state-of-the-art open-source automatic speech recognition systems. We test our application on Dutch audiobooks and text, using the CMU Sphinx automatic speech recognition system as plugin.
We were provided with test data by the Playlane company, one of the companies
that currently align their audio and text manually. Dutch is a slightly undersourced language
when it comes to the data necessary for speech recognition. However, from our test results,
we conclude that it is indeed possible to automatically generate an alignment of audio and
text that is accurate enough for use, when the CMU Sphinx ASR system is used to perform
the alignment.
Samenvatting
In deze thesis presenteren we een ontwerp van een plugin-gebaseerd systeem om automatische spraak-tekst uitlijning uit te voeren. Ons doel is om te onderzoeken of het mogelijk is
een uitlijningstaak automatisch uit te voeren, in plaats van manueel, met de huidige opensource automatische spraakherkenningssystemen. We testen ons ontwerp op Nederlandse
audioboeken en tekst, gebruikmakend van het CMU Sphinx automatisch spraakherkenningssysteem als plugin. We werden voorzien van testdata door het Playlane bedrijf, dat
één van de bedrijven is die hun audio en tekst manueel uitlijnen. Nederlands is een weinig
voorkomende taal als het gaat over nodige data voor spraakherkenning. We kunnen echter
concluderen van onze testresultaten dat het inderdaad mogelijk is om automatisch een
uitlijning van audio en tekst te genereren die nauwkeurig genoeg is voor gebruik, wanneer
we het CMU Sphinx ASR systeem gebruiken om de uitlijning te genereren.
Keywords: automatic speech recognition (ASR), CMU Sphinx, Dutch audio, speech-text alignment
Trefwoorden: automatische spraakherkenning (ASH), CMU Sphinx, Nederlandse audio,
spraak-tekst uitlijning
Design of a Plugin-based Approach for
Word-per-Word Alignment Using Automatic
Speech Recognition
Joke Van den Mergele
Supervisor(s): prof. dr. ir. Rik Van de Walle, dr. Wesley De Neve, ir. Tom De Nies, ir. Miel Vander Sande
Abstract— In this paper we present the design of a plugin-based system to perform automatic speech-text alignment. Our goal is to investigate
whether performing an alignment task automatically, instead of manually,
is possible with current state-of-the-art open-source automatic speech
recognition systems. We test our application on Dutch audiobooks and text,
using the CMU Sphinx [1] automatic speech recognition system as plugin.
Dutch is a slightly undersourced language when it comes to data necessary
for speech recognition. However, from our test results, we conclude that
it is indeed possible to automatically generate an alignment of audio and
text that is accurate enough for use, using the CMU Sphinx ASR system to
perform the alignment.
Keywords— automatic speech recognition (ASR), CMU Sphinx, Dutch
audio, speech-text alignment
I. INTRODUCTION
Automatic speech recognition (ASR) has been the subject of
research for over 50 years, ever since Bell Labs performed
their very first small-vocabulary speech recognition tests during
the 1950s, trying to automatically recognize digits over the telephone [2].
As computing power grew during the 1960s, filter banks were
combined with dynamic programming to produce the first practical speech recognizers. These were mostly for isolated words,
to simplify the task. In the 1970s, much progress arose in commercial small-vocabulary applications over the telephone, due to
the use of custom special-purpose hardware. Linear Predictive
Coding (LPC) became a dominant automatic speech recognition
component, as an automatic and efficient method to represent
speech.
Core ASR methodology has evolved from expert-system approaches in the 1970s, using spectral resonance (formant) tracking, to the modern statistical method of Markov models based
on a Mel-Frequency Cepstral Coefficient (MFCC) approach [2].
Since the 1980s the standard has been Hidden Markov Models
(HMMs), which have the power to transform large numbers of
training units into simpler probabilistic models [3, 4].
During the 1990s, commercial applications evolved from
isolated-word dictation systems to general-purpose continuous-speech systems. ASR has been largely implemented in software,
e.g., for medical reporting, legal dictation, and the automation of telephone services.
With the recent adoption of speech recognition in Apple,
Google, and Microsoft products, the ever-improving ability of
devices to handle relatively unrestricted multimodal dialogues
is showing clearly. Despite the remaining challenges, the fruits
of several decades of research and development in the speech
recognition field can now be seen. As Huang, Baker, and
Reddy [5] said: "We believe the speech community is en route
to pass the Turing Test1 in the next 40 years with the ultimate
goal to match and exceed a human's speech recognition capability for everyday scenarios."
But even now, there is no such thing as a perfect speech recognition system. Each system has its own limitations and conditions that allow it to perform optimally. However, what has the
biggest influence on how a speech recognition system performs
is how well trained the acoustic model is, and what specific
task it is trained for [4]. For example, an acoustic model can be
trained on one person specifically (or be updated to better recognize that person's speech); it can be trained to perform well
on broadcast speech, on telephone conversations, on a certain
accent, or on certain words if only commands must be recognized,
etc. Thus, if one has access to an automatic speech recognition
system, and knows which task one needs to use it for, it would
be very useful to train the acoustic model according to that task.
With the recent boom in audiobooks [7], a whole new set
of training data is available, as audiobooks are recorded under
optimal conditions and the text that is read is obtainable for each
book. But there is also another way these audiobooks might
be used by a speech recognition system: why not align these
audiobooks with their book content, using a speech recognition
system, and so automate the process of creating digital books
that contain both audio and textual content?
In the context of this project, we worked together with the
Playlane company (now Cartamundi Digital2), which was kind
enough to provide us with test data for our application. The
Playlane company digitizes children’s books, adding games and
educational content. One part of this educational content is the
alignment between the book’s text and the read-aloud version of
the book. Currently, this alignment is done entirely manually.
Our goal is to provide companies like this, or people in need of
aligning audio and text, with a way to achieve this alignment
automatically, using ASR systems.
The remainder of this paper is structured as follows: first,
we provide some insight into interesting related work we came
across, in Section II. We then give a general overview of
automatic speech recognition systems in Section III. We briefly
present how they are generally built and the building blocks
they consist of, and point out the blocks that are most
important when working with our own application. In Section IV, we provide a summary of the ASR plugin system we
applied in our application. A high-level view of how our application was designed, and why we made those design decisions, is discussed in Section V. We also provide guidelines for the reader
to get the most accurate results when using our application. In
Section VI we discuss the accuracy of the speech-text transcriptions from our application, and what we changed to improve the
accuracy. The conclusions we drew from our results, and the
future work, are described in Section VII.

1 The phrase "The Turing Test" is most properly used to refer to a proposal made
by Turing (1950) as a way of dealing with the question whether machines can
think [6], and is now performed as a test to determine the 'humanity' of the
machine.

2 We will refer to the company as "Playlane" instead of "Cartamundi Digital",
since Cartamundi Digital envelops many companies, and we only work with the
products created by Playlane.
II. RELATED WORK
In [8], the authors report on the automatic alignment of audiobooks in Afrikaans. They use an already existing Afrikaans
pronunciation dictionary and create an acoustic model from an
Afrikaans speech corpus. They use the book “Ruiter in die Nag”
by Mikro to partly train their acoustic model, and to perform
their tests on. Their goal is to align large Afrikaans audio files at
word level, using an automatic speech recognition system. They
developed three different automatic speech recognition systems
to be able to compare these and discover which performs best;
all three of them are built using the HTK toolkit [9]. To determine
the accuracy of their automatic alignment results, they compare
the difference in the final aligned starting position of each word
with an estimate of the starting position they obtained by using
phoneme recognition. They discovered that the main causes of
alignment errors are:
• speaker errors, such as hesitations, missing words, repeated
words, stuttering, etc.;
• rapid speech containing contractions;
• difficulty in identifying the starting position of very short
(one- or two-phoneme) words; and,
• a few text normalization errors (e.g. ‘eenduisend negehonderd’ for ‘neeentienhonderd’).
Their final conclusions are that the baseline acoustic model does
provide a fairly good alignment for practical purposes, but that
the model that was trained on the target audiobook provided the
best alignment results.
Their research is interesting to us because, just like
Afrikaans, Dutch is a slightly undersourced language (though
not as undersourced as Afrikaans), despite the large efforts made
by the Spoken Dutch Corpus (Corpus Gesproken Nederlands,
CGN) [10]. For example, the acoustic models of Voxforge [11]
we use, both for English and Dutch speech recognition, contain
around 40 hours of speech from over a hundred speakers for English,
but only 10 hours of speech for Dutch. The main causes of
alignment errors they discovered are, of course, also interesting
for us to know, since we can present these to the users of our system and create awareness. However, as the books are read under
professional conditions, it is unlikely that there will be many
speaker errors. The third interesting fact they researched was
that training the acoustic model on part of the target audiobook
provides the best alignment results of their tested models. We
will also try to achieve this; however, we will train the acoustic
model on other audiobooks read by the same person, preferably
a book with the same reading difficulty classification as the target audiobook, as these have similar pauses and word lengths.
The authors of [12] try out the alignment capabilities of their
recognition system under near-ideal conditions, i.e. on audiobooks. They also created three different acoustic models, one
trained on manual transcriptions, one trained on the audiobooks at syllable level, and one trained on the audiobooks at
word level. They draw the same conclusions as the authors
of [8], namely that training the acoustic models on the target
audiobooks provides better results, as well as that aligning audiobooks (which are recorded under optimal conditions) is 'easier'
than aligning real-life speech with background noises or distortions. They also performed tests using acoustic models that were
completely speaker-independent, slightly adapted and trained on
a specific speaker, and completely trained on a specific speaker.
It may come as no surprise that they discovered that the acoustic model that was trained on a certain person provided almost
perfect alignment of a text spoken by that person.
However, the part of this article that is most interesting to us is that they quantify the sensitivity of a speech recognizer to the articulation characteristics and peculiarities of the
speaker. The recognition accuracy results for each speaker show
a large deviation in both directions from the average accuracy value of about 74%. They believe the reason
for this high deviation in the scores can mostly be blamed on
the sensitivity of the recognizer to the actual speaker’s voice. It
would thus be a good idea to train an acoustic model for each
voice actor a company such as Playlane works with, or at least
adapt the acoustic model we use to their voice actors by training
it on their speech, if it appears the results we achieve with
our speech recognition application are suboptimal.
III. AUTOMATIC SPEECH RECOGNITION
The main goal of speech recognition is to find the most likely
word sequence, given the observed acoustic signal. Solving the
speech decoding problem then consists of finding the maximum
of the probability of the word sequence w given signal x, or,
equivalently, maximizing the "fundamental equation of speech
recognition", Pr(w) f(x|w).
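Written out, the decoding criterion described here is the usual maximum a posteriori rule; the derivation below is the standard textbook formulation, not a formula specific to any one system:

\hat{w} = \arg\max_{w} \Pr(w \mid x)
        = \arg\max_{w} \frac{\Pr(w)\, f(x \mid w)}{f(x)}
        = \arg\max_{w} \Pr(w)\, f(x \mid w),

since f(x) does not depend on the word sequence w. Pr(w) is supplied by the language model and f(x|w) by the acoustic model, as described below.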
Most state-of-the-art automatic speech recognition systems
use statistical models. This means that speech is assumed to be
generated by a language model and an acoustic model. The language model generates estimates of Pr(w) for all word strings
w and depends on high-level constraints and linguistic knowledge about allowed word strings for the specific task. The acoustic model encodes the message w in the acoustic signal x, which
is represented by a probability density function f(x|w). It describes the statistics of sequences of parametrized acoustic observations in the feature space, given the corresponding uttered
words.
The authors of [13] divide such a speech recognition system into several components. The main knowledge sources are
the speech and text corpus, which represent the training data,
and the pronunciation dictionary. The training of the acoustic
and language model relies on the normalisation and preprocessing, such as N-gram estimation and feature extraction, of the
training data. This helps to reduce lexical variability and transforms the texts to better represent the spoken language. However, this step is language specific. It includes rules on how
to process numbers, hyphenation, abbreviations and acronyms,
apostrophes, etc.
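As a small, assumption-laden illustration of what such language-specific normalization rules can look like in code (a sketch of our own, not the preprocessing actually used by the toolkits discussed here), consider:

import java.text.Normalizer;
import java.util.Locale;

// Minimal sketch of language-specific text normalization for alignment input.
// The rules below (lower-casing, dropping punctuation, expanding one Dutch
// apostrophe form, spelling out two digits) are illustrative assumptions only.
public class TextNormalizer {
    public static String normalize(String line) {
        String s = Normalizer.normalize(line, Normalizer.Form.NFC); // fix decomposed accents
        s = s.toLowerCase(Locale.forLanguageTag("nl"));
        s = s.replace("'t ", " het ");                          // expand a common Dutch contraction
        s = s.replaceAll("[\"“”‘’.,!?;:()]", " ");              // punctuation carries no speech
        s = s.replace(" 1 ", " een ").replace(" 2 ", " twee "); // naive digit spell-out
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\"Kijk,\" zei hij, \"'t is 2 uur!\""));
    }
}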
After training, the resulting acoustic and language model are
used for the actual speech decoding. The input speech signal is
first processed by the acoustic front end, which usually performs
feature extraction, and then passed on to the decoder. With the
language model, acoustic model and pronunciation dictionary at
its disposal, the decoder is able to perform the actual speech recognition and returns the speech transcription to the user.
According to [14], the different parts discussed above can
be grouped into the so-called five basic stages of ASR:
1. Signal Processing/Feature Extraction This stage represents
the acoustic front end. The same techniques are also used on the
speech corpus, for the feature extraction. For our application, we
use Mel-frequency cepstral coefficients (MFCC) [14] to perform
feature extraction.
2. Acoustic Modelling This stage encompasses the different
steps needed to build the acoustic model. Our acoustic models are trained using hidden Markov models (HMMs) [15].
3. Pronunciation Modelling This stage creates the pronunciation dictionary, which is used by the decoder.
4. Language Modelling In this stage, the language model is
created. The last, and most important, step in its creation is the
N-gram estimation (a standard formulation is given after this list).
5. Spoken Language Understanding/Dialogue Systems This
stage refers to the entire system that is built and how it interacts
with the user.
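The N-gram estimation mentioned in stage 4 can be made concrete with the standard maximum-likelihood estimate (the textbook formulation; real toolkits additionally apply smoothing, which is not shown here):

\hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})},

where C(\cdot) counts how often a word sequence occurs in the training text.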
IV. THE ASR PLUGIN USED FOR OUR APPLICATION

CMU Sphinx is the general term used to describe a group of speech
recognition systems developed at Carnegie Mellon University
(CMU). They include a series of speech recognizers (Sphinx-2
through 4) and an acoustic model trainer (SphinxTrain).
In 2000, the Sphinx group at Carnegie Mellon committed to
open-sourcing several speech recognizer components, including
Sphinx-2 and, a year later, Sphinx-3. The speech decoders come
with acoustic models and sample applications. The available resources include software for acoustic model training, language
model compilation and a public-domain pronunciation dictionary for English, "cmudict".
The Sphinx-4 speech recognition system [1] is the latest addition to Carnegie Mellon University's repository of Sphinx
speech recognition systems. It has been jointly designed by
Carnegie Mellon University, Sun Microsystems Laboratories,
Mitsubishi Electric Research Labs, and Hewlett-Packard's Cambridge Research Lab.
It differs from the earlier CMU Sphinx systems in terms
of modularity, flexibility and algorithmic aspects. It uses newer
search strategies, and is universal in its acceptance of various
kinds of grammars, language models, types of acoustic models and feature streams. Sphinx-4 is developed entirely in the
Java programming language and is thus very portable. It
also enables and uses multi-threading and permits highly flexible user interfacing.
We make use of the latest Sphinx addition, Sphinx-4, in our
system, but our Sphinx configuration uses a Sphinx-3 loader to
load the acoustic model in the decoder module. The high-level
architecture of CMU Sphinx-4 is fairly straightforward. The
three main blocks are the front end, the decoder, and the knowledge base, which are all controllable by an external application
that provides the input speech and transforms the output to the
desired format, if needed. The Sphinx-4 architecture is designed
with a high degree of modularity. All blocks are independently
replaceable software modules, except for the blocks within the
knowledge base, and are written in Java.
For more information about the several Sphinx-4 modules,
we refer to [16], and to the Sphinx-4 source code and documentation [1, 17].

V. OUR APPROACH

A. High-Level View

Our application is written entirely in the Java language, and
consists of three separate components, see Figure 1.

Fig. 1. High-level view of our application: a Main component, a Plugin Knowledge component, and an ASR Plugin component
• The first component is the Main component. This is where
the main functionality of our application is located: it parses the
command line, chooses the plugin, and loads it into the application (as the plugin is located
in a different component, see below). It also contains the testing
framework.
• The second component is the Plugin Knowledge component,
which contains all the functionality one needs to implement the
actual plugin. It provides the user with two possible output formats, namely a standard subtitle file (the .srt file format, illustrated below) and an
EPUB file. This component receives the audio and text input
from the main component, and passes it along to the ASR plugin component.
• The third component is where the ASR plugin is actually located. We refer to this component as the 'plugin component',
since it contains the ASR system.
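For reference, a cue in the .srt subtitle format mentioned above consists of a running index, a start and stop timecode, and the text of the cue; the two cues below are purely illustrative (invented timings, not taken from the PlayLane data), shown with one word per cue as used for word-per-word alignment:

1
00:00:01,200 --> 00:00:01,540
Ridder

2
00:00:01,540 --> 00:00:01,900
muis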
We decided to split our application into these three components
to keep the addition of a new plugin to the application as easy
as possible. If a person wants to change the ASR system that is
used by our application, they only need to provide a link from
the second component to the plugin component. By splitting up
the first and second component, they do not need to work out
all the extra functionality and modules that are used by the first
component, which have no impact on the ASR plugin whatsoever, and can start working even if they are only provided with the
second component.
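To illustrate the boundary between the second and third component, the sketch below shows what such a link could look like in Java; the names AsrPlugin and WordTiming are hypothetical, invented for this illustration, and are not the actual types in our code base:

import java.io.File;
import java.util.List;

// Hypothetical contract between the Plugin Knowledge component and an ASR
// plugin component; all names are invented for illustration purposes.
interface AsrPlugin {
    // Returns one timing entry per word of 'text', aligned against 'audio'.
    List<WordTiming> align(File audio, String text) throws Exception;
}

// One aligned word: its text plus start and stop times in milliseconds.
class WordTiming {
    final String word;
    final long startMs;
    final long stopMs;

    WordTiming(String word, long startMs, long stopMs) {
        this.word = word;
        this.startMs = startMs;
        this.stopMs = stopMs;
    }
}

A new ASR system then only has to provide an implementation of this single entry point; the Main component and the output writers (.srt, EPUB) can remain untouched.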
B. Guidelines to Increase Alignment Accuracy

We describe a number of characteristics the audio and text
data must satisfy for our application and, mainly, the Sphinx
ASR plugin, to work as accurately as possible.
• Firstly, the input audio file needs to conform to a number of
characteristics: it needs to be monophonic, have a sampling rate
of 16 kHz, and each sample must be encoded in 16 bits, little endian. We use a small tool called SoX to achieve this [18] (a Java wrapper for this command is sketched after this list):

$> sox "inputfile" -c 1 -r 16000 -b 16 --endian little "outputfile.wav"

This tool is also useful to cut long audio files into smaller chunks
(an audio file length of around 30 minutes is preferable to create a
good alignment).
• The input text file that contains the text that needs to be
aligned with the audio file should preferably be in a simple text
format, such as .txt. It does, however, need to be encoded in
UTF-8. This is usually already the case, but it can
easily be verified and applied in source code editors, such as
notepad++ [19]. This is needed to correctly interpret the special characters that might be present in the text, such as quotes,
accented letters, etc.
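The audio conversion from the first guideline can also be scripted from Java code; the sketch below simply shells out to the same sox command as above and assumes the sox binary [18] is installed and available on the PATH:

import java.io.File;
import java.io.IOException;

// Sketch: convert an input file to a monophonic, 16 kHz, 16-bit little-endian
// WAV file by invoking the SoX command shown above. Assumes 'sox' is on the
// PATH and that the output file name ends in .wav.
public class AudioConverter {
    public static void toAlignmentWav(File in, File out)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "sox", in.getAbsolutePath(),
                "-c", "1",            // one channel (monophonic)
                "-r", "16000",        // 16 kHz sampling rate
                "-b", "16",           // 16 bits per sample
                "--endian", "little", // little-endian samples
                out.getAbsolutePath());
        pb.inheritIO();
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException("sox exited with status " + exit);
        }
    }
}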
VI. RESULTS

We were provided with 10 Dutch books by the PlayLane company, which we were able to use to verify our application.
The subtitle files that were provided by the PlayLane company
were made manually by the employees of PlayLane: they listen
to the audio track and manually set the timings for each word.
To verify the accuracy of our application we needed books
that already had a word-per-word transcription, so that we could
compare that transcription with the one we generated using the
ASR plugin. The books we used are listed below:
• Avontuur in de woestijn
• De jongen die wolf riep
• De muzikanten van Bremen
• De luie stoel
• Een hut in het bos
• Het voetbaltoneel
• Luna gaat op paardenkamp
• Pier op het feest
• Ridder muis
• Spik en Spek: Een lek in de boot
All these books are read by S. and V., who are both female and
have Dutch as their native language. Only two books, namely
"De luie stoel" and "Het voetbaltoneel", are read by S.; the others
are read by V.
In Figure 2, the number of words for each book is shown,
as well as the length of the audio files, both slow and normal
pace versions, of each book.

Fig. 2. Chart containing the size of the input text file, and the length of the input
audio files for both normal and slow pace

The difference between both transcriptions is measured in milliseconds,
word-per-word. For each word the difference between both transcriptions' start times and
stop times is calculated separately, and we take the mean over
all the words that appear in both files. We decided to calculate the average of start and stop times for each word separately
when we discovered, after careful manual inspection of the very
first results, that Sphinx-4 has the tendency to allow more pause
at the front of a word than at the end. In other words, it has the
tendency to start highlighting a word in the pause before it is
spoken, but stops the highlighting of the word more neatly after
it is said.
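To make the comparison explicit, the sketch below computes these mean differences, reusing the hypothetical WordTiming type from Section V; pairing words by list position and taking absolute differences are simplifying assumptions of this sketch (the real comparison first matches the words that appear in both files):

import java.util.List;

// Sketch of the evaluation metric: mean start and stop time difference (in
// milliseconds) between a manually created reference alignment and an
// automatically generated one. Assumes both lists contain the same words in
// the same order; absolute differences are an assumption of this sketch.
public class AlignmentMetric {
    public static double[] meanStartStopDiff(List<WordTiming> reference,
                                             List<WordTiming> generated) {
        int n = Math.min(reference.size(), generated.size());
        if (n == 0) {
            return new double[] { 0.0, 0.0 };
        }
        double startSum = 0.0;
        double stopSum = 0.0;
        for (int i = 0; i < n; i++) {
            startSum += Math.abs(reference.get(i).startMs - generated.get(i).startMs);
            stopSum  += Math.abs(reference.get(i).stopMs  - generated.get(i).stopMs);
        }
        return new double[] { startSum / n, stopSum / n };
    }
}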
Figure 3 shows the average difference of the start and stop
times for each word between the files provided by PlayLane
and the automatically generated transcription produced by our
application. Six out of the eight books read by V. have timings
that are synchronized with, on average, less than
one second of difference between the output from our system
and the timings provided by PlayLane.

Fig. 3. The mean start and stop time differences between the automatically
generated alignment and the PlayLane timings
The two books read by S. have the highest average start and
stop time difference, which is why we have decided to train the
acoustic model more on her voice, see Figure 5.
We also thought it might be interesting to know how well our
ASR system performs when there are words missing from the
input text. We therefore decided to remove the word “muis”
from the “Ridder Muis” input text. The results can be found in
Figure 4, and clearly show that the most accurate synchronisation results are to be achieved when the input text file represents
the actually spoken text as well as possible.

Fig. 4. The mean start and stop, and maximum time differences between the
automatically generated alignment and the PlayLane timings for the normal
pace "Ridder Muis" book, with the word "muis" missing from the input text
As mentioned before, we wanted to train our acoustic model
on S.'s voice, as it seemed to yield the worst results for the alignment task. We trained the acoustic model on a book called "Wolf
heeft jeuk”, which is also read by S. but is not part of the test
data. We trained it on the last chapter of the book, which contained 13 sentences with a total of 66 words, covering 24 seconds of audio. And we also trained the original acoustic model
on a part of the book “De luie stoel”. We used 23 sentences
with a total of 95 words, covering 29 seconds of audio from this
book. The alignment results we achieved when using the trained
acoustic models can be seen in Figure 5.
Fig. 5. Mean start time difference of each normal pace book, using the original acoustic model, the acoustic model trained on "Wolf heeft jeuk", or the
acoustic model trained on "De luie stoel"
Figure 5 shows some definite improvements for both books
read by S., which is as we expected, though the mean time difference is still almost 20 seconds for “De luie stoel” and over
40 seconds for “Het voetbaltoneel” for the alignment results we
achieved using the “Wolf heeft jeuk” acoustic model. The average time difference for the book “De luie stoel” is only around
300 milliseconds when we perform the alignment task using the
acoustic model trained on “De luie stoel”; and also the alignment for “Het voetbaltoneel” has improved in comparison to
when we use the model trained on “Wolf heeft jeuk”, with only
around 30 seconds of average time difference instead of 40.
This means the alignment results with the acoustic model
trained on “Wolf heeft jeuk” are still not acceptable, but show
promising results for the acoustic model if it were to be further
trained on that book. The alignment results achieved by using
the acoustic model trained on the book "De luie stoel" are near perfect for that book, and also provide an improvement for the book
"Het voetbaltoneel". This leads us to believe that further training an acoustic model on S.'s voice will achieve much improved
alignment results for books read by S.
Of the books that are read by V., some have around the same
accuracy with the newly trained acoustic models as with the old
one, while others have better accuracy. But, as four out of the eight
books have worse accuracy, we can conclude that, in general,
the acoustic model trained on S.'s voice has a negative influence on
the accuracy for the books read by V.
VII. CONCLUSION AND FUTURE WORK
The goal of this dissertation was to investigate whether performing an alignment task automatically, instead of manually,
lies within the realm of the possible. Therefore, we created
a software application that provides its user with the option to
simply switch out different ASR systems, via the use of plugins.
We provide extra flexibility for our application by offering two
different output formats (a general subtitle file, and an EPUB
file), and by making the creation of a new output format as simple as possible.
From the results in the previous chapter, using the ASR plugin
CMU Sphinx, we conclude that it is indeed possible to automatically generate an alignment of audio and text that is accurate
enough for use (e.g., our test results have on average less than
one second of difference between the automatic alignment results and a pre-existing baseline).
However, there is still work to be done, especially for undersourced languages, such as Dutch. We achieved positive results
when training the acoustic model on (less than 60 seconds of)
audio data that corresponded with the person or type of book
we wanted to increase alignment accuracy for. Our first remark
for future work is then to further train the acoustic model for
Dutch, especially when one has a clearly defined type of alignment task to perform. Considering it can take days to manually
align an audiobook, this small effort to train an acoustic model
definitely appears to be highly beneficial, keeping in mind the
gain in time one might achieve when automatically generating
an accurate automatic alignment. The trained model could also
achieve accurate results on multiple books, meaning that it is
not necessary to train an acoustic model for every new alignment task.
We also note that the accuracy of the input text and the pronunciation dictionary coverage highly influences the accuracy
of the alignment output. From our tests, we can conclude that it
is best not to have words missing from the input text or the pronunciation dictionary. There is a clear need for a more robust
system, with fewer unexplained outlying results. We propose a
way to increase robustness for our application by comparing the
alignment results created by two, or more, different ASR plugins. The overlapping results, within a certain error range, can
be considered ‘correct’. This approach is based on the approach
followed in [20].
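As an illustration of this cross-checking idea, the sketch below keeps only the words for which two independently generated alignments agree within a tolerance; the index-based pairing and the tolerance parameter are assumptions of our own, and this is not an implementation of the approach in [20]:

import java.util.ArrayList;
import java.util.List;

// Sketch: accept a word timing only when two different ASR plugins agree on
// it within 'toleranceMs'. Pairing by index is an illustrative simplification.
public class AlignmentCrossCheck {
    public static List<WordTiming> agreeing(List<WordTiming> a, List<WordTiming> b,
                                            long toleranceMs) {
        List<WordTiming> accepted = new ArrayList<>();
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            WordTiming x = a.get(i);
            WordTiming y = b.get(i);
            if (Math.abs(x.startMs - y.startMs) <= toleranceMs
                    && Math.abs(x.stopMs - y.stopMs) <= toleranceMs) {
                accepted.add(x); // both plugins agree; treat this timing as 'correct'
            }
        }
        return accepted;
    }
}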
It is our belief that the system we designed provides a flexible
approach to speech-text alignment and, as it can be adapted to
the user’s preferred ASR system, might be to the benefit of users
that previously performed the alignment task manually.
REFERENCES

[1] Carnegie Mellon University, "CMU Sphinx Wiki," http://cmusphinx.sourceforge.net/wiki/.
[2] D. O'Shaughnessy, "Invited Paper: Automatic Speech Recognition: History, Methods and Challenges," Pattern Recognition, vol. 41, no. 10, pp. 2965–2979, 2008.
[3] L. R. Rabiner and B. H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, 1986.
[4] X. Huang, Y. Ariki, and M. Jack, Hidden Markov Models for Speech Recognition, Columbia University Press, New York, NY, USA, 1990.
[5] X. Huang, J. Baker, and R. Reddy, "A Historical Perspective of Speech Recognition," Communications of the ACM, vol. 57, no. 1, pp. 94–103, 2014.
[6] G. Oppy and D. Dowe, "The Turing Test," http://plato.stanford.edu/entries/turing-test/.
[7] D. Eldridge, "Have You Heard? Audiobooks Are Booming," Book Business: Your Source for Publishing Intelligence, vol. 17, no. 2, pp. 20–25, April 2014.
[8] C. J. Van Heerden, F. De Wet, and M. H. Davel, "Automatic Alignment of Audiobooks in Afrikaans," in PRASA 2012, CSIR International Convention Centre, Pretoria, November 2012, PRASA.
[9] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Cambridge University Engineering Department, 2006.
[10] CGN, "Corpus Gesproken Nederlands," http://lands.let.ru.nl/cgn/ehome.htm.
[11] N. V. Shmyryov, "Free Speech Database voxforge.org," http://translate.google.ca/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.dialog-21.ru%2Fdialog2008%2Fmaterials%2Fhtml%2F90.htm&sl=ru&tl=en.
[12] L. Tóth, B. Tarján, G. Sárosi, and P. Mihajlik, "Speech Recognition Experiments with Audiobooks," Acta Cybernetica, vol. 19, no. 4, pp. 695–713, Jan. 2010.
[13] L. Lamel and J.-L. Gauvain, "Speech Processing for Audio Indexing," in Advances in Natural Language Processing, B. Nordström and A. Ranta, Eds., vol. 5221 of Lecture Notes in Computer Science, pp. 4–15, Springer Berlin Heidelberg, 2008.
[14] J. Bilmes, "Lecture 2: Automatic Speech Recognition," http://melodi.ee.washington.edu/~bilmes/ee516/lecs/lec2_scribe.pdf, 2005.
[15] E. Trentin and M. Gori, "A Survey of Hybrid ANN/HMM Models for Automatic Speech Recognition," Neurocomputing, vol. 37, no. 1–4, pp. 91–126, 2001.
[16] P. Lamere, P. Kwok, W. Walker, E. Gouvêa, R. Singh, B. Raj, and P. Wolf, "Design of the CMU Sphinx-4 Decoder," in 8th European Conference on Speech Communication and Technology (Eurospeech), 2003.
[17] Carnegie Mellon University, "CMU Sphinx Forum," http://cmusphinx.sourceforge.net/wiki/communicate/.
[18] SoX, "SoX Sound eXchange," http://sox.sourceforge.net/.
[19] D. Ho, "Notepad++ Editor," http://notepad-plus-plus.org/.
[20] B. De Meester, R. Verborgh, P. Pauwels, W. De Neve, E. Mannens, and R. Van de Walle, "Improving Multimedia Analysis Through Semantic Integration of Services," in 4th FTRA International Conference on Advanced IT, Engineering and Management, Abstracts, 2014, p. 2, Future Technology Research Association (FTRA).
Ontwerp van een Plugin-gebaseerde Aanpak voor
Woord-per-Woord Uitlijning Gebruikmakend van
Automatische Spraakherkenning
Joke Van den Mergele
Promotor en begeleiders: prof. dr. ir. Rik Van de Walle, dr. Wesley De Neve, ir. Tom De Nies, ir. Miel
Vander Sande
Abstract— In dit artikel presenteren we het ontwerp van een plugingebaseerd systeem om spraak-tekst uitlijning automatisch uit te voeren.
Ons doel is te onderzoeken of het mogelijk is om de uitlijning automatisch
uit te voeren, in plaats van manueel, met de huidige open-source automatische spraakherkenningssystemen. We testen onze applicatie op Nederlandse audioboeken en tekst, gebruikmakend van het CMU Sphinx [1] automatisch spraakherkenningssysteem als plugin. Nederlands is een minder
voorkomende taal op vlak van data nodig voor spraakherkenning. Uit onze
testresultaten kunnen we echter concluderen dat het inderdaad mogelijk is
om automatisch een uitlijning van audio en tekst te genereren, die accuraat genoeg is voor gebruik, met behulp van het CMU Sphinx automatisch
spraakherkenningssysteem.
Trefwoorden— automatische spraakherkenning (ASR, automatic speech
recognition), CMU Sphinx, Nederlandse audio, spraak-tekst uitlijning
I. I NTRODUCTIE
UTOMATISCHE spraakherkenning (ASR, automatic
speech recognition) is reeds onderworpen aan onderzoek
gedurende meer dan 50 jaar, al sinds Bell Labs hun allereerste
kleine-woordenschat spraakherkenning tests uitvoerden in de jaren ’50, met het doel om automatisch cijfers te herkennen via de
telefoon [2].
Wanneer computers groeiden aan kracht tijdens de 1960’s
werden filterbanken gecombineerd met dynamisch programmeren om de eerste praktische spraakherkenners te produceren.
Deze werkten vooral voor ge¨ısoleerde woorden, om de taak te
vergemakkelijken. In de jaren ’70 ontstond er grote vooruitgang
in commerci¨ele kleine-woordenschat applicaties via de telefoon,
door het gebruik van speciaal aangepaste hardware voor een bepaald doel. Lineair voorspellende codering (LPC, linear predictive coding) werd een dominant onderdeel van automatische
spraakherkenning, als een automatische en effectieve manier om
spraak voor te stellen.
Kern ASR methodologie is ge¨evolueerd van de expertsysteem aanpak uit de jaren ’70, die gebruik maakte van het
nasporen van spectrale resonantie (formanten), naar de moderne
statistische methode met Markov modellen, gebaseerd op een
Mel-frequentie cepstrale co¨effici¨ent (MFCC) aanpak [2]. Sinds
de 1980’s is de standaard het gebruik van verborgen Markov modellen (HMMs, hidden Markov models), deze hebben de kracht
om grote aantallen van trainingseenheden om te vormen in simpele probabilistische modellen [3, 4].
Tijdens de jaren ’90 evolueerden commerci¨ele applicaties
van ge¨ısoleerde-woorden dictee-systemen naar algemeen bruikbare gecontinueerde-spraak-systemen. ASR is grotendeels
ge¨ımplementeerd in software, bijvoorbeeld voor het gebruik
bij medische rapportering, gerechtelijk dictee en automatisering
A
van telefonische diensten.
Met de recente opneming van spraakherkenning in Apple-,
Google-, en Microsoftproducten vertoont het steeds verbeterend
vermogen van toestellen om met relatief onbegrensde multimodale dialogen om te gaan, zich duidelijk. Ondanks de resterende uitdagingen kunnen de vruchten van meerdere decennia
onderzoek en ontwikkeling in het veld van spraakherkenning nu
worden geplukt.
Zoals Huang, Baker, en Reddy [5] vermelden: “We believe
the speech community is en route to pass the Turing Test1 in
the next 40 years with the ultimate goal to match and exceed a
human’s speech recognition capability for everyday scenarios.”
Maar zelfs nu is er niet zoiets als een perfect spraakherkenningssysteem. Elk systeem heeft zijn eigen beperkingen en
voorwaarden waaronder het optimaal opereert. Hetgeen echter
de grootste invloed blijkt op hoe goed een spraakherkenningssysteem werkt, is hoe goed getraind het akoestisch model is
en voor welke bepaalde taak het is getraind [4]. Bijvoorbeeld,
een akoestisch model kan getraind zijn op e´ e´ n welbepaalde persoon (of zijn aangepast om de spraak van deze persoon beter
te herkennen), het kan getraind zijn om goed te werken op omroepspraak, op telefoongesprekken, op een bepaald accent, op
bepaalde woorden indien het systeem enkel commando’s moet
herkennen, etc. Dus, als men gebruik maakt van een automatische spraakherkenningssysteem en weet op welk soort taak het
moet worden uitgevoerd, dan kan het heel nuttig zijn om het
akoestisch model te trainen voor die taak.
Met de recente groei aan audioboeken [7] bestaat er een hele
nieuwe collectie van trainingdata, aangezien audioboeken worden opgenomen onder optimale condities, en de uitgesproken
tekst opvraagbaar is voor elk boek. Maar er is ook een andere
manier om deze audioboeken te gebruiken met een spraakherkenningssysteem. Waarom niet deze boeken uitlijnen met hun
tekstuele inhoud, gebruikmakend van een spraakherkenningssysteem, en zo het proces voor de aanmaak van digitale boeken
die zowel audio en tekstuele inhoud bevatten, te automatiseren?
In de context van dit project werkten we samen met het Playlane bedrijf (nu Cartamundi Digital2 ), die zo vriendelijk waren
1 De zegswijze “The Turing Test” wordt gebruikt om te verwijzen naar een
voorstel van Turing (1950) als een manier om om te gaan met de vraag of machines kunnen denken [6] en wordt nu uitgevoerd als een test om de ‘menselijkheid’
van een machine te bepalen.
2 We refereren verder naar het bedrijf als “Playlane” in plaats van “Cartamundi
Digital”, aangezien Cartamundi Digital vele bedrijven omvat en we enkel werken met de producten gecre¨eerd door Playlane.
om ons van voldoende testdata te voorzien voor onze applicatie. Het Playlane bedrijf digitaliseert kinderboeken en voegt
spelletjes en educatieve inhoud toe. E´en onderdeel van deze
educatieve inhoud is de uitlijning van de boektekst en een voorgelezen versie van het boek. Momenteel wordt deze uitlijning
volledig manueel gedaan. Ons doel is om bedrijven zoals Playlane, of personen die nood hebben aan de uitlijning van audio en
tekst, te voorzien van manieren om deze uitlijning automatisch
te cre¨eren, met behulp van ASR systemen.
Het vervolg van dit artikel is als volgt gestructureerd: eerst
bezorgen we, in Sectie II, inzicht in interessant gerelateerd werk
dat we tegenkwamen tijdens het onderzoek voor onze applicatie.
Daarna voorzien we een algemeen overzicht van automatische
spraakherkenningssystemen in Sectie III. We presenteren kort
hoe ze in het algemeen worden opgebouwd en uit welke componenten ze bestaan. We duiden ook de belangrijkste componenten aan wanneer men werkt met onze applicatie. In Sectie IV
vatten we de ASR plugin die we voor het testen van onze applicatie hebben gebruikt samen. Vervolgens bediscussi¨eren we, in
Sectie V, een hoog-niveau overzicht van het ontwerp van onze
applicatie en waarom we voor deze ontwerp beslissingen hebben gekozen. We geven de lezer ook een aantal richtlijnen om
de meest accurate uitlijning te bekomen wanneer ze de applicatie gebruiken. In Sectie VI stellen we de nauwkeurigheid van
de verkregen spraak-tekst uitlijning van onze applicatie voor en
wat we kunnen aanpassen om deze nauwkeurigheid te verbeteren. Tenslotte bespreken we, in Sectie VII, de conclusies die we
trokken van deze resultaten en een paar idee¨en voor de toekomst.
II. G ERELATEERD W ERK
In [8] rapporteren de auteurs de automatische uitlijning van
audioboeken in Afrikaans. Ze gebruiken een reeds bestaand
Afrikaans uitspraakwoordenboek en cre¨eren een akoestisch model van een Afrikaans spraakcorpus. Ze gebruiken het boek
“Ruiter in die Nag” van Mikro om hun akoestisch model deels te
trainen en om hun testen op uit te voeren. Hun doel is om grote
Afrikaanse audiobestanden op woord-niveau uit te lijnen, gebruikmakend van een automatisch spraakherkenningssysteem.
Ze ontwikkelden drie verschillende automatische spraakherkenningssystemen om deze te vergelijken en te ontdekken welk het
best presteert. Alle drie systemen werden gebouwd gebruikmakend van de HTK toolkit [9]. Om de nauwkeurigheid van
hun automatische uitlijningsresultaten te bepalen, vergeleken ze
de verschillen tussen de uiteindelijk uitgelijnde startposities van
elk woord met een schatting van de startposities die ze verkregen door foneemherkenning. Ze ontdekten wat de voornaamste
oorzaken van uitlijningsfouten waren:
• fouten gemaakt door de spreker, zoals aarzelingen, ontbrekende woorden, herhaalde woorden, stotteren, etc.;
• samentrekkingen veroorzaakt door snelle spraak;
• moeilijkheid in het identificeren van de startposities van heel
korte woorden (´ee´ n of twee fonemen lang); en
• een paar tekst-normalisatie fouten (bijvoorbeeld, ‘eenduisend
negehonderd’ in plaats van ‘neeentienhonderd’).
Hun uiteindelijke conclusies zijn dat het basis akoestisch model een vrij accurate uitlijning genereert voor praktische doelen,
maar dat het model dat werd getraind op het “Ruiter in die Nag”
audioboek de beste uitlijningsresultaten gaf.
De reden waarom hun onderzoek interessant is voor ons is
omdat Nederlands, net als Afrikaans, een minder voorkomende
taal is, ondanks de grote inspanningen gemaakt door het Corpus
Gesproken Nederlands (CGN) [10]. Bijvoorbeeld, de akoestische modellen van Voxforge [11] die we gebruiken, zowel voor
Engelse als Nederlandse spraakherkenning, bevatten ongeveer
40 uur spraak voor meer dan honderd sprekers voor het Engels, maar slechts 10 uur spraak voor het Nederlands. De voornaamste oorzaken van uitlijningsfouten die ze ontdekten zijn natuurlijk ook interessant voor ons onderzoek, aangezien we deze
kunnen voorleggen aan de gebruikers van onze applicatie en zo
hun aandacht erop kunnen richten. De meeste audioboeken worden echter onder professionele condities opgenomen en het is
dus onwaarschijnlijk dat er zich veel fouten door de sprekers in
de audio bevinden. De derde interessante ontdekking van hun
onderzoek was dat het trainen van het akoestisch model op een
deel van het bedoelde audioboek, de meest accurate uitlijning
genereert van al hun uitgeteste modellen. We zullen proberen
gelijkaardige resultaten te bereiken, hoewel we ons akoestisch
model ook op andere audioboeken, gelezen door dezelfde persoon, zullen trainen en dit het liefst met een boek van hetzelfde
leesniveau als het doel audioboek, aangezien deze gelijkaardige
pauzes en woordlengtes bevatten.
De auteurs van [12] testten de uitlijningsmogelijkheden van
hun spraakherkenningssysteem onder bijna ideale condities, dit
wil zeggen op audioboeken. Ze ontwikkelden ook drie verschillende akoestische modellen, e´ e´ n getraind op manuele transcripties, e´ e´ n getraind op het doel audioboek op lettergreep-niveau,
en e´ e´ n getraind op het doel audioboek op woord-niveau. Ze
trokken dezelfde conclusies als de auteurs van [8], namelijk
dat het akoestisch model getraind op het doel audioboek betere uitlijningsresultaten genereert. Ze ontdekten ook dat het
uitlijnen van audioboeken die onder optimale condities zijn opgenomen, ‘makkelijker’ is dan het uitlijnen van re¨ele spraak met
achtergrondgeluiden en ruis. Ze voerden ook tests uit met een
akoestisch model dat ofwel volledig spreker-onafhankelijk, ofwel aangepast aan en getraind op een bepaalde spreker, ofwel
volledig getraind op een bepaalde spreker was. Het is niet verwonderlijk dat ze ontdekten dat het akoestisch model dat op een
bepaalde persoon was getraind een bijna perfecte uitlijning van
een tekst door die persoon gesproken, genereerde.
Het voor ons heel erg interessante deel van dit artikel is dat
ze de gevoeligheid van een spraakherkenner voor de articulatiekenmerken en eigenaardigheden van de spreker, quantificeren.
Vergeleken met de gemiddelde nauwkeurigheidswaarde van ongeveer 74% hebben de nauwkeurigheidsresultaten van de herkenning van elke spreker een redelijk grote afwijking in beide
richtingen. Ze wijtten deze hoge afwijking in de accuraatheid
aan de gevoeligheid van het spraakherkenningssysteem voor de
sprekerstem. Het zou dus een goed idee zijn om het akoestisch
model te trainen voor elke stemacteur waarmee bedrijven zoals Playlane werken, of minstens het akoestisch model aan te
passen aan de stemmen van de stemacteurs als blijkt dat de herkenningsresultaten suboptimaal zijn.
III. AUTOMATISCHE S PRAAKHERKENNING
Het voornaamste doel van spraakherkenning is om de meest
geschikte woordenreeks te vinden, gegeven het geobserveerde
akoestische signaal. Het spraakdecoderingsprobleem bestaat
dan uit het vinden van het maximum van de waarschijnlijkheid van de woordenreeks w, gegeven signaal x, of, equivalent, het maximaliseren van de “fundamentele vergelijking van
de spraakherkenning” P r(w)f (x|w).
De meeste huidige automatische spraakherkenningssystemen
gebruiken statistische modellen. Dit betekent dat van spraak
wordt aangenomen dat het kan worden gegenereerd door een
taalmodel en akoestisch model. Het taalmodel genereert schattingen van P r(w) voor alle woorden w en hangt af van de hoogniveau beperkingen en taalkundige kennis over toegelaten woorden voor de welbepaalde taak. Het akoestisch model codeert de
boodschap w in het akoestische signaal x, dat wordt voorgesteld
door de waarschijnlijkheidsdichtheidsfunctie (probability density function) f (x|w). Het beschrijft de statistieken van de reeksen van geparametriseerde akoestische observaties in de feature
ruimte, gegeven de corresponderende geuite woorden.
De auteurs van [13] verdelen zo een spraakherkenningssysteem in verschillende componenten. De voornaamste kennisbronnen zijn het spraak- en tekstcorpus, deze representeren de
trainingdata, en het uitspraakwoordenboek. De training van het
akoestisch en taalmodel vertrouwt op de normalisatie en voorbewerking, zoals N -gram schatting en feature extractie, van de
trainingdata. Dit helpt om de lexicale veranderlijkheid te verminderen en transformeert de tekst om de gesproken taal beter
te representeren. Deze stap is echter taal-specifiek. Het omvat
regels over hoe met nummers om te gaan, woordafbrekingen,
afkortingen en acroniemen, weglatingstekens, enz.
Na de training worden het resulterende akoestisch en taalmodel gebruikt voor de eigenlijke spraakdecodering. Het input
spraaksignaal is eerst verwerkt door het akoestisch front-end,
dat normaal gezien de feature extractie uitvoert, en daarna doorgegeven aan de decoder. Met het taalmodel, akoestisch model en
uitspraakwoordenboek ter beschikking kan de decoder de eigenlijke spraakherkenning uitvoeren en geeft de spraaktranscriptie
weer aan de gebruiker.
Volgens [14] kunnen de verschillende componenten die hierboven staan beschreven, worden ingedeeld in de zogenaamde
vijf basisstappen van ASR:
1. Signaalverwerking/ Feature Extractie: Deze stap stelt het
akoestisch front-end voor. Dezelfde technieken worden ook toegepast op het spraakcorpus, voor de feature extractie. In onze
applicatie gebruiken we Mel-frequentie cepstrale co¨effici¨enten
(MFCC) [14] om feature extractie uit te voeren.
2. Akoestisch Modellering: Deze stap omvat de verschillende
acties nodig om een akoestisch model te bouwen. Verborgen
Markov modellen (HMMs) [15] vormen de manier waarop onze
akoestische modellen zijn getraind.
3. Uitspraak Modellering: Tijdens deze stap wordt het uitspraakmodel gecre¨eerd dat wordt gebruikt door de decoder.
4. taalmodellering: In deze stap wordt het taalmodel gecre¨eerd. De laatste, maar meest belangrijke stap in zijn creatie,
is de N -gram schatting.
5. Gesproken Taal-Begrip/Dialoogsystemen: Deze stap refereert naar het volledige systeem dat is gebouwd en hoe het reageert op de gebruiker.
IV. D E ASR P LUGIN G EBRUIKT IN ONZE A PPLICATIE
CMU Sphinx is de algemene term om een groep spraakherkenningssystemen ontwikkeld aan de Carnegie Mellon Universiteit (CMU), te beschrijven. Ze behelzen een reeks spraakherkenners (Sphinx-2 tot en met 4) en een akoestisch modeltrainer
(SphinxTrain).
In 2000 maakte de Sphinx groep aan de Carnegie Mellon Universiteit enkele spraakherkennercomponenten open-source, onder andere Sphinx-2 en, een jaar later, Sphinx-3. De spraak
decoders komen met akoestische modellen en voorbeeldapplicaties. De beschikbare middelen bevatten software voor het
trainen van akoestische modellen, taalmodelcompilatie en een
publiek-domein uitspraakwoordenboek voor Engels, “cmudict”
genaamd.
Het Sphinx-4 spraakherkenningssysteem [1] is de laatste versie toegevoegd aan de verschillende spraakherkenningssystemen van de Carnegie Mellon University. Het is gezamelijk ontworpen door de Carnegie Mellon Universiteit, Sun Microsystems laboratories, Mitsubishi Electric Research Labs en
Hewlett-Packard’s Cambridge Research Lab.
Het verschilt van de vroegere CMU Sphinx systemen in termen van modulariteit, flexibiliteit en algoritmische aspecten.
Het gebruikt nieuwere zoekstrategie¨en en is universeel in zijn
aanvaarding van verschillende soorten grammatica’s, taalmodellen, types akoestische modellen en feature stromen. Sphinx-4
is volledig ontwikkeld in de JavaTM programmeertaal en is dus
zeer draagbaar. Het laat ook multi-threading toe en heeft een
zeer flexibele gebruikersinterface.
Wij gebruiken de laatste toevoeging aan de Sphinx groep, namelijk Sphinx-4, in onze applicatie, maar onze Sphinx configuratie gebruikt een Sphinx-3 lader om het akoestisch model
in de decoder module in te laden. Een hoog-niveau overzicht
van de architectuur van CMU Sphinx-4 is redelijk voor de hand
liggend. De drie voornaamste componenten zijn de front-end,
de decoder en de kennisbasis. Deze zijn alle drie controleerbaar door een externe applicatie, die de inputspraak meegeeft
en de output aanpast naar het gewenste formaat. De Sphinx-4
architectuur is ontworpen met een hoge graad van modulariteit.
Alle componenten zijn onafhankelijke vervangbare softwaremodules, met uitzondering van de componenten in de kennisbasis,
en zijn geschreven in Java.
Voor meer informatie over de verschillende Sphinx-4 componenten verwijzen we naar [16] en de Sphinx-4 source code en
documentatie [1, 17].
V. O NZE A ANPAK
A. Hoog-Niveau Overzicht
Onze applicatie is volledig ontwikkeld in de JavaTM programmeertaal en bestaat uit drie aparte componenten, zie Figuur 1.
De eerste component is de Main component. Dit is waar de
meeste functionaliteit van de applicatie is gesitueerd. Het bevat
de ontleding van de opdracht prompt en selecteert de plugin, en
laadt die ook in de applicatie (aangezien de plugin zich in een
andere component bevindt, zie hieronder). Het bevat ook het
hele testraamwerk.
•
Main
Component
Plugin Knowledge
Component
in een simpel tekst formaat, zoals bijvoorbeeld .txt. Het moet
echter gecodeerd zijn in UTF-8 formaat. Dit is vaak al zo, maar
het kan gemakkelijk worden geverifieerd en toegepast in source
code bewerkers, zoals notepad++ [19]. Dit is nodig om de
eventuele speciale karakters, zoals aanhalingstekens, geaccentueerde letters, enz., in te lezen en voor te stellen.
VI. R ESULTATEN
• The second component is the Plugin Knowledge component, which contains all the functionality needed to implement the actual plugin. It gives the user the choice between two possible output formats, namely a standard subtitle file (.srt file format) and an EPUB file. This component receives the audio and text input from the Main component and passes it on to the ASR plugin component.
• The third component is where the ASR plugin is actually situated. We refer to this component as the 'plugin component', since it contains the ASR system.
We chose to split our application into these three components to keep the addition of a new plugin to the application as easy as possible. If someone wants to change the ASR system that is used, they only need to provide a link from the second component to the plugin component. Because the first and second components are split, they do not have to figure out what all the extra functionality and modules in the first component are for, since these have no impact on the ASR plugin. This also allows them to implement a new ASR plugin for our application with knowledge of the second component only.
Fig. 1. High-level overview of our application
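To make the plugin contract concrete, the following is a minimal sketch of what the link between the Plugin Knowledge component and an ASR plugin could look like in Java; the names AsrPlugin, WordTiming and align are illustrative placeholders and not the actual identifiers used in our code.

import java.io.File;
import java.util.List;

/** Hypothetical word-level timing returned by an ASR plugin (illustrative only). */
class WordTiming {
    final String word;
    final long startMillis;  // start time of the word in the audio, in milliseconds
    final long stopMillis;   // stop time of the word in the audio, in milliseconds

    WordTiming(String word, long startMillis, long stopMillis) {
        this.word = word;
        this.startMillis = startMillis;
        this.stopMillis = stopMillis;
    }
}

/** Hypothetical contract a new ASR plugin would implement in this sketch. */
interface AsrPlugin {
    /** Aligns the audio file with the text file and returns one timing per word. */
    List<WordTiming> align(File audioFile, File textFile) throws Exception;
}

With an interface of this kind, the Plugin Knowledge component would only need to pass the audio and text files to the selected plugin and convert the returned timings into the chosen output format (.srt or EPUB).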
B. Guidelines for Increasing the Accuracy of the Alignment
Below we describe a number of characteristics that the audio and text data have to satisfy so that our application and, above all, the Sphinx ASR plugin can work as accurately as possible.
• Firstly, the input audio file has to satisfy a number of characteristics: it must be monophonic, have a sampling rate of 16 kHz, and every sample must be encoded in 16 bits, in little-endian format. We use a small tool called SoX [18] to achieve this:
$> sox "inputfile" -c 1 -r 16000 -b 16 --endian little "outputfile.wav"
This tool can also be put to good use for splitting long audio files into smaller parts (an audio length of about 30 minutes is desirable to generate a good alignment).
• The input text file, which contains the text that has to be aligned with the audio file, is best supplied in a simple text format, such as .txt. It must, however, be encoded in UTF-8. This is often already the case, but it can easily be verified and applied in source code editors such as Notepad++ [19]. This is needed to read in and represent any special characters, such as quotation marks, accented letters, and so on; a small sketch of how both preprocessing steps can be automated is given after this list.
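As a rough illustration of how these two preprocessing steps could be automated, the sketch below invokes the SoX command line tool through a ProcessBuilder and reads the input text explicitly as UTF-8; the file names are placeholders and the sketch assumes that SoX is installed and available on the system path.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PreprocessSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Convert the input audio to mono, 16 kHz, 16-bit little-endian WAV using SoX.
        Process sox = new ProcessBuilder(
                "sox", "inputfile",
                "-c", "1", "-r", "16000", "-b", "16", "--endian", "little",
                "outputfile.wav")
                .inheritIO()
                .start();
        if (sox.waitFor() != 0) {
            throw new IOException("SoX conversion failed");
        }

        // Read the input text explicitly as UTF-8, so that quotation marks and
        // accented letters are decoded correctly before alignment.
        String text = new String(
                Files.readAllBytes(Paths.get("input.txt")), StandardCharsets.UTF_8);
        System.out.println("Read " + text.length() + " characters of input text.");
    }
}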
VI. RESULTS
The Playlane company provided us with 10 Dutch books that we could use to test our application. The subtitle files supplied by the Playlane company were aligned manually by Playlane's employees: they listened to the audio file and manually selected the times of each word.
To verify the accuracy of our application we needed books that already contained a word-per-word alignment, so that we could compare it with the alignment generated by our application and the ASR plugin. The books we used are listed below:
• Avontuur in de woestijn
• De jongen die wolf riep
• De muzikanten van Bremen
• De luie stoel
• Een hut in het bos
• Het voetbaltoneel
• Luna gaat op paardenkamp
• Pier op het feest
• Ridder muis
• Spik en Spek: Een lek in de boot
All of these books are read aloud by either S. or V., both women with Dutch as their mother tongue. Two books are read by S., namely "De luie stoel" and "Het voetbaltoneel"; the other eight are read by V.
Figure 2 presents the number of words in each book, as well as the duration of the audio file, both for the normal-pace reading and for the slow-pace reading of the book.
Fig. 2. Graph containing the size of the input text and the duration of the audio files
The difference between both alignments is measured in milliseconds, word per word. For each word, the difference between the start times and the difference between the stop times of both alignments are computed separately, and the average is then taken over all words that occur in both files. We decided to compute the averages of the start and stop times of each word separately when we noticed, after careful manual inspection of the first test results, that Sphinx-4 has a tendency to add more pause at the beginning of a word than at the end of a word. In other words, it tends to start marking a word in the pause before the word is spoken, but stops marking the word quite accurately after the pronunciation of the word ends.
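Stated as a formula (in our own notation, not copied from the implementation), and assuming the differences are taken as absolute values in milliseconds, the reported averages over the set W of words present in both alignments are

\Delta_{\text{start}} = \frac{1}{|W|} \sum_{w \in W} \left| t^{\text{auto}}_{\text{start}}(w) - t^{\text{ref}}_{\text{start}}(w) \right|, \qquad \Delta_{\text{stop}} = \frac{1}{|W|} \sum_{w \in W} \left| t^{\text{auto}}_{\text{stop}}(w) - t^{\text{ref}}_{\text{stop}}(w) \right|,

where t^{auto} denotes the times produced by our application and t^{ref} the manually aligned Playlane times.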
Figure 3 shows the average difference of the start and stop times for each word, between the files provided by Playlane and the automatically generated alignment of our application. Six of the eight books read by V. have times that are aligned with, on average, less than one second of difference between the output of our application and the alignment provided by Playlane.
Fig. 3. The average start and stop time differences between the automatically generated alignment and the Playlane alignment
The two books read by S. have the largest average start and stop time differences; we therefore decided to train the acoustic model further on her voice, see Figure 5.
Based on the test results obtained, we also suspected it would be interesting to know how well our ASR system performs when words are missing from the input text. The results are shown in Figure 4 and clearly demonstrate that the most accurate alignment results are obtained when the input text file reflects the spoken text as closely as possible.
Fig. 4. The average start and stop time differences, and the maximum time difference, between the automatically generated alignment and the Playlane alignment for the book "Ridder Muis", with the word "muis" missing from the input text
As mentioned before, it seemed interesting to train the acoustic model on S.'s voice, since she produced the worst results for the alignment task. We trained the acoustic model on a book called "Wolf heeft jeuk", which is also read by S. but was not part of the test data. We trained the acoustic model on the last chapter of the book, which contains 13 sentences with a total of 66 words, covering 24 seconds of audio.
We also trained the original acoustic model on a part of the book "De luie stoel". Of that book we used 23 sentences, with a total of 95 words, covering 29 seconds of audio. The alignment results we obtained when using the trained acoustic models for the alignment task can be found in Figure 5.
Fig. 5. The average start time difference for each book, using either the original acoustic model, the acoustic model trained on "Wolf heeft jeuk", or the acoustic model trained on "De luie stoel"
As can be seen in Figure 5, there is a clear improvement for both books read by S., as we had expected, although the average time difference is still about 20 seconds for the book "De luie stoel" and more than 40 seconds for "Het voetbaltoneel" for the alignment results obtained with the acoustic model trained on "Wolf heeft jeuk". When we perform the alignment with the acoustic model trained on "De luie stoel", the average time difference for the book "De luie stoel" is only 300 milliseconds. The alignment for the book "Het voetbaltoneel" is likewise improved compared to the result obtained with the acoustic model trained on "Wolf heeft jeuk", with an average time difference of only about 30 seconds instead of 40 seconds.
This means that the alignment results obtained with the acoustic model trained on "Wolf heeft jeuk" are still not usable, but they do promise good results if the acoustic model is trained further on this book. The alignment results generated using the acoustic model trained on "De luie stoel" are nearly perfect for that book, and also yield an improvement for the book "Het voetbaltoneel". This leads us to believe that further training of the acoustic model on books read by S. will produce much improved alignment results for books read by S.
Of the books read by V., some have roughly the same accuracy with the trained acoustic models as with the original model, while others have better accuracy. However, since four of the eight books have worse accuracy, we can conclude that, in general, the acoustic models trained on S.'s voice have a negative influence on the books read by V.
VII. CONCLUSION AND FUTURE WORK
The goal of this article was to investigate the possibility of performing an alignment task automatically, instead of manually. To this end, we developed a software application that offers users a simple way of choosing which ASR system they want to use, through the use of plugins. We provide extra flexibility in our application by letting the user choose between two output formats (a common subtitle format and an EPUB file format), and by keeping the creation of a new output format as easy as possible.
From the results in Section VI, using the ASR plugin CMU Sphinx, we conclude that it is indeed possible to automatically generate an alignment of audio and text that is accurate enough to be usable (for example, our test results show, on average, less than one second of difference between the automatically generated alignment and the manual baseline).
There is, however, still some work to be done, especially for less common languages such as Dutch. We obtained positive results when training the acoustic model on (less than 60 seconds of) audio data that corresponds to the person or the type of book for which we want to improve the alignment accuracy. Our first note on future work is therefore to train the acoustic model for Dutch further, especially when a clearly defined type of alignment task has to be performed. Since manually aligning an audiobook can take several days, the small effort of training the acoustic model is justified, certainly when one takes into account the time gained by aligning a book automatically and accurately. The trained model can also be used on multiple books and still give an accurate result, which means that an acoustic model does not have to be trained for every new alignment task.
We further note that the accuracy of the input text and the coverage of the pronunciation dictionary have a large influence on the accuracy of the alignment result. From our tests we can conclude that it is best to have no missing words in the input text or the pronunciation dictionary. There is a clear need for a more robust system, with fewer unexplainable outliers. We propose an approach to increase the robustness of our application by comparing the alignment results of two or more different ASR plugins. The overlapping results can, within a certain error margin, be considered 'correct'. This approach is based on the approach followed in [20].
It is our conviction that the system we have designed constitutes a flexible approach to speech-text alignment and, since it can be adapted to the user's preferred ASR system, is beneficial to users who previously performed alignment tasks manually.
REFERENCES
[1] Carnegie Mellon University, "CMU Sphinx Wiki," http://cmusphinx.sourceforge.net/wiki/.
[2] D. O'Shaughnessy, "Invited Paper: Automatic Speech Recognition: History, Methods and Challenges," Pattern Recognition, vol. 41, no. 10, pp. 2965-2979, 2008.
[3] L. R. Rabiner and B. H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, 1986.
[4] X. Huang, Y. Ariki, and M. Jack, Hidden Markov Models for Speech Recognition, Columbia University Press, New York, NY, USA, 1990.
[5] X. Huang, J. Baker, and R. Reddy, "A Historical Perspective of Speech Recognition," Communications of the ACM, vol. 57, no. 1, pp. 94-103, 2014.
[6] G. Oppy and D. Dowe, "The Turing Test," http://plato.stanford.edu/entries/turing-test/.
[7] D. Eldridge, "Have you heard? Audiobooks are booming," Book Business: your source for publishing intelligence, vol. 17, no. 2, pp. 20-25, April 2014.
[8] C. J. Van Heerden, F. De Wet, and M. H. Davel, "Automatic alignment of audiobooks in Afrikaans," in PRASA 2012, CSIR International Convention Centre, Pretoria, November 2012, PRASA.
[9] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Cambridge University Engineering Department, 2006.
[10] CGN, "Corpus Gesproken Nederlands," http://lands.let.ru.nl/cgn/ehome.htm.
[11] Shmyryov NV, "Free speech database voxforge.org," http://translate.google.ca/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.dialog-21.ru%2Fdialog2008%2Fmaterials%2Fhtml%2F90.htm&sl=ru&tl=en.
[12] L. Tóth, B. Tarján, G. Sárosi, and P. Mihajlik, "Speech recognition experiments with audiobooks," Acta Cybernetica, vol. 19, no. 4, pp. 695-713, Jan. 2010.
[13] L. Lamel and J.-L. Gauvain, "Speech Processing for Audio Indexing," in Advances in Natural Language Processing, B. Nordström and A. Ranta, Eds., vol. 5221 of Lecture Notes in Computer Science, pp. 4-15, Springer Berlin Heidelberg, 2008.
[14] J. Bilmes, "Lecture 2: Automatic Speech Recognition," http://melodi.ee.washington.edu/~bilmes/ee516/lecs/lec2_scribe.pdf, 2005.
[15] E. Trentin and M. Gori, "A Survey of Hybrid ANN/HMM Models for Automatic Speech Recognition," Neural Computing, vol. 37, no. 14, pp. 91-126, 2001.
[16] P. Lamere, P. Kwok, W. Walker, E. Gouvêa, R. Singh, B. Raj, and P. Wolf, "Design of the CMU Sphinx-4 Decoder," in 8th European Conference on Speech Communication and Technology (Eurospeech), 2003.
[17] Carnegie Mellon University, "CMU Sphinx Forum," http://cmusphinx.sourceforge.net/wiki/communicate/.
[18] SoX, "SoX Sound eXchange," http://sox.sourceforge.net/.
[19] D. Ho, "Notepad++ Editor," http://notepad-plus-plus.org/.
[20] B. De Meester, R. Verborgh, P. Pauwels, W. De Neve, E. Mannens, and R. Van de Walle, "Improving Multimedia Analysis Through Semantic Integration of Services," in 4th FTRA International Conference on Advanced IT, Engineering and Management, Abstracts, 2014, p. 2, Future Technology Research Association (FTRA).
Contents

1 Introduction
  1.1 Problem Statement
  1.2 The Playlane Company
  1.3 Related Work
  1.4 Outline
2 Automatic Speech Recognition
  2.1 History of ASR
  2.2 General Design of ASR Systems
      2.2.1 Stage 1: Signal Processing/Feature Extraction
      2.2.2 Stage 2: Acoustic Modelling
      2.2.3 Stage 3: Pronunciation Modelling
      2.2.4 Stage 4: Language Modelling
      2.2.5 Stage 5: Spoken Language Understanding/Dialogue Systems
3 Decomposition of CMU Sphinx-4
  3.1 History of CMU Sphinx
  3.2 Architecture of CMU Sphinx
      3.2.1 Front End Module
      3.2.2 Decoder Module
      3.2.3 Knowledge Base Module
      3.2.4 Work Flow of a Sphinx-4 Run
  3.3 Our CMU Sphinx Configuration
      3.3.1 Global Properties
      3.3.2 Recognizer and Decoder Components
      3.3.3 Grammar Component
      3.3.4 Acoustic Model Component
      3.3.5 Front End Component
      3.3.6 Monitors
4 Our Application
  4.1 High-Level View
  4.2 Components 2 & 3: Projects pluginsystem.plugin & pluginsystem.plugins.sphinxlongaudioaligner
  4.3 Component 1: Project pluginsystem
  4.4 Best Practices for Automatic Alignment
      4.4.1 How to add a new Plugin to our System
      4.4.2 How to run the Application
  4.5 Output Formats
      4.5.1 .srt File Format
      4.5.2 EPUB Format
5 Results and Evaluation
  5.1 Test Files
  5.2 Results
      5.2.1 Evaluation Metrics and Formulas
      5.2.2 Memory Usage and Processing Time
      5.2.3 First Results
  5.3 Meddling with the Pronunciation Dictionaries
  5.4 Accuracy of the Input Text
  5.5 A Detailed Word-per-Word Time Analysis
  5.6 Training the Acoustic Model
      5.6.1 How to Train an Acoustic Model
      5.6.2 Results with Different Acoustic Models
  5.7 The Sphinx-4.5 ASR Plugin
      5.7.1 Sphinx-4.5 with Different Acoustic Models
  5.8 Alignment Results for English Text and Audio
6 Conclusions and Future Work
A Configuration File Used for Recognizing Dutch
B Slow Pace Audio Alignment Results
C Configuration File Used by Sphinx-4.5 for Dutch Audio
D Specifications for the .wav Files Used for Training the Acoustic Model
  D.1 SoX Tool
  D.2 Audacity Software
Bibliography
Abbreviations

ANN         Artificial Neural Network
AM          Acoustic Model
ASCII       American Standard Code for Information Interchange
ASR         Automatic Speech Recognition
BP          Backpropagation
BSD         Berkeley Software Distribution
CDHMM       Continuous Density HMM
CD-GMM-HMM  Context-Dependent Gaussian Mixture Model HMM
CI          Context Independent
CMN         Cepstral Mean Normalisation
CMU         Carnegie Mellon University
DCT         Discrete Cosine Transformation
DFT         Discrete Fourier Transform
DNN         Deep Neural Network
DP          Dynamic Programming
DTW         Dynamic Time Warping
FT          Fourier Transform
FFT         Fast Fourier Transform
GMM         Gaussian Mixture Model
HMM         Hidden Markov Model
HTK         Hidden Markov Toolkit
IDFT        Inverse Discrete Fourier Transform
ISIP        Institute for Signal and Information Processing
IVR         Interactive Voice Response
LM          Language Model
LP          Linear Prediction
LPC         Linear Predictive Coding
MFCC        Mel Frequency Cepstral Coefficient
ML          Maximum-Likelihood
MLLR        Maximum-Likelihood Linear Regression
MLP         Multilayer Perceptron
MP          Maximum Likelihood Predictors
NIST        National Institute of Standards and Technology
NLP         Nonlinear Programming
OOV         Out-of-Vocabulary
PD          Pronunciation Dictionary
PDF         Probability Density Function
PLP         Perceptual Linear Prediction
PNCC        Power Normalized Cepstrum Coefficients
RASTA-PLP   Relative Spectral-PLP
RNN         Recurrent Neural Network
SD          Speaker-dependent
SI          Speaker-independent
SLU         Spoken Language Understanding
STT         Speech-to-Text
SVM         Support Vector Machine
TDNN        Time-Delay Neural Network
TTS         Text-to-Speech
URL         Uniform Resource Locator
WER         Word Error Rate
XML         eXtensible Markup Language
Chapter 1
Introduction
When Bell Labs performed their very first small-vocabulary speech recognition tests during the 1950s, they had every reason to believe automatic speech recognition research would attract a great deal of interest. Ever since, research into automatic speech recognition has been ongoing. Due to the increase in computational power of computer systems,
and the discovery of new mathematical techniques in the 1970’s and 1980’s, there were
some major improvements achieved for the then common approaches to automatic speech
recognition systems. New ideas, new techniques, or new takes on old techniques are still
being discovered that have a positive impact on speech recognition systems.
But even now, sixty years later, there is no such thing as a perfect speech recognition
system. Each system has its own limitations and conditions that allow it to perform
optimally. However, what has the biggest influence on how a speech recognition system
performs, is how well-trained the acoustic model is, and what specific task it is trained
for [31]. For example, an acoustic model can be trained on one person specifically (or
be updated to better recognize that person’s speech), it can be trained to perform well
on broadcast speech, on telephone conversations, on a certain accent, on certain words if
only commands must be recognized, etc. Thus, if a person has access to an automatic speech recognition system, and knows which task it needs to be used for, it would be very useful to train the acoustic model according to that task.
With the recent boom in audiobooks [23], a whole new set of training data is available, as audiobooks are recorded under optimal conditions and the text that is read is
obtainable for each book. But there is also another way these audiobooks might be used
by a speech recognition system. Why not align these audiobooks with their book content using a speech recognition system, and thereby automating the process of creating
digital books that contain both audio and textual content? This is our main goal in this
dissertation.
1.1 Problem Statement
To the best of our knowledge, when in need of speech-text alignment, people and companies perform this task manually, which is highly time- and work-intensive. One of the companies that uses this approach is Cartamundi Digital, previously Playlane (we will refer to the company by its old name "Playlane" from now on, as Cartamundi Digital is a big company that comprises many smaller companies, each adding their own product to the whole); see Section 1.2.
In this dissertation, an application is designed that can automatically synchronise an
audio file with a text file, by using a pre-existing ASR system. The goal is to –apart
from creating said synchronisation– keep the application as generic as possible, so as to
be able:
• to switch out different automatic speech recognition systems;
• to quickly create a new output format and add it to the application.
We believe this approach might greatly decrease the amount of time spent on manually
aligning text and audio, and thus, would provide great benefit for companies such as
Playlane.
1.2 The Playlane Company
The Playlane company creates digital picture books and games for children. They gained
international fame with their "Fundels", which are digital picture books for iPad and PC that are extended with games and educational activities. They are currently used by
hundreds of schools.
One of the educational activities that are part of "Fundels" is the ability to listen
to an audio recording of the book while watching the pictures that are relevant to the
spoken text, or even following the book’s text, which is highlighted as it is being said.
Whether the “Fundel” only shows pictures accompanying the spoken text, or highlights
at a sentence or word level depends on the reading difficulty classification of the book.
Since these books should also be usable to help children learn to read, there is also the
possibility of having the book read aloud at a slow pace (how slow also depends on the
book’s difficulty classification).
The Playlane company has already created "Fundels" for several books, ranging over multiple difficulty classifications, and has recently started incorporating books for higher
elementary school readers.
To create this part of the “Fundels”, the Playlane company hires one of the voice
actors they work with to read the entire text, both at a normal pace and at a slow pace,
under studio conditions. Then, after the audio recordings are done, one of the employees
manually aligns the audio file with the text of the book. They listen to the audio carefully
and, word by word, note down which word is said, the time when the word starts in the
audio file, and the time when the word stops. This takes them days, rather than hours,
depending on the length of the book. Being able to speed up this process would provide
a great business opportunity for the company.
1.3 Related Work
In this section we will point out some interesting articles we came upon when researching
this dissertation, focussing mainly on parts that relate to our work.
In [66], the authors report on the automatic alignment of audiobooks in Afrikaans.
They use an already existing Afrikaans pronunciation dictionary and create an acoustic
model from an Afrikaans speech corpus. They use the book “Ruiter in die Nag” by
Mikro to partly train their acoustic model, and to perform their tests on. Their goal is
to align large Afrikaans audio files at word level, using an automatic speech recognition
system. They developed three different automatic speech recognition systems to be able
to compare these and discover which performs best, all three of them are build using
the HTK toolkit [70]. The difference between the three systems lies in the acoustic
model: the first acoustic model is the baseline model, which is trained on around 90
hours of of broadband speech, the second uses a Maximum a Posteriori adaptation on
the baseline model, and the third is trained on the audiobook. To define the accuracy of
their automatic alignment results, the authors compare the difference in the final aligned
starting position of each word with an estimate of the starting position they obtained by
using phoneme recognition. They discovered that the main causes of alignment errors
are:
• speaker errors, such as hesitations, missing words, repeated words, stuttering, etc.;
• rapid speech containing contractions;
• difficulty in identifying the starting position of very short (one- or two-phoneme)
words; and,
• a few text normalization errors (e.g. ‘eenduisend negehonderd’ for ‘neeentienhonderd’).
Their final conclusions are that the baseline acoustic model does provide a fairly good
alignment for practical purposes, but that the model that was trained on the target
audiobook provided the best alignment results.
The reason their research is interesting to us is because, just as Afrikaans, Dutch
is a slightly undersourced language (though not as undersourced as Afrikaans), despite
the large efforts made by the Spoken Dutch Corpus (Corpus Gesproken Nederlands,
CGN) [16]. For example, the acoustic models of Voxforge [48] we use, both for English
and Dutch speech recognition, contain around 40 hours of speech over a hundred speakers
for English, while only 10 hours of speech for Dutch. The main causes of alignment errors
they discovered are, of course, also interesting for us to know, since we can present these
to the users of our system and create awareness. However, as the books are read under professional conditions, there are unlikely to be many speaker errors, and as they mainly work with children's books, the spoken text is also unlikely to contain many contractions. The
third interesting fact they researched was that training the acoustic model on part of the
target audiobook provides the best alignment results of their tested models. We will also
try to achieve this, however, we will train the acoustic model on other audiobooks read
by the same person, preferably a book with the same reading difficulty classification as
the target audiobook, as these have similar pauses and word lengths, see Section 5.6.2.
The authors of [63] try out the alignment capabilities of their recognition system under near-ideal conditions, i.e. on audiobooks. They also created three different acoustic
models, one trained on manual transcriptions, one trained on the audiobooks at syllable level, and one trained on the audiobooks at word level. They draw the same conclusions as the authors of [66], namely, that training the acoustic models on the target audiobooks provides better results, as well as that aligning audiobooks (which are recorded
under optimal conditions) is ‘easier’ than aligning real-life speech with background noises
or distortions. They also performed tests using acoustic models that were completely
speaker-independent, slightly adapted and trained on a specific speaker, and completely
trained on a specific speaker. It may come as no surprise that they discovered that the
acoustic model that was trained on a certain person provided almost perfect alignment of
a text spoken by that person.
However, the one part of this article that is extremely interesting to us is that they
quantify the sensitivity of a speech recognizer to the articulation characteristics and peculiarities of the speaker. Figure 1.1 shows a histogram of the phone recognition results
obtained using the MTBA Hungarian Telephone Speech Corpus, which contains recordings made from 500 people. As can be seen, the results show quite a large deviation in
both directions, compared to the average value of about 74%. They believe the reason for
this high deviation in the scores can mostly be blamed on the sensitivity of the recognizer
to the actual speaker’s voice. It would thus be a good idea to train an acoustic model
for each voice actor, or at least adapt the acoustic model we use to their voice actors
by training them on their speech, if it appears the results we achieve with our speech
recognition application are suboptimal.
Figure 1.1: The distribution of phone recognition accuracy as a function of the speaker on the MTBA corpus; figure taken from [66]
In articles [56] and [14], the authors describe their efforts to automatically generate digital talking books, how these provide interesting research possibilities, and the framework they created to build such talking books. These books are used by visually impaired people, and therefore need to conform to certain standards, the most widely used being the DAISY standard [17]. They created their own speech recognition system and verified
that, with proper recording procedures, the alignment task can be fully automated in a
very fast single-step procedure, even for a two-hour long recording. Their main goal is to
provide an application that can easily convert existing audio tapes and OCR-based digitalisation of text books into full-featured, multi-synchronised, multimodal digital books.
Although we will not be implementing the DAISY standard to conform to the accessibility restrictions with our application, we will implement the EPUB3 standard as one
of the output formats, and believe that adhering to the DAISY standard’s accessibility
restrictions would provide a good opportunity for future work.
The authors of [35] present their own speech recognition system, called SailAlign,
which is an open-source software toolkit for robust long speech-text alignment. The
authors explain that the conventional automatic speech recognition systems that use Viterbi forced alignment can often be inadequate due to mismatched audio and text
and/or noisy audio. They wish to circumvent these restrictions with SailAlign. They
demonstrate the potential use of the SailAlign system for the exploitation of audiobooks
to study read speech. The basic idea behind their system is the assumption that the long
speech-text alignment problem can be posed as a long text-text alignment problem, given
a well-performing speech recognition engine. They provide the reader with a pseudocode of the algorithm they used to construct their speech recognition system, which is
very interesting although we do not build our own speech recognition system. They
conclude that their experiments with SailAlign on the TIMIT database demonstrate
the increased robustness of the algorithm, compared to the standard Viterbi-based forced
alignment algorithm, even with imperfect transcriptions and noisy audio. SailAlign also
shows potential for the exploitation of rich spoken language resources such as collections
of audiobooks. The main differences with our approach are, first and foremost, that we do not build our own speech recognition system, and that we test our application using audiobooks instead of the TIMIT database.
Our main contributions are the plugin-based system to perform automatic alignment,
as well as performing certain tests to verify how CMU Sphinx handles a number of specific
issues, such as missing words from the pronunciation dictionary, or an inaccurate input
text.
1.4 Outline
First, in Chapter 2, we give an overview of how an automatic speech recognition system
generally works, starting off with a brief history, and discussing the different techniques
that might be used to construct an automatic speech recognition system. We then discuss
the inner workings of the automatic speech recognition system we decided to use for our
application, namely, CMU Sphinx-4, in Chapter 3. Aside from providing the reader with
a general knowledge about CMU Sphinx itself, we also present the configuration we used,
discussing each part separately.
Chapter 4 explains how our application is put together, and what needs to be done to
get it working on someone’s computer system. Then in Chapter 5 we show the results we
got with our application and, in Chapter 6, discuss these results, draw some conclusions
and explain which further work will still be needed.
Chapter 2
Automatic Speech Recognition
This chapter provides the reader with a basic knowledge of automatic speech recognition
(ASR). The main goal of ASR is to, given an input audio file, return a textual transcription
of what is said in the audio file, or to align a given text input with the given audio input
and, thus, provide a time stamp for each spoken syllable/word/sentence/...
We start off, in Section 2.1, with a brief history of the accomplishments in the field
of ASR so far. We mention the advantages of new techniques and why there was a need
for improvement. Then we go deeper into the aforementioned techniques and discuss the
different steps that are required to build an ASR system in Section 2.2, offering several
options that must be taken into account when designing each step and giving a detailed
explanation of those techniques. This section is particularly important to understand the
configuration we used for the implementation of our speech recognition system, which is
explained in Chapter 3.
2.1 History of ASR
As early as the 1950s, there was an interest in speech recognition. Bell Labs performed
small-vocabulary recognition of digits spoken over the telephone, using analogue circuitry.
As computing power grew during the 1960s, filter banks were combined with dynamic
programming to produce the first practical speech recognizers. These were mostly for
isolated words, to simplify the task. In the 1970s, much progress arose in commercial
small-vocabulary applications over the telephone, due to the use of custom special-purpose
hardware. Linear Predictive Coding (LPC) became a dominant automatic speech recognition (ASR) component, as an automatic and efficient method to represent speech (see
Section 2.2.1).
ASR focuses on simulating the human auditory and vocal processes as closely as
possible. But the difficulty of handling the immense amount of variability in speech
production (and transmission channels) led to the failure of simple if-then decision-tree
approaches to ASR for larger vocabularies [51].
Core ASR methodology has evolved from expert-system approaches in the 1970s, using
spectral resonance (formant) tracking, to the modern statistical method of Markov models
based on a Mel-Frequency Cepstral Coefficient (MFCC) approach (see Section 2.2.1). This
has remained the dominant ASR methodology since the late 1980s. LPC is, however, still
the standard today in mobile phone speech transmissions.
This decade also saw the expansion of the internet, and with that, the creation of
large widely available databases in several languages, allowing for comparative testing
and evaluation.
As [51] specifies, it became common practice to non-linearly stretch (or warp) templates to be compared, to try to synchronize similar acoustic segments in test and reference
patterns. This Dynamic Time Warping (DTW) procedure is still used today in some applications [20]. Sets of specific templates of target units, such as phonemes, would be
compared to each testing unit, and eventually the one with the closest match would be selected as the estimated label for the input unit. This led to a high computational cost, as well as difficulty in determining which and how many templates should be used in the test search.
Since then the standard has been Hidden Markov Models (HMMs) (see Section 2.2.2), in
which statistical models replace the templates, since they have the power to transform
large numbers of training units into simpler probabilistic models. Instead of seeking the
template closest to a test frame, test data is evaluated against sets of Probability Density
Functions (PDFs), selecting the PDF with the highest probability.
During the 1990s, commercial applications evolved from isolated-word dictation systems to general-purpose continuous-speech systems.
Experiments with wavelets, where the variable time-frequency tiling matches human
perception more closely, were the next step in the research towards ASR systems. But
the non-linearity of wavelets has been a major obstacle to their use [50]. Artificial Neural
Networks (ANNs) (see Section 2.2.2) and Support Vector Machines (SVMs) were introduced in ASR, but are not as versatile as HMMs. SVMs maximize the distance (called the
“margin”) between the observed data samples and the function used to classify the data.
They generalize better than ANNs, and tend to be better than most non-linear classifiers
for noisy speech. But, unlike HMMs, SVMs are essentially binary classifiers, and do not
provide a direct probability estimation, according to [51]. Thus, they need to be modified
to handle general ASR, where input is usually not just “yes” versus “no”. HMMs also do
better on problems such as temporal duration normalisation and segmentation of speech,
as basic SVMs expect a fixed-length input vector [57].
ANNs have not replaced HMMs for ASR, owing to their relative inflexibility to handle
timing variability. Among promising new approaches was the idea to focus attention on
specific patterns of both time and frequency, and not simplistically force the ASR analysis
into a frame-by-frame approach [55]. Progress occurred in the use of finite state networks,
statistical learning algorithms, discriminative training, and kernel-based methods [34].
Since the mid-90s, ASR has been largely implemented in software, e.g. for the use of
medical reporting, legal dictation, and automation of telephone services.
With the recent adoption of speech recognition in Apple, Google, and Microsoft products, the ever-improving ability of devices to handle relatively unrestricted multimodal –
i.e., consisting of a mixture of text, audio and video, to create extra meaning – dialogues,
is showing clearly. Despite the remaining challenges, the fruits of several decades of research and development into the speech recognition field can now be seen. As Huang,
Baker, and Reddy [32] said: “We believe the speech community is en route to pass the
Turing Test1 in the next 40 years with the ultimate goal to match and exceed a human’s
speech recognition capability for everyday scenarios.”
Figure 2.1: An overview of historical progress on machine speech recognition performance; figure
taken from [46]
1. The phrase The Turing Test is most properly used to refer to a proposal made by Turing (1950) as a way of dealing with the question whether machines can think [49], and is now performed as a test to determine the 'humanity' of the machine.
Figure 2.1 shows progressive word error rate (WER) reduction achieved by increasingly
better speaker-independent (SI) systems from 1988 to 2009 [2, 1]. Increasingly difficult
speech data was used for the evaluation, often after the error rate for the preceding easier
data had been dropped to a satisfactorily low level. This figure illustrates that on average,
a relative error-reduction rate of about 10% annually has been maintained through most
of these years.
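For reference, the word error rate plotted in Figure 2.1 is conventionally defined as (this standard definition is not spelled out in the figure itself)

\text{WER} = \frac{S + D + I}{N},

where S, D and I are the numbers of substituted, deleted and inserted words in the recognizer output, and N is the number of words in the reference transcription.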
The authors of [46] point out that there are two other noticeable and significant trends
that can be identified from the figure. First, dramatic performance differences exist for
noisy (due to acoustic environment distortion) and clean speech data in an otherwise
identical task (as is illustrated by the evaluation for 1995). Such differences have also been
observed by nearly all speech recognizers used in industrial laboratories, and intensive
research is continuing on to reduce the differences. Second, speech of a conversational
and casual style incurs much higher errors than any other types of speech. Acoustic
environment distortion and the casual nature in conversational speech form the basis for
two principal technical challenges in the current speech recognition technology.
2.2 General Design of ASR Systems
The main goal of speech recognition is to find the most likely word sequence, given the
observed acoustic signal. Solving the speech decoding problem, then, consists of finding
the maximum of the probability of the word sequence w given signal x, or, equivalently,
maximizing the "fundamental equation of speech recognition" Pr(w) f(x|w).
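Written out explicitly in the notation of this section, the decoding task is the maximisation

\hat{w} = \arg\max_{w} \Pr(w)\, f(x \mid w),

where the maximum is taken over all word sequences w allowed by the language model.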
Most state-of-the-art automatic speech recognition systems use statistical models.
This means that speech is assumed to be generated by a language model and an acoustic
model. The language model generates estimates of Pr(w) for all word strings w and depends on high-level constraints and linguistic knowledge about the allowed word strings for the specific task. The acoustic model encodes the message w in the acoustic signal x, which is represented by a probability density function f(x|w). It describes the statistics of sequences of parametrized acoustic observations in the feature space, given the
corresponding uttered words.
Figure 2.2 shows the main components of such a speech recognition system. The main
knowledge sources are the speech and text corpus, which represent the training data, and
the pronunciation dictionary. The training of the acoustic and language model relies on
the normalisation and preprocessing, such as N -gram estimation and feature extraction,
of the training data. This helps to reduce lexical variability and transforms the texts to
better represent the spoken language. However, this step is language specific. It includes
rules on how to process numbers, hyphenation, abbreviations and acronyms, apostrophes,
etc.
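As a small, concrete example of the N-gram estimation step (the standard maximum-likelihood estimate, given here as an illustration rather than taken from this text), the bigram probability of word w_i following word w_{i-1} would be estimated from the normalised text corpus as

\Pr(w_i \mid w_{i-1}) = \frac{\operatorname{count}(w_{i-1}, w_i)}{\operatorname{count}(w_{i-1})}.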
Figure 2.2: System diagram of a speech recognizer based on statistical models, including training
and decoding processes; figure adapted from [40]
After training, the resulting acoustic and language models are used for the actual
speech decoding. The input speech signal is first processed by the acoustic front end,
which usually performs feature extraction, and then passed on to the decoder. With the
language model, acoustic model and pronunciation dictionary at its use, the decoder is
able to perform the actual speech recognition and returns the speech transcription to the
user.
According to [6], the different parts in Figure 2.2 can be grouped into the so-called
five basic stages of ASR:
1. Signal Processing/Feature Extraction (see Section 2.2.1) This stage corresponds to the “Acoustic Front end” in Figure 2.2. The same techniques are also
used on the “Speech Corpus”, in the “Feature Extraction” step.
2. Acoustic Modelling (see Section 2.2.2) This stage encompasses the different
steps needed to build the acoustic model, such as the “HMM Training” in the figure
above.
3. Pronunciation Modelling (see Section 2.2.3) This stage creates the pronunciation dictionary, which is used by the decoder.
4. Language Modelling (see Section 2.2.4) In this stage, the language model
is created. The last, and most important, step for its creation is the “N-gram
estimation”, as can be seen in Figure 2.2.
5. Spoken Language Understanding/Dialogue Systems (see Section 2.2.5)
This stage refers to the entire system that is built and how it interacts with the user.
Table 2.1 shows the dates when some of the techniques discussed below were accepted
for use in ASR.
Advance                      Date    Impact
Linear Predictive Coding     1969    Automatic, simple speech compression
Dynamic Time Warping         1970    Reduces search while allowing temporal flexibility
Hidden Markov Model          1975    Treat both temporal and spectral variation statistically
Mel-frequency Cepstrum       1980    Improved auditory-based speech compression
Language Models              1980s   Including language redundancy improves ASR quality
Neural Networks              1980s   Excellent static nonlinear classifier
Kernel-based classifiers     1998    Better discriminative training
Dynamic Bayesian Networks    1999    More general statistical networks

Table 2.1: Major advances in ASR methodology; table taken from [51]
In some cases, the dates are approximate, as they reflect a gradual acceptance of new technology, rather than a specific breakthrough event.
2.2.1 Stage 1: Signal Processing/Feature Extraction
The first necessary step for speech recognition is the extraction of useful acoustic features
from the speech waveform. This is done by using signal processing techniques. Although
it is theoretically possible to recognize speech directly from a digitized waveform, almost
all ASR systems perform some spectral transformation on the speech signal. This is
because numerous experiments on the human auditory system and characteristics show
that the inner ear acts as a spectral analyser. It has also been concluded, via analysis
of human speech production, that humans tend to control the spectral content of their
speech much more than the phase (time) domain details of their speech waveforms, as
explained in [51].
The most widely used acoustic feature sets for ASR are Mel-Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Prediction (PLP). There are however plenty of
other options, such as Linear Predictive Coding (LPC), Relative Spectral-PLP (RASTA-PLP), Power Normalized Cepstrum Coefficients (PNCCs), etc.
Fourier Transform (FT)
The simplest spectral mapping is the Fourier transform, in the digital domain realised as
the Discrete Fourier Transform (DFT), or, in practice, the Fast Fourier Transform (FFT).
In the case of a periodic function over time (for example, a speech signal), the Fourier
transform can be simplified to the calculation of a discrete set of complex amplitudes,
called Fourier series coefficients. They represent the frequency spectrum of the original
time-domain signal. When a time-domain function is sampled to facilitate storage or
computer-processing, it is still possible to recreate a version of the original Fourier transform according to the Poisson summation formula, also known as discrete-time Fourier
transform.
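For completeness, the DFT of a length-N sampled signal x[n] is the standard transform

X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, \ldots, N-1,

whose magnitudes |X[k]| form the frequency spectrum that the later feature extraction steps operate on.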
While very useful for transforming speech into a representation more easily exploited
by an ASR system, the DFT accomplishes little data reduction, as is explained in [51].
ASR, thus, needs to compress the speech information further. The simplest way to do this
is via band-pass filtering, or “sub-bands”. ASR has typically employed fixed bandwidths
in sub-band analysis. The objective is to approximate the spectral envelope of the DFT,
while greatly reducing the number of parameters.
Linear Predictive Coding (LPC)
The Linear Predictive Coding (LPC) model compresses by about two orders of magnitude,
effectively smoothing the DFT [50].
As described in [15], the LPC coefficients make up a model of the vocal tract shape that
produced the original speech signal. A spectrum generated from these coefficients shows
the properties of the vocal tract shape without the interference of the source spectrum.
One can take the spectrum of the filter in various ways, for example by passing an impulse
through the filter and taking its DFT.
LPC is based on the simplified speech production model in Figure 2.3. It starts with
the assumption that the speech signal is produced by an excitation at the end of a tube,
as the authors of [22] specify. The glottis (the space between the vocal cords) produces
the excitation, which is characterized by its intensity (loudness) and frequency (pitch).
The vocal tract (the throat and mouth) forms the tube, which is characterized by its
resonances, called formants.
It analyses the speech signal by estimating the formants, removing their effects from
the speech signal, and estimating the intensity and frequency of the remaining excitation.
The process of removing the formants is called inverse filtering, and the remaining signal
is called the residue.
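In formula form (the standard linear prediction model, given here as an illustration), each speech sample s[n] is approximated as a weighted sum of the p previous samples, and the residue is the prediction error:

\hat{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k], \qquad e[n] = s[n] - \hat{s}[n],

where the coefficients a_k are the LPC coefficients that describe the vocal tract filter.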
The speech signal is synthesized by reversing the process: use the residue to create a
source signal, use the formants to create a filter (which represents the tube), and run the
source through the filter, resulting in speech.
In the 1970s, LPC became a dominant ASR representation, as an automatic and
efficient method to represent speech. It is still the standard today in cellphone speech
transmissions, but was replaced by the MFCC approach (see below) in the 1980s.
Figure 2.3: LPC speech production scheme
Mel-Frequency Cepstral Coefficients (MFCC)
Figure 2.4: MFCC feature extraction procedure; figure adapted from [6]
Front-end methods that use MFCC for feature extraction are based on a triangle-shaped frequency integration, the Mel filter bank [51]. Figure 2.4 shows how the MFCC
feature extraction is performed. First, the input signal, representing the speech waveform,
is passed through a Mel-scaled filter bank. Then, this passes through a low-pass filtering
bank and is downsampled. Lastly, a discrete cosine transformation (DCT) is performed
on the log-outputs from the filters, as if it were a signal.
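Written out, the final DCT step is commonly the DCT-II of the log filter-bank energies (shown here as an illustration; individual front ends may differ in the exact variant and scaling): with M Mel filters and log energies \log E_m, the cepstral coefficients are

c_n = \sum_{m=1}^{M} \log(E_m)\, \cos\!\left[\frac{\pi n}{M}\left(m - \tfrac{1}{2}\right)\right], \qquad n = 0, 1, \ldots,

of which typically only the first dozen or so coefficients are kept as features.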
This approach needs no difficult decisions to determine the features, ASR results
appear to be better than with other methods, and one may interpret the MFCCs as
roughly uncorrelated. Notwithstanding their widespread use, the authors of [51] claim
MFCCs are suboptimal. They give equal weight to high and low amplitudes in the log
spectrum, despite the well-known fact that high energy dominates perception. Thus,
when speech is corrupted by noise, the MFCCs deteriorate.
When different speakers present varied spectral patterns for the same phoneme, the
lack of interpretability of the MFCCs forces one to use simple merging of distributions to
handle different speakers [51]. A related approach, called Perceptual Linear Prediction
(PLP), employs a nonlinearly compressed power spectrum.
Perceptual Linear Prediction (PLP)
PLP [28] is a popular feature extraction method, because it is considered to be a more
noise robust solution. It applies critical band analysis for auditory modelling. As shown in
Figure 2.5, it first transforms the spectrum to the Bark scale, using a trapezoid-like filter
shape (instead of the triangle-shaped Mel filter). Then, it performs the equal loudness
pre-emphasis, to estimate the frequency-dependent volume sensitivity of hearing. After
the Inverse Discrete Fourier Transform (IDFT), the cepstral coefficients are computed
from the linear prediction (LP) coefficients.
Figure 2.5: PLP feature extraction procedure; figure adapted from [6]
Power Normalized Cepstral Coefficients (PNCC)
PNCC [37] is a recently introduced front-end technique, similar to MFCC, but where the
Mel-scale transformation is replaced by Gammatone filters, simulating the behaviour of
the cochlea (the auditory portion of the inner ear, famous for its snail shell shape). PNCC
also includes a step called medium time power bias removal to increase robustness. This
bias vector is calculated using the arithmetic to geometric mean ratio, to estimate the
quality reduction of speech caused by noise.
2.2.2 Stage 2: Acoustic Modelling
The feature extraction performed in the previous stage is important to choose and optimize
the acoustic features of the signal to, ultimately, reduce the complexity of the acoustic
model, yet still maintaining the relevant linguistic information for the speech recognition.
Acoustic modelling must take into account different sources of variability that are
present in the speech signal; namely those arising from the linguistic context, and those
from non-linguistic context such as the speaker, the acoustic environment and recording
channel. They contain a statistical representation of the phonemes (distinct sound units)
that make up a word in the dictionary.
Basically, after the feature extraction, the acoustic model decides how to model the
distribution of the feature vectors. Hidden Markov Models (HMMs) [53, 31] are the most
popular (parametric) model at the acoustic level, but there are multiple ways to model
that distribution:
Gaussian Mixture Models (GMMs)
If one has chosen the feature space well, class Probability Density Functions (PDFs) should
be both smooth and with only one mode, for example, Gaussian. However, as explained
in [51], actual PDFs in most ASR are not smooth, which has led to the widespread use
of “mixtures” of PDFs to model speech units. Such Gaussian Mixture Models (GMMs)
are typically described by a set of simple Gaussian curves (each characterized by a mean
vector –N values for an N-dimensional PDF– and an N-by-N covariance matrix), as well
as a set of weights for the contributions of each Gaussian to the overall PDF. A mixture
of Gaussians can be represented by the following formula:
\Pr(x_i) = \sum_j c_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j).
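Here each mixture component \mathcal{N}(x \mid \mu_j, \Sigma_j) is a multivariate Gaussian density with the standard definition

\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right),

where d is the dimensionality of the feature vector x and the weights c_j sum to one.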
ASR often makes an assumption that parameters are uncorrelated, which allows use
of simpler matrices, despite clear evidence that the dimensions are nearly always correlated. This usually leads to poorer, but faster, recognition. As diagonal covariance
matrices greatly oversimplify reality, there are recent compromises that constrain the inverse covariance matrices in various ways: so-called semitied and subspace-constrained
GMMs [4].
Hidden Markov Models (HMMs)
An HMM is a pair of stochastic processes: a hidden Markov chain and an observable
process, which is a probabilistic function of the states of the chain. This means that
observable events in the real world, modelled with probability distributions, are the observable part of the model, associated with individual states of a discrete-time, first-order
Markovian process. The semantics of the model are usually encapsulated in the hidden
part.
If every phoneme is regarded as one of the hidden states, and every feature vector is
regarded as a possible observation state, the entire speech process can be represented as
an HMM.
An HMM is defined by [64]:
1. A set S of N states, S = {x_1, ..., x_N}, which are the distinct values that the discrete, hidden stochastic process can take.
2. An initial state probability distribution, i.e. \pi = {\Pr(x_i \mid t = 0), x_i \in S}, where t is a discrete time index.
3. A probability distribution that characterizes the allowed transitions between states, that is a_{q_t, q_{t-1}} = {\Pr(x_t = q_t \mid x_{t-1} = q_{t-1}), x_t \in S, x_{t-1} \in S}, where the transition probabilities a_{q_t, q_{t-1}} are assumed to be independent of time t.
4. An observation or feature space F, which is a discrete or continuous universe of all possible observable events (usually a subset of R^d, where d is the dimensionality of the observations).
5. A set of probability distributions (referred to as emission or output probabilities) that describes the statistical properties of the observations for each state of the model: b_k = {b_i(k) = \Pr(k \mid x_i), x_i \in S, k \in F}.
It can be represented by this formula:
Pr(x_{0:T}) = Σ_{q_{0:T}} π_{q_0} ∏_{t=0}^{T−1} a_{q_t, q_{t+1}} ∏_{t=0}^{T} p(x_t | q_t, λ).
HMM parameters are learned from data; the most popular training algorithms are the forward-backward (or Baum-Welch) and the Viterbi algorithms [52]. Both of these algorithms are based on the general Maximum-Likelihood (ML) criterion and, when continuous emission probabilities are considered, aim at maximizing the probability of the samples, given the model at hand. The Viterbi algorithm, specifically, concentrates solely on the most likely path among all the possible sequences of states in the model.
As specified in [64], once training has been accomplished, the HMM can be used for
decoding or recognition. Whenever N different HMMs (corresponding to models of N
different events or classes defined in the feature space) are used, decoding (classification)
means assigning each new sequence of observations to the most likely model. When a single HMM is used, decoding (recognition) means finding the most likely path of states within the model and assigning each individual observation to a given state within the model.
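As an illustration of the Viterbi algorithm mentioned above, the following sketch decodes the most likely state sequence for a small discrete HMM. It is a didactic example, not the Sphinx-4 implementation; all quantities are kept in the log domain to avoid underflow.

// Illustrative Viterbi decoder (not Sphinx-4 code) for a discrete HMM with
// initial log-probabilities logPi, transition matrix logA[i][j] and emission
// matrix logB[state][symbol]; obs is the observed symbol sequence.
public final class Viterbi {

    static int[] decode(double[] logPi, double[][] logA, double[][] logB, int[] obs) {
        int n = logPi.length, len = obs.length;
        double[][] delta = new double[len][n]; // best log-score ending in each state
        int[][] psi = new int[len][n];         // back-pointers for the best predecessor

        for (int i = 0; i < n; i++) {
            delta[0][i] = logPi[i] + logB[i][obs[0]];
        }
        for (int t = 1; t < len; t++) {
            for (int j = 0; j < n; j++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < n; i++) {
                    double score = delta[t - 1][i] + logA[i][j];
                    if (score > best) { best = score; arg = i; }
                }
                delta[t][j] = best + logB[j][obs[t]];
                psi[t][j] = arg;
            }
        }
        // backtrack from the best final state
        int[] path = new int[len];
        double best = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            if (delta[len - 1][i] > best) { best = delta[len - 1][i]; path[len - 1] = i; }
        }
        for (int t = len - 1; t > 0; t--) {
            path[t - 1] = psi[t][path[t]];
        }
        return path;
    }
}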
A major distinction has to be made between discrete HMMs and continuous density
HMMs (CDHMMs). According to [64], the first type uses discrete probability distributions to model the emission probabilities. They require a quantisation of a continuous
input space. CDHMMs, however, use continuous PDFs (usually referred to as likelihoods)
to describe statistics of the acoustic features within the HMM states, and are usually best
suited for very difficult ASR tasks, since they exhibit better modelling accuracy. (Gaussians, or mixtures of Gaussian components, are the most popular and effective choices of
PDFs for CDHMMs).
Although HMMs are effective approaches to the problem of acoustic modelling in ASR,
allowing for good recognition performance under many circumstances, the authors of [64]
explain, they also suffer from some limitations. Standard CDHMMs present poor discriminative power among different models, since they are based on the maximum-likelihood
(ML) criterion, which is in itself non-discriminative. Classical HMMs rely strongly on assumptions of the statistical properties of the problem. Also, generalizing the basic HMM
to allow for Markov models of order higher than one does raise ASR accuracy, but the computational complexity of such models limits their implementability in hardware. All
of the above limitations have driven researchers towards a hybrid solution, using neural
networks and HMMs. Artificial Neural Networks (ANN), with their discriminative training, capability to perform non-parametric estimation over whole sequences of patterns,
and limited number of parameters, definitely appeared promising.
Neural Networks
From the late 1980s onwards, researchers began to use Artificial Neural Networks (ANN)
for ASR. Neural nets were expected to carry out the recognition task, when discriminatively trained on acoustic features.
To take the temporal dependencies typical for speech signals into account, two major classes of neural networks were proposed, namely time-delay neural networks (TDNNs)
and recurrent neural networks (RNNs), as specified in [64]. Time-delay neural networks [68],
also known as tapped delay lines, represent an effective attempt to train a static multilayer
perceptron (MP) [7] for time-sequence processing, by converting the temporal sequence
into a spatial sequence over corresponding units. The idea was applied in a variety of
ASR applications, mostly for phoneme recognition [67].
Recurrent neural networks (RNNs) provide a powerful extension of feed-forward connectionist models by allowing connections between arbitrary pairs of units, independently of their position within the topology of the network.
In spite of their ability to classify short-time acoustic-phonetic units, ANNs failed as a general framework for ASR, the authors of [64] explain, especially with long sequences of acoustic observations, like those required to represent words from a dictionary, or whole sentences. The authors of [64] claim this is mainly due to the limited ability of ANNs to model long-term dependencies. In the early 1990s, this led to the idea of combining HMMs and ANNs in a single model, the hybrid ANN/HMM. This hybrid model
relies on an underlying HMM structure, capable of modelling long-term dependencies, and
integrates ANNs to provide non-parametric universal approximation, probability estimation, discriminative training algorithms, fewer parameters to estimate than usually required
for HMMs, efficient computation of outputs at recognition time, and efficient hardware
implementability. Different hybrid architectures and training/decoding algorithms have
been researched, dependent on the nature of the ASR task, the type of HMM used, or the
specific role of the ANN in the hybrid system [24, 42, 45, 47, 27, 61, 5]. The hybrid approach often allowed for significant improvement in performance with respect to standard
approaches to difficult ASR tasks.
Unlike standard HMMs, which have a consolidated and homogeneous theoretical framework, hybrid ANN/HMM systems are a fairly recent research field with no unified formulation. The proposed ANN/HMM hybrid architectures can, however, be divided into five categories, according to [64]:
1. Early Attempts;
2. ANNs to estimate the HMM state-posterior probabilities;
3. Global optimisation;
4. Networks as vector quantizers for discrete HMM;
5. Other approaches.
In spite of the promise for large-vocabulary speech recognition that ANN/HMM
showed, it has not been adopted by commercial speech recognition solutions. After the
invention of discriminative training, which refines the model and improves accuracy, the
conventional, context-dependent Gaussian mixture model HMMs (CD-GMM-HMMs) outperformed the ANN/HMM models when it came to large-vocabulary speech recognition.
Recently, however, there has been a renewed interest in neural networks, namely Deep
Neural Networks (DNNs). They offer learned feature representation, and overcome the
inefficiency in data representation by the GMM, and thus, can replace the GMM directly.
Deep learning can also be used to learn powerful discriminative features for a traditional
HMM speech recognition system. The advantage of this hybrid system is that decades of
speech recognition technologies, developed by speech recognition researchers, can be used
directly. A combination of DNN and HMM produced significant error reduction [29, 38,
69, 18, 71] in comparison to some of the ANN/HMM efforts discussed above. In this new
hybrid system, the speech classes for DNN are typically represented by tied HMM states,
which is a technique directly inherited from earlier speech systems [33].
2.2.3
Stage 3: Pronunciation Modelling
The pronunciation lexicon is the link between the representation at the acoustic level (see
Section 2.2.2), and at the word level (see Section 2.2.4).
Generally, there are multiple possible pronunciations for a word in continuous speech.
At the lexical and pronunciation level, two main sources of variability are the dialect
and individual preferences of the speaker, see, for example, Figure 2.6 which shows the
possible pronunciations for the word ‘and’.
Figure 2.6: Possible pronunciations of the word ‘and’ (built from the phones æ, ə, n, d); figure adapted from [6]
Figure 2.7: Steps in pronunciation modelling (word sequence → pronunciation lexicon/pronunciation model → phone sequence → phonetic decision tree → state/model sequence); figure adapted from [54]
The purpose of text collection is to learn how a language is written, so that a language model (see Section 2.2.4) may be constructed. The purpose of audio collection is
to learn how a language is spoken, so that acoustic models (see Section 2.2.2) may be
constructed. The point of pronunciation modelling is to connect these two realms, as
shown in Figure 2.7.
According to [54], the pronunciation model consists of these components:
1. A definition of the elemental sounds in a language. Many possible pronunciation
units may be used to model all existing ways to pronounce a word.
• Phonemes and their realisations (phones): this unit is most commonly used;
• Syllables: this is an intermediate unit, between phones and word level;
• Individual articulatory gestures.
Figure 2.8: Links between pronunciation dictionary, audio and text; figure adapted from [54]

2. A dictionary that describes how words in a language are pronounced. The pronunciation dictionary (PD) is the link between the acoustic model and the language model, see Figure 2.8. It is basically a lexicon and provides the way to connect
acoustic models (HMMs) to words. For example, in a simple architecture, each
HMM state represents one phoneme in the language, a word is represented as a
sequence of phonemes, and dictionaries come with different phone sets (for the English language this is usually between 38 and 45 different phonemes). CMU DICT
defines 39 separate phones (38 plus one for silence (SIL)) for American English [8].
If the pronunciations in the dictionary deviate sharply from the spoken language, one creates a model with great variance, and similar data with essentially the same phonemes is distributed across different models.
3. Post-lexical rules for altering pronunciation of words spoken in context.
Lexical coverage has a large impact on the recognition performance, and the accuracy of the acoustic models is linked to the consistency of the pronunciations in the lexicon. This is why the pronunciation vocabulary is usually selected to maximize lexical coverage for a given lexicon size. Each out-of-vocabulary (OOV) word causes more than a single error (usually between 1.5 and 2 errors) on average, so word list selection is an important design step.
2.2.4
Stage 4: Language Modelling
Prior to the 1980s, ASR only used acoustic information to evaluate text hypotheses. It was
then noted that incorporating knowledge about the text being spoken would significantly
raise ASR accuracy, and language models (LMs) were included in ASR design.
As mentioned before, the language model generates estimates of P r(w) for all word
strings w and depends on high-level constraints and linguistic knowledge about the allowed
word strings for a specific task.
Given a history of prior words in an utterance, the number P of words that one must
consider as possibly coming next is much smaller than the vocabulary size V . P is called
the perplexity of a language model. LMs are stochastic descriptions of text-likelihoods of
local sequences containing N consecutive words in training texts (typically N = 1, 2, 3, 4).
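For completeness, a standard textbook definition of the perplexity of a language model over a test word sequence w_1, ..., w_K (added here for reference, not quoted from the cited sources) is

PP(w_1, ..., w_K) = Pr(w_1, ..., w_K)^(−1/K),

which can be interpreted as the average number of equally likely words the recognizer has to choose from after each word.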
Integrating an LM with the normal acoustic HMMs is now common practice in ASR.
As described in [51], N -gram models typically estimate the likelihood of each word,
given the context of the preceding N − 1 words. These probabilities are obtained through
analysis of a lot of text, and capture both syntactic and semantic redundancies in text.
As the vocabulary size V increases for practical ASR, the number of possible N-grams (V^N) quickly becomes enormous. Large lexicons lead to seriously under-trained LMs, a shortage of appropriate training texts, increased memory needs, and a lack of computational power to search all textual possibilities. As a result, most ASR systems do not employ N-grams with N higher than four. While 3- and 4-gram LMs are the most widely used, class-based N-grams and adapted LMs are recent research areas trying to improve LM accuracy.
Some techniques have been adopted to solve the data sparsity problem, such as smoothing and back-off. Back-off methods fall back on lower-order statistics when higher-order
N -grams do not occur in the training texts [36]. Grammar constraints are often imposed
on LMs, and LMs may be refined in terms of parts-of-speech classes. They can also be
designed for specific tasks.
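The following toy example illustrates the idea of an N-gram estimate with back-off for the bigram case. It is a deliberately simplified sketch and not a Sphinx component; real language models use discounted counts (e.g. Katz or Kneser-Ney smoothing), and the back-off weight of 0.4 below is an arbitrary illustration.

// Toy sketch (not a Sphinx component): maximum-likelihood bigram estimate
// with a crude back-off to the unigram estimate when a bigram was never
// seen in the training text.
import java.util.HashMap;
import java.util.Map;

public final class ToyBigramLm {
    private final Map<String, Integer> unigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();
    private int total = 0;

    void train(String[] words) {
        for (int i = 0; i < words.length; i++) {
            unigrams.merge(words[i], 1, Integer::sum);
            total++;
            if (i > 0) bigrams.merge(words[i - 1] + " " + words[i], 1, Integer::sum);
        }
    }

    double probability(String previous, String word) {
        Integer pair = bigrams.get(previous + " " + word);
        Integer prev = unigrams.get(previous);
        if (pair != null && prev != null) {
            return (double) pair / prev;      // relative bigram frequency
        }
        Integer uni = unigrams.get(word);
        double unigram = (uni == null) ? 1.0 / (total + 1) : (double) uni / total;
        return 0.4 * unigram;                 // back off to the unigram estimate
    }
}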
Combining the acoustic model, the pronunciation model and the language model,
HMMs have been widely used in ASR:
• Different utterances will have a different length, e.g. stop consonants (‘k’, ‘g’, ‘p’,..)
are always short, whereas vowels will generally be longer.
• Ways of comparing variable length features:
– Earlier solution: Dynamic Time Warping (DTW)
– Modern solution: Hidden Markov Models
The Hidden Markov Model is illustrated in Figure 2.9. The parameters a_{ij} are the transition probabilities from state i to state j. The observation probability b_j(o_t) represents the output probability of observation o_t given the state j. Since the HMM states j are not observed, they are called “hidden” states.
Figure 2.9: Hidden Markov Model; figure adapted from [70]
Modelling with HMMs is a multi-level classification problem. According to [6], from
the highest level to the lower levels, it can be described as:
W (word) → A(acoustic unit) → Q(HMM state) → X(acoustic observation)
p(x | w) = Σ_a Σ_q p(x, q, a | w) = Σ_a Σ_q p(x | q) p(q | a) p(a | w)
where a is the phone, q is the HMM state, and w is the word. Speech recognition requires
a large search space to search for the best sequences (a, q, w), as in:
(a, q, w)* = argmax_{a,q,w} p(x, q, a, w)
Therefore one wants to maximize the joint probability using Viterbi decoding or stack
decoding techniques.
An important sub-field of ASR concerns determining whether speakers have said something beyond an acceptable vocabulary, i.e. out-of-vocabulary (OOV) detection [43]. ASR
generally searches a dictionary to estimate, for each section of an audio signal, which word
forms the best match. One does not wish to output (incorrect) words from an official dictionary when a speaker has coughed or said something beyond the accepted range [65].
It is practically important to detect such OOV conditions.
2.2.5
Stage 5: Spoken Language Understanding/Dialogue Systems
The operational definition of “language understanding” is that the speech-based human-computer interface reacts, or provides output, in such a way that the user of the speech system is satisfied and has achieved the desired goal.
In the last decade, according to [26], a variety of practical goal-oriented spoken language understanding (SLU) systems have been built for limited domains. One characterisation of these systems is the way they allow humans to interact with them. On
one end, there are machine-initiative systems, commonly known as interactive voice response (IVR) systems [62]. In IVR systems, the interaction is controlled by the machine.
Machine-initiative systems ask a user specific questions and expect the user to input one
of the predetermined keywords or phrases. For example, a Global Positioning System
may prompt the user to say “target address”, “calculate route”, “change route”, etc. In
such a system, SLU is reduced to detecting one of the allowed keywords or phrases in
the user’s utterances. On the other extreme, there are user-initiative systems in which a
user controls the flow and a machine simply executes the user’s commands. A compromise in any realistic application is to develop a mixed-initiative system [25], where both
users and the system can assume control of the flow of the dialogue. Mixed-initiative
systems provide users the flexibility to ask questions and provide information in any sequence they choose. Although such systems are known to be more complex, they have
proven to provide more natural human/machine interaction. If the output is speech,
e.g. for Text-to-Speech (TTS) systems, the spoken language dialogue system needs to
respond naturally. It needs to have discourse modelling and generate appropriate textual
responses, or produce natural, pleasant sounding synthetic speech.
The most important techniques to remember from this chapter are Mel-frequency
cepstral coefficients, which are used for the feature extraction and spectral analysis in our
ASR configuration (see Section 3.3.5), and hidden Markov models, which are used for the
acoustic model (see Section 3.3.4).
Chapter 3
Decomposition of CMU Sphinx-4
There are many open source automatic speech recognition systems available online, such
as HTK (developed by Cambridge University) [70], Julius (Kyoto University) [39] and
the ISIP Production System (Mississippi State University) [44]. We have chosen to work
with the CMU Sphinx open source ASR system, since it is often referenced in scientific
papers, and has a large wiki [11] and active forum [10]. It also has a less restrictive license
compared to the HTK system.
The first two sections in this chapter provide a general overview of CMU Sphinx,
which is a group of speech recognition systems developed at Carnegie Mellon University
(CMU). In Section 3.1, a short history of the different Sphinx versions is given. We
discuss the advantages of each version, and in which environment they should ideally
be used. As there is no general, extensive explanation available for all of Sphinx-4’s components, we based ourselves on [41] for one component, and on the source code and documentation [11, 10] for the others. Our findings can be found in Sections 3.2
and 3.3. Section 3.2 describes a high-level architecture of Sphinx-4, the version of Sphinx
our system implements. A more detailed description of the configuration used by our
Sphinx implementation can be found in Section 3.3. This is where the techniques and
methods for speech recognition explained in Chapter 2 are applied.
3.1
History of CMU Sphinx
CMU Sphinx is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University (CMU). They include a series of speech recognizers
(Sphinx-2 through 4) and an acoustic model trainer (SphinxTrain).
In 2000, the Sphinx group at Carnegie Mellon committed to open source several speech
recognizer components, including Sphinx-2, and, a year later, Sphinx-3. The speech
decoders come with acoustic models and sample applications. The available resources
include software for acoustic model training, language model compilation and a public-domain pronunciation dictionary for English, “cmudict”.
• CMU Sphinx is a continuous-speech, speaker-independent recognition system that
uses hidden Markov acoustic models (HMMs) and an N -gram statistical language
model. It was developed by Kai-Fu Lee in 1986. Sphinx demonstrated the feasibility of continuous-speech, speaker-independent, large-vocabulary recognition, the possibility of which was in dispute at the time. This original CMU Sphinx is of historical interest only;
it has been superseded in performance by subsequent versions.
• CMU Sphinx-2 is a fast performance-oriented recognizer, originally developed by
Xuedong Huang at Carnegie Mellon and released as open source with a BSD-style
license. Sphinx-2 focuses on real-time recognition suitable for spoken language applications. It incorporates functionality such as end-pointing, partial hypothesis
generation, dynamic language model switching, and so on. It is used in dialogue systems and language learning systems. The Sphinx-2 code has also been incorporated
into a number of commercial products, but is no longer under active development
(other than for routine maintenance).
Sphinx-2 uses a semi-continuous representation for acoustic modelling (a single set
of Gaussians is used for all models, with individual models represented as a weight
vector over these Gaussians).
• Sphinx-3 adopted the prevalent continuous HMM representation and has been used
primarily for high-accuracy, non-real-time recognition. Recent developments (in
algorithms and in hardware) have made Sphinx-3 “near” real-time, although it is
not yet suitable for use in critical interactive applications. It is currently under
active development and in conjunction with SphinxTrain, it provides access to a
number of modern modelling techniques that improve recognition accuracy.
• The Sphinx-4 speech recognition system [11] is the latest addition to Carnegie Mellon University’s repository of the Sphinx speech recognition systems. It has been
jointly designed by Carnegie Mellon University, Sun Microsystems laboratories, Mitsubishi Electric Research Labs, and Hewlett-Packard’s Cambridge Research Lab.
It is different from the earlier CMU Sphinx systems in terms of modularity, flexibility
and algorithmic aspects. It uses newer search strategies, and is universal in its
acceptance of various kinds of grammars, language models, types of acoustic models
and feature streams. Algorithmic innovations included in the system design enable it
to incorporate multiple information sources in a more elegant manner as compared
to the other systems in the Sphinx family. Sphinx-4 is developed entirely in the
Java™ programming language and is thus very portable. It also enables and uses
multi-threading and permits highly flexible user interfacing.
• PocketSphinx is a version of Sphinx that can be used in embedded systems (e.g.,
based on an ARM processor, such as most portable devices)[11]. It is under active
development and incorporates features such as fixed-point arithmetic and efficient
algorithms for GMM computation.
3.2
Architecture of CMU Sphinx
Figure 3.1: High-level architecture of CMU Sphinx-4; figure adapted from [41]
The high-level architecture of CMU Sphinx-4, as seen in Figure 3.1, is fairly straightforward. The three main blocks are the front end, the decoder, and the knowledge base,
which are all controllable by an external application, which provides the input speech
and transforms the output to the desired format, if needed. The Sphinx-4 architecture
is designed with a high degree of modularity. All blocks are independently replaceable
software modules, except for the blocks within the knowledge base, and are written in
Java. (Stacked blocks indicate multiple types which can be used simultaneously.) Even
within each module shown in Figure 3.1, the code is very modular with functions that are
easy to replace.
3.2.1
Front End Module
The front end is responsible for gathering, annotating, and processing the input data
(speech signal). The annotations provided by the front end include, amongst others,
the beginning and ending of a data segment. It also extracts features from the input
data, to be read by the decoder. To do this, it can be based on Mel Frequency Cepstral
Coefficients (MFCCs) for audio signal representation (see Section 2.2.1), as are other modern
general-purpose speech recognition systems, by altering the Sphinx configuration file (see
Section 3.3).
The front end provides a set of high level classes and interfaces that are used to
perform digital signal processing for speech recognition [9]. It is modelled as a series
of data processors, each of which performs a specific signal processing function on the
incoming data, as shown in Figure 3.2. For example, a processor performs Fast-Fourier
Transform (FFT, see Section 2.2.1) on input data, another processor performs high-pass
filtering. Thus, the incoming data is transformed as it passes through each data processor.
Figure 3.2: High-level design of CMU Sphinx front end; figure adapted from the Sphinx documentation [9]
Data enters and exits the front end, and goes between the implemented data processors. The input data for the front end is typically audio data, but any input type is
allowed, such as spectra, cepstra, etc. Similarly, the output data is typically features, but
any output type is possible.
Sphinx-4 also allows for the specification of multiple front end pipelines and of multiple
instances of the same data processor in the same pipeline.
In order to obtain a front end, it must be specified in the Sphinx configuration file for
the application, see Section 3.3.
3.2.2
Decoder Module
The decoder performs the main component of speech recognition, namely the actual
recognition. It reads features received from the front end, couples them with data from
the knowledge base, provided by the application, such as the language model, information
from the pronunciation dictionary, and the structural information from the acoustic model
(or sets of parallel acoustic models), and constructs the language HMM in the graph
construction module. Then it performs a graph search to determine which phoneme
sequence would be the most likely to represent the series of features. In Sphinx-4, the
graph construction module is also called the “linguist” [41]. The term “search space” is used to describe the set of candidate phoneme sequences, and this space is dynamically updated by the decoder during the search.
Many different versions of Sphinx decoders exist. The decision about which version
to use depends on how familiar one is with C/Python (for PocketSphinx) or Java (for Sphinx-4), and how easy it is to integrate these into the system under
development. Currently there are the following choices:
• PocketSphinx is CMU’s fastest speech recognition system so far. It is a library
written in pure C-code, which is optimal for the development of C applications as
well as for the creation of language bindings. It is the most accurate engine at real
time speed, and therefore, is a good choice for live applications.
It is also a good choice for desktop applications, command and control, and dictation applications where fast response and low resource consumption are the main
constraints.
• Sphinx-4 is a state-of-the-art speech recognition system written entirely in the
Java™ programming language. It works best for implementations of complex servers or cloud-based systems with deep interaction with natural language processing (NLP) modules, web services and cloud computing. The system permits use of any
level of context in the definition of the basic sound units. One by-product of the
system’s modular design is that it becomes easy to implement it in hardware.
Some new design aspects [41], compared to Sphinx-3’s decoder, include graph construction for multilevel parallel decoding with independent simultaneous feature
streams without the use of compound HMMs, the incorporation of a generalized
search algorithm that subsumes Viterbi and full-forward decoding as special cases,
design of generalized language HMM graphs from grammars and language models
of multiple standard formats, that toggles trivially from flat search structure to tree
search structure, etc.
• Sphinx-3 is CMU’s large vocabulary speech recognition system. It is an older C-based
decoder that continues to be maintained, and is still the most accurate decoder
for large vocabulary tasks. It is now also used as a baseline to check recognizer
accuracies.
• Sphinx-2 is a fast speech recognition system, the predecessor of PocketSphinx. It
is not being actively developed at this time, but is still widely used in interactive
applications. It uses HMMs with semi-continuous output probability density functions (PDFs). Even though it is not as accurate as Sphinx-3 or Sphinx-4, it runs
at real time speeds, and is thus a good choice for live applications. This decoder is, however, obsolete and no longer supported; its use is not recommended.
3.2.3
Knowledge Base Module
The knowledge base of CMU sphinx consists of three parts: the acoustic model, the
language model, and the (pronunciation) dictionary.
Acoustic Model
The acoustic model contains a statistical representation of the distinct phonemes that
make up a word in the dictionary. Each phoneme is modelled by a sequence of states and
the observation probability distribution of sounds you might observe in that state. For a
more general, extensive description, see Section 2.2.2.
Sphinx-4 can handle any number of states, but they must be specified during training.
An acoustic model provides a phoneme-level speech structure. They can be trained for
any language, task or condition, using the SphinxTrain tool developed by CMU [13].
Pronunciation Dictionary
The decoder needs to know the pronunciation of words to perform the graph search, which
is why a pronunciation dictionary is needed. Basically, this is a list of words and their
possible pronunciations. For more information, see Section 2.2.3.
Language Model
The language model contains all words and the probability of their appearing in a certain sequence. It provides a word-level language structure.
Language models generally fall into two categories: graph-driven models, which behave like first-order Markov models and base the word probability solely on the previous word; and N-gram models, which behave like (N − 1)-th order Markov models and base the word probability on the N − 1 previous words. For a more extensive explanation of language models, see Section 2.2.4.
Sphinx-4 defaults to using trigram models.
3.2.4
Work Flow of a Sphinx-4 Run
Figure 3.3 shows the interfaces used by Sphinx-4 to perform the speech recognition task.
We now give a short description of the work flow through these interfaces; the detailed choices for these interfaces, and more information, can be found in Section 3.3.
The starting point is the audio Input, either live from a microphone or pre-recorded
in an audio file. The configuration file is used to set all the variables, see Section 3.3.1
Figure 3.3: Basic flow chart of how the components of Sphinx-4 fit together; figure adapted from [21]
for our configuration file. Most of the components the system uses are configurable Java
interfaces. The Configuration Manager loads all these options and variables, as the first
step for the Application. The FrontEnd is then constructed and generates Feature vectors
from the Input, preferably using the same process as was used during training, see Section 3.2.1. The Linguist generates the SearchGraph, for which it uses the AcousticModel,
LanguageModel and pronunciation Dictionary that are specified in the configuration file.
The Decoder then constructs the SearchManager, which, in turn, initializes the Scorer,
Pruner and ActiveList, see Section 3.2.2 for more information about the decoder. The
SearchManager can then use the Features and the SearchGraph to find the best fit path,
which represents the best transcription, and then, finally, the Result is passed back to the
Application as a series of recognized words.
It is interesting to know that, once the initial configuration is complete, the recognition
process can repeat without having to reinitialize everything.
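The following minimal example illustrates this work flow in code, following the usual Sphinx-4 usage pattern. The component names “recognizer” and “audioFileDataSource”, as well as the file names config.xml and speech.wav, are placeholders matching our configuration; an actual application will differ in its details.

import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;
import java.net.URL;

public class AlignmentDemo {
    public static void main(String[] args) throws Exception {
        // load the XML configuration described in Section 3.3
        ConfigurationManager cm =
                new ConfigurationManager(AlignmentDemo.class.getResource("config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate(); // builds the front end, linguist and decoder

        // point the front end's data source to the audio file to be processed
        AudioFileDataSource dataSource =
                (AudioFileDataSource) cm.lookup("audioFileDataSource");
        dataSource.setAudioFile(new URL("file:speech.wav"), null);

        Result result;
        while ((result = recognizer.recognize()) != null) {
            System.out.println(result.getBestResultNoFiller());
        }
        recognizer.deallocate();
    }
}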
3.3
Our CMU Sphinx Configuration
The Sphinx-4 configuration manager system has two primary purposes [9]:
• To determine which components are to be used in the system. The Sphinx-4 system
is designed to be extremely flexible. At runtime, just about any component can
be replaced with another. For example, in Sphinx-4 the “FrontEnd” component
provides acoustic features that are scored against the acoustic model. Typically,
Sphinx-4 is configured with a front end that produces Mel frequency cepstral coefficients (MFCCs, see Section 2.2.1), however it is possible to reconfigure Sphinx-4 to
use a different front end that, for instance, produces Perceptual Linear Prediction
coefficients (PLP, see Section 2.2.1).
• To determine the detailed configuration of each of these components. The Sphinx-4
system has a large number of parameters that control how the system functions. For
instance, a beam width is sometimes used to control the number of active search
paths maintained during the speech decoding. A larger value for this beam width
can sometimes yield higher recognition accuracy at the expense of longer decode
times.
The Sphinx configuration manager can be used to flexibly design the system like this,
with the use of a configuration file. This configuration file defines the names and types
of all of the components of the system, the connectivity of these components – that is,
which components talk to each other –, and the detailed configuration for each of these
components. The format of this file is XML.
The most important configuration decisions are listed below; a more complete explanation of our configuration choices can be found in the sections that follow.
• property name="logLevel" value="WARNING"
We use the WARNING level, which only provides information when something went
wrong but the system is able to continue its task, and severe errors when the system
cannot continue its task. This setting does not overwhelm us with information.
• component name="recognizer"
We are able to specify a number of monitors in the decoder module. These monitors
allow us to keep track of a certain characteristic of Sphinx during the alignment
task, e.g. the memoryTracker, and speedTracker monitors which are used by our
application and track, respectively, memory usage and processing speed.
• property name="relativeBeamWidth" value="1E-300"
The relative beam width specifies a threshold for which active search paths to keep,
based on their acoustic likelihood computation. The more negative the exponent, the more search paths are kept and the more accurate the alignment becomes, at the cost of more computational power and an increase in processing time. If processing speed and computational power are not a major concern, making the exponent more negative is recommended.
• component name="dictionary"
When working with our application, the first thing that needs to be verified is
whether the dictionary and filler path are referring to the correct location, i.e. the
location of the pronunciation dictionary and noise dictionary.
• property name="acousticModel"
For this component and properties, the location must also be verified, and if necessary changed to refer to the location of the acoustic model.
• component name="frontEnd"
The front end we use performs feature extraction using Mel-frequency cepstral coefficients. For more information about this technique, see Section 2.2.1.
In the sections below, we explain the configuration used by our system to recognize
English with the help of excerpts from the configuration file. The configuration file used
to recognize Dutch can be found in Appendix A.
3.3.1
Global Properties
<property name="logLevel" value="WARNING"/>
<property name="absoluteBeamWidth" value="-1"/>
<property name="relativeBeamWidth" value="1E-300"/>
<property name="wordInsertionProbability" value="1.0"/>
<property name="languageWeight" value="10"/>
<property name="addOOVBranch" value="true"/>
<property name="showCreations" value="false"/>
<property name="outOfGrammarProbability" value="1E-26"/>
<property name="phoneInsertionProbability" value="1E-140"/>
<property name="frontend" value="epFrontEnd"/>
<property name="recognizer" value="recognizer"/>
Figure 3.4: Global properties
Figure 3.4 shows how the global properties of the Sphinx system are defined.
All Sphinx components that need to output informational messages will use the
Sphinx-4 logger. Each message has an importance level; the minimum level that is actually output can be set with the “logLevel” property.
• The SEVERE level means an error occurred that makes continuing the operation difficult or impossible; this is the highest importance level.
• WARNING means that something went wrong but the system is still able to continue, e.g. when a word is missing from the pronunciation dictionary.
• INFO provides general information.
• The CONFIG level means that information about a component’s configuration will
be outputted.
• FINE, FINER, and FINEST (which is the lowest logging level) provide fine grained
tracing messages.
Our system uses the WARNING level, which does not overwhelm us with information,
but still allows us to know what is happening during the execution of the application,
and, most importantly, flags a warning when a word is missing from the dictionary, so it
can be added.
The other global properties defined here are used to specify values throughout the Sphinx configuration, and are explained below, where the configuration makes use of them.
3.3.2
Recognizer and Decoder Components
<component name="recognizer" type="edu.cmu.sphinx.recognizer
.Recognizer">
<property name="decoder" value="decoder"/>
<propertylist name="monitors">
<item>memoryTracker </item>
<item>speedTracker </item>
</propertylist>
</component>
<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
<property name="searchManager" value="searchManager"/>
</component>
<component name="searchManager"
type="edu.cmu.sphinx.decoder.search.AlignerSearchManager">
<property name="logMath" value="logMath"/>
<property name="linguist" value="aflatLinguist"/>
<property name="pruner" value="trivialPruner"/>
<property name="scorer" value="threadedScorer"/>
<property name="activeListFactory" value="activeList"/>
</component>
Figure 3.5: Recognizer and Decoder components
Figure 3.5 shows the general properties for the decoder. The recognizer [9] contains the
main components of Sphinx-4 (front end, linguist, and decoder). Most interaction from
the application to the internal Sphinx-4 system happens through the recognizer. This is
also where some monitors can be specified, to keep track of speed, accuracy, memory use,
etc.
The decoder [9] contains the search manager, which performs the graph search using
a certain algorithm, e.g. breadth-first search, best-first search, depth-first search, etc.
It also contains the feature scorer and the pruner. The specific details of its components
are described below.
Active List Component
<component name="activeList"
type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
<property name="logMath" value="logMath"/>
<property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
<property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>
Figure 3.6: ActiveList component
The active list component configuration is shown in Figure 3.6. The active list is a list
of tokens that represent all the states in the search graph that are active in the current
feature frame. Our configuration consists of a “PartitionActiveListFactory” [9] which
produces a “PartitionActiveList” object. This will partition the list of tokens according
to the absolute beam width.
The absolute beam width limits the number of elements in the active list. It controls
the number of active search paths maintained during the pruning stage of the speech
decoding. At each frame, if there are more paths than the specified absolute beam width
value, then only the best ones are kept and the rest are discarded. A larger value for
this beam width can sometimes yield higher recognition accuracy at the expense of longer
decoding times. We set a value of -1, which means an unbounded list, as is the norm,
because the relative beam width does a good job at pruning the list.
The relative beam width is used to create a threshold for which tokens to keep, based on their acoustic likelihood computation. Anything scoring less than the relative beam width value multiplied by the best score is pruned. The relative beam width uses a negative exponent to represent a very small fraction: the more negative the exponent, the fewer search paths are discarded and the more accurate the recognition, but again, at the trade-off of increased search time.
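Conceptually, the pruning step can be sketched as follows (an illustration only, not the Sphinx-4 implementation): because scores are kept in the log domain, multiplying by the relative beam width becomes adding its logarithm.

// Illustrative relative-beam pruning: tokens scoring below the threshold
// (best score plus the log of the relative beam width) are discarded.
import java.util.ArrayList;
import java.util.List;

final class BeamPruning {
    static List<Double> prune(List<Double> logScores, double logRelativeBeamWidth) {
        double best = Double.NEGATIVE_INFINITY;
        for (double s : logScores) best = Math.max(best, s);
        double threshold = best + logRelativeBeamWidth; // e.g. log(1E-300)
        List<Double> kept = new ArrayList<>();
        for (double s : logScores) {
            if (s >= threshold) kept.add(s);
        }
        return kept;
    }
}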
Pruner and Scorer Components
<component name="trivialPruner"
type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>
<component name="threadedScorer"
type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
<property name="frontend" value="${frontend}"/>
</component>
Figure 3.7: Pruner and Scorer configurations
The pruner is responsible for pruning the active list according to certain strategies. The “SimplePruner” [9] that is used in our configuration performs the default pruning behaviour and invokes the purge on the active list.
The scorer scores the current feature frame against all active states in the active list,
which is why it has access to the front end. (See Section 3.3.2 for more information about
the values used for scoring.) The “ThreadedAcousticScorer” [9] is an acoustic scorer that
breaks the scoring up into a configurable number of separate threads. All scores are
maintained in “LogMath” log base.
Linguist Component
<component name="aflatLinguist"
type="edu.cmu.sphinx.linguist.aflat.AFlatLinguist">
<property name="logMath" value="logMath" />
<property name="grammar" value="AlignerGrammar" />
<property name="acousticModel" value="wsj" />
<property name="addOutOfGrammarBranch" value="${addOOVBranch}" />
<property name="outOfGrammarProbability"
value="${outOfGrammarProbability}" />
<property name="unitManager" value="unitManager" />
<property name="wordInsertionProbability"
value="${wordInsertionProbability}" />
<property name="phoneInsertionProbability"
value="${phoneInsertionProbability}" />
<property name="languageWeight" value="${languageWeight}" />
<property name="phoneLoopAcousticModel" value="WSJ" />
<property name="dumpGStates" value="true" />
</component>
Figure 3.8: Linguist component
Figure 3.8 shows our system’s configuration for the linguist. The linguist embodies the
linguistic knowledge of the system, which consists of the acoustic model, the dictionary,
and the language model. It produces the search graph structure on which the search
manager performs the graph search, using different algorithms.
The “AFlatLinguist” [9] is a simple form of linguist. It makes the following simplifying
assumptions:
• One or no words per grammar node.
• No fan-in allowed.
• No composites.
• The graph only includes unit, HMM, and pronunciation states (and the initial/final
grammar state), no word, alternative or grammar states are included.
• Only valid transitions (matching contexts) are allowed.
• No tree organization of units.
• Branching grammar states are allowed.
It is a dynamic version of the flat linguist, that is more efficient in terms of start-up time
and overall footprint. All probabilities are maintained in the log math domain.
There are a number of property values that can be specified in order to make the
linguist work properly.
The acoustic model property is used to define which acoustic model to use when
building the search graph. The grammar property defines which grammar must be used
when building the search graph, see Section 3.3.3. The “addOutOfGrammarBranch”
allows one to specify whether to add a branch for detecting out-of-grammar utterances,
and the out-of-grammar probability defines the chance of entering the out-of-grammar
branch. The unit manager property is used to define which unit manager to use when
building the search graph, see Section 3.3.4.
The phone insertion probability property specifies the probability of inserting a Context Independent (CI) phone in the out-of-grammar CI phone loop, and the phone loop
acoustic model defines which acoustic model to use to build the phone loop that detects
out of grammar utterances. This acoustic model does not need to be the same as the
model used for the search graph, see Section 3.3.4.
The language weight, also called the language model scaling factor, decides how much
relative importance will be given to the actual acoustic probabilities of the words in the
search path. A low language weight gives more leeway for words with high acoustic
probabilities to be chosen, at the risk of choosing non-existent words. One can decode
several times with different language weights, without re-training the acoustic models, to
decide what is best for the system. A value between 6 and 13 is standard, and by default
the language weight is not applied.
The word insertion penalty is an important heuristic parameter in any dynamic programming algorithm. It is the number that decides how much penalty to apply to a new
word during the search. If new words are not penalized, the decoder would tend to choose
the smallest words possible, since every new word inserted leads to an additional increase
in the score of any path, as a result of the inclusion of the inserted word’s language
probability from the language model.
The word insertion probability controls how readily the recognizer inserts word breaks. If the value is near 1, the recognizer is more likely to break the perceived text into many words, e.g. "A. D. 6" is preferred with a high word insertion probability, while "eighty six" is preferred if the word insertion probability is low. This value is related to the word insertion penalty. We use a value of 1, since the proposed test texts mainly consist of small, elementary-school level words.
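Conceptually, the tuning parameters of this section interact as in the sketch below when a search path is extended with a new word. The exact bookkeeping inside Sphinx-4 differs, so this is only an illustration of the roles of the language weight and the word insertion probability.

// Sketch (not the Sphinx-4 scoring code) of how a path score is extended
// with a new word; all quantities are in the log domain.
final class PathScore {
    static double extend(double pathLogScore,
                         double acousticLogScore,
                         double languageModelLogProb,
                         double languageWeight,
                         double wordInsertionProbability) {
        return pathLogScore
                + acousticLogScore                        // from the scorer (Section 3.3.2)
                + languageWeight * languageModelLogProb   // language model scaling factor
                + Math.log(wordInsertionProbability);     // word insertion penalty
    }
}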
3.3.3
Grammar Component
<component name="AlignerGrammar"
type="edu.cmu.sphinx.linguist.language.grammar.AlignerGrammar">
<property name="dictionary" value="dictionary" />
<property name="logMath" value="logMath" />
<property name="addSilenceWords" value="true" />
<property name="allowLoopsAndBackwardJumps"
value="allowLoopsAndBackwardJumps" />
<property name="selfLoopProbability" value="selfLoopProbability"/>
<property name="backwardTransitionProbability"
value="backwardTransitionProbability" />
</component>
Figure 3.9: Grammar component
The “AlignerGrammar” [9] component was created to provide a customizable grammar
able to incorporate speech disfluencies such as deletions, substitutions, and repetitions in
the audio input.
A grammar is represented internally as a graph, in which we allow the inclusion of silence words by setting the “addSilenceWords” value to true.
The “allowLoopsAndBackwardJumps”, “selfLoopProbability”, and “backwardTransitionProbability” property values are all defined inside the “AlignerGrammar” class. This
will likely be changed to accept the values specified in the configuration file, in a later
version of this Sphinx branch. We have kept the predefined values in the class for use in
our system.
All grammar probabilities are maintained in “LogMath” log domain.
The dictionary defined to use for this grammar is referenced by the dictionary property,
and is defined in the section below.
Dictionary Component
<component name="dictionary"
type="edu.cmu.sphinx.linguist.dictionary.AllWordDictionary">
<property name="dictionaryPath"
value="resource:/en/dict/cmudict.0.7a"/>
<property name="fillerPath" value="resource:/en/noisedict"/>
<property name="dictionaryLanguage" value="EN"/>
<property name="addSilEndingPronunciation" value="true"/>
<property name="wordReplacement" value="<sil>"/>
<property name="unitManager" value="unitManager"/>
</component>
Figure 3.10: Dictionary configuration
This is the most important part of the configuration file, in terms of what to alter
when you are running the program on your own installation. The “dictionaryPath” and
“fillerPath” must be changed to point to the dictionary file and the filler dictionary file,
respectively. The “dictionaryLanguage” must be changed as well, to the language the
system must support.
The “AllWordDictionary” [9] creates a dictionary by quickly reading in an ASCII-based Sphinx-3 format pronunciation dictionary. It loads each line of the dictionary into a hash table, assuming that most words are not going to be used. Only when a word is actually used are its pronunciations copied into an array of pronunciations.
The expected format of the ASCII dictionary is the word, followed by spaces or tabs, followed by the pronunciation(s). For example, an English digits dictionary will look like the first column of Table 3.1. The second column shows the pronunciation dictionary entries for Dutch digits. In this example, the words “one”, “zero”, and “een” have two pronunciations each. One can clearly see that the way a pronunciation dictionary is built depends on which language it represents. Capitalization is important, and each language has its own way to represent a certain phoneme or sound unit.
English                    Dutch
ONE HH W AH N              een @ n
ONE(2) W AH N              een(2) e n
TWO T UW                   twee t w e
THREE TH R IY              drie d r i
FOUR F AO R                vier v i r
FIVE F AY V                vijf v ei f
SIX S IH K S               zes z ee s
SEVEN S EH V AH N          zeven z e v @ n
EIGHT EY T                 acht aa x t
NINE N AY N                negen n e gg @ n
ZERO Z IH R OW             nul n yy l
ZERO(2) Z IY R OW
OH OW

Table 3.1: Example of an English and a Dutch pronunciation dictionary
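To illustrate the format, the small parser below reads one such dictionary line into a word and its phone sequence, stripping the “(2)” marker of alternative pronunciations. It is a toy example, not the Sphinx-4 “AllWordDictionary” loader.

// Toy parser for one line of a Sphinx-3 style ASCII dictionary:
// the word (optionally with an "(n)" alternative marker), whitespace,
// then the phone sequence.
import java.util.Arrays;
import java.util.List;

final class DictionaryLine {
    final String word;
    final List<String> phones;

    DictionaryLine(String line) {
        String[] parts = line.trim().split("\\s+");
        // strip an alternative-pronunciation marker such as "ONE(2)" -> "ONE"
        this.word = parts[0].replaceAll("\\(\\d+\\)$", "");
        this.phones = Arrays.asList(Arrays.copyOfRange(parts, 1, parts.length));
    }

    public static void main(String[] args) {
        DictionaryLine entry = new DictionaryLine("ONE(2) W AH N");
        System.out.println(entry.word + " -> " + entry.phones); // ONE -> [W, AH, N]
    }
}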
3.3.4
Acoustic Model Component
<component name="wsj"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate
.TiedStateAcousticModel">
<property name="loader" value="wsjLoader"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="wsjLoader"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
<property name="logMath" value="logMath"/>
<property name="unitManager" value="unitManager"/>
<property name="location"
value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz"/>
</component>
<component name="unitManager"
type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>
Figure 3.11: Acoustic model configuration
Figure 3.11 shows the configurations for the acoustic model. The “TiedStateAcousticModel” [9] loads a tied-state acoustic model, generated by the Sphinx-3 trainer.
The acoustic model is stored as a directory specified by a URL. The files in that
directory are tables, or pools, of means, variances, mixture weights, and
transition probabilities. This directory is specified in the location property from the
“Sphinx3Loader” [9]. Here it is referring to a container file, but a regular directory is also
possible. The dictionary and language model files are not required to be in the package.
Their locations can be specified separately.
An HMM models a process using a sequence of states. Associated with each state,
there is a probability density function. A popular choice for this function is a Gaussian
mixture. As you may recall from Section 2.2.2, a single Gaussian is defined by a mean
and a variance, or, in the case of a multidimensional Gaussian, by a mean vector and
a covariance matrix, or, under some simplifying assumptions, a variance vector. The
means and variances files in the directory contain exactly that: a table in which each line
contains a mean vector or a variance vector, respectively. The dimension of these vectors is the same as that of the incoming data, namely the feature vectors computed from the speech signal. The Gaussian mixture
is a summation of Gaussians, with different weights for different Gaussians. Each line in
the mixture weights file contains the weights for a combination of Gaussians.
The transitions between HMM states have an associated probability. These probabilities make up the transition matrices stored in the transition matrices file.
The model definition (mdef ) file in a way ties everything together. If the recognition
system models phonemes, there is an HMM for each phoneme. The model definition file
has one line for each phoneme. The phoneme can be context dependent or independent.
Each line, therefore, identifies a unique HMM. This line has the phoneme identification,
the optional left or right context, the index of a transition matrix, and, for each state,
the index of a mean vector, a variance vector, and a set of mixture weights.
If the model has a layout that is different than the default generated by SphinxTrain,
you may specify additional properties like “dataLocation” to set the path to the binary
files, and “mdef” to set the path to the model definition file.
The additions in Figure 3.12 are used to define the phone loop acoustic model in the
linguist (see Section 3.3.2). They are very similar to the specifications for the acoustic
model, but do not necessarily need to point to the same location.
3.3.5
Front End Component
Figure 3.13 shows the different components in the front end pipeline (see Section 3.2.1).
The components are defined in Figure 3.17.
We first list the used data processors in the front end pipeline. More information
about each component can be found below.
<component name="WSJ"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate
.TiedStateAcousticModel">
<property name="loader" value="WSJLOADER" />
<property name="unitManager" value="UNITMANAGER" />
</component>
<component name="WSJLOADER"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
<property name="logMath" value="logMath" />
<property name="unitManager" value="UNITMANAGER" />
<property name="location"
value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz" />
</component>
<component name="UNITMANAGER"
type="edu.cmu.sphinx.linguist.acoustic.UnitManager" />
Figure 3.12: Additions
• audioFileDataSource
• dataBlocker
• preemphasizer
• windower
• fft
• melFilterBank
• dct
• liveCMN
• featureExtraction
audioFileDataSource The “AudioFileDataSource” [9] is responsible for generating a stream of audio data from a given audio file. All required information is read directly
from the file.
This component uses ‘JavaSound’ as a backend, and is able to handle all audio files
supported by it, such as .wav, .au, and .aiff. Besides these, with the use of plugins, it can
support .ogg, .mp3, .speex files and more.
dataBlocker The “DataBlocker” [9] wraps the separate data fragments in blocks of
equal, predefined length.
<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>audioFileDataSource </item>
<item>dataBlocker </item>
<item>preemphasizer </item>
<item>windower </item>
<item>fft </item>
<item>melFilterBank </item>
<item>dct </item>
<item>liveCMN </item>
<item>featureExtraction </item>
</propertylist>
</component>
Figure 3.13: Front end configuration
preemphasizer The “Preemphasizer” [9] component takes a “DataObject” it received
from the “DataBlocker” and passes along the same object with applied preemphasis on
the high frequencies.
High frequency components usually contain much less energy than lower frequency
components, but they are still important for speech recognition. The “Preemphasizer” is
thus a high-pass filter, because it allows the high frequency components to “pass through”,
while weakening or filtering out the low frequency components.
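The pre-emphasis operation itself is a simple first-order filter, sketched below; the coefficient of 0.97 is a commonly used value and is an assumption here, not a value taken from our configuration.

// Illustrative pre-emphasis filter (not the Sphinx-4 implementation):
// y[n] = x[n] - alpha * x[n - 1], which attenuates low frequencies and
// boosts high frequencies.
final class Preemphasis {
    static double[] apply(double[] x, double alpha) { // alpha is typically about 0.97
        double[] y = new double[x.length];
        y[0] = x[0];
        for (int n = 1; n < x.length; n++) {
            y[n] = x[n] - alpha * x[n - 1];
        }
        return y;
    }
}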
windower The windower component slices up a “DataObject” into a number of overlapping windows, also called frames. In order to minimize the signal discontinuities at the
boundaries of each frame, the “RaisedCosineWindower” [9] multiplies each frame with
a raised cosine windowing function. The system uses overlapping windows to capture
information that may occur at the window boundaries. These events would not be well
represented if the windows were juxtaposed.
The number of resulting windows depends on the window size and the window shift.
Figure 3.14 shows the relationship between the data stream, the window size, the window
shift, and the returned windows.
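The framing arithmetic and the raised-cosine weighting can be sketched as follows. The 0.54/0.46 coefficients below are the common Hamming choice and are used purely for illustration; this is not the Sphinx-4 code.

// Illustration of framing: how many windows a signal yields for a given
// window size and shift, and a raised-cosine (Hamming) weighting of a frame.
final class Windowing {
    static int numberOfWindows(int numSamples, int windowSize, int windowShift) {
        if (numSamples < windowSize) return 0;
        return 1 + (numSamples - windowSize) / windowShift;
    }

    static double[] hammingWindow(double[] frame) {
        int n = frame.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = frame[i] * (0.54 - 0.46 * Math.cos(2.0 * Math.PI * i / (n - 1)));
        }
        return out;
    }
}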
fft The next component computes the Discrete Fourier Transform (DFT) of an input sequence, using Fast Fourier Transform (FFT). Fourier Transform is the process of analysing
a signal into its frequency components, see Section 2.2.1 for more information about the
Fourier transform.
Figure 3.14: The relation between original data size, window size and window shift; figure adapted from the Sphinx documentation [9]
melFilterBank The “MelFrequencyFilterBank” [9] component filters an input power
spectrum through a bank of mel-filters. The output is an array of filtered values, typically
called the mel-spectrum, each corresponding to the result of filtering the input spectrum
through an individual filter.
The mel-filter bank, if visually represented, looks like Figure 3.15.
Figure 3.15: A mel-filter bank (the number of triangular filters equals the number of mel-filters, which is the length of the mel-spectrum); figure adapted from the Sphinx documentation [9]
The distance at the base from the centre to the left edge is different from the centre
to the right edge. Since the centre frequencies follow the mel-frequency scale, which is a
nonlinear scale that models the nonlinear human hearing behaviour, the mel filter bank
corresponds to a warping of the frequency axis. Filtering with the mel scale emphasizes
the lower frequencies.
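The mel scale that determines these centre frequencies is commonly approximated by the following mapping (the standard textbook formula, given here for reference; the exact constants used by Sphinx may differ):

\[
\mathrm{mel}(f) = 2595 \cdot \log_{10}\!\left(1 + \frac{f}{700}\right)
\]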
dct The “DiscreteCosineTransform” [9] first applies a logarithm, and then a Discrete
Cosine Transform (DCT) to the input data, which is the mel spectrum received from the
previous component in the pipeline. When the input is a mel-spectrum, the vector returned is the MFCC (Mel-Frequency Cepstral Coefficient) vector, where the 0-th element
is the energy value. For more information, see Section 2.2.1.
This component of the pipeline corresponds to the last stage of converting a signal to
cepstra, and is defined as the inverse Fourier transform of the logarithm of the Fourier
transform of a signal.
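Written out, with S_m the m-th value of the mel-spectrum and M the number of mel-filters, the cepstral coefficients take the standard MFCC form (shown for reference; this is the textbook definition rather than a transcription of the Sphinx source):

\[
c_n = \sum_{m=1}^{M} \log(S_m)\,\cos\!\left(\frac{\pi\, n\,(m - 0.5)}{M}\right), \qquad n = 0, 1, \ldots
\]

with c_0 acting as the energy-like term mentioned above.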
liveCMN The “LiveCMN” [9] component applies cepstral mean normalization (CMN)
to the incoming cepstral data. Its goal is to reduce the distortion caused by the transmission channel. The output is mean normalized cepstral data.
The component does not read the entire stream of Data objects before it calculates
the mean. It estimates the mean from already seen data and subtracts this from the Data
objects on the fly. Therefore, there is no delay introduced by “LiveCMN”, and it is thus
the best choice for live applications.
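A minimal sketch of this idea, with a running mean per cepstral dimension that is updated as frames arrive, could look as follows (the update schedule of the real LiveCMN component differs; this only illustrates the principle):

public final class LiveCmnSketch {
    private final double[] mean;
    private long framesSeen = 0;

    LiveCmnSketch(int dimensions) { this.mean = new double[dimensions]; }

    // Subtract the current mean estimate from the incoming cepstrum, then update the estimate.
    double[] normalize(double[] cepstrum) {
        double[] out = new double[cepstrum.length];
        for (int d = 0; d < cepstrum.length; d++) {
            out[d] = cepstrum[d] - mean[d];
            mean[d] += (cepstrum[d] - mean[d]) / (framesSeen + 1); // incremental mean update
        }
        framesSeen++;
        return out;
    }

    public static void main(String[] args) {
        LiveCmnSketch cmn = new LiveCmnSketch(3);
        System.out.println(java.util.Arrays.toString(cmn.normalize(new double[]{1.0, 2.0, 3.0})));
    }
}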
featureExtraction The last component in the front end pipeline is the “DeltasFeatureExtractor” [9]. It computes the delta and double delta of the input cepstrum (or plp
or ...). The delta is the first order derivative and the double delta (a.k.a. delta delta) is
the second order derivative of the original cepstrum. They help model the speech signal
dynamics. The output data is a “FloatData” object with an array formed by the concatenation of cepstra, delta cepstra, and double delta cepstra, see Figure 3.16. The output is
the feature vector that will be used by the decoder.
cepstrum | delta | double delta
Figure 3.16: Layout of the returned features; figure adapted from the Sphinx documentation [9]
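A common way to approximate these derivatives is a simple difference over neighbouring frames; the sketch below uses a two-frame central difference with clamping at the edges, which is one possible choice and not necessarily the exact window the DeltasFeatureExtractor uses.

public final class DeltaSketch {
    // cepstra[t][k]: coefficient k at frame t; the delta is a central difference over time.
    static double[][] delta(double[][] cepstra) {
        int frames = cepstra.length, dims = cepstra[0].length;
        double[][] d = new double[frames][dims];
        for (int t = 0; t < frames; t++) {
            int prev = Math.max(t - 1, 0), next = Math.min(t + 1, frames - 1);
            for (int k = 0; k < dims; k++) {
                d[t][k] = (cepstra[next][k] - cepstra[prev][k]) / 2.0;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[][] c = {{1, 2}, {2, 4}, {4, 8}};
        double[][] delta = delta(c);
        double[][] doubleDelta = delta(delta); // second-order derivative, the "double delta"
        System.out.println(delta[1][0] + " " + doubleDelta[1][0]);
    }
}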
3.3.6
Monitors
Figure 3.18 shows a number of examples of possible monitors and their configuration.
They can be used by adding them to the decoder, as seen in Figure 3.5.
The accuracy tracker is turned off in our configuration, but would track and report
the recognition accuracy based on the highest scoring search path of the result, since it
uses the “BestPathAccuracyTracker” [9] class.
The “MemoryTracker” [9] class monitors the memory usage of the recognition task,
while the “SpeedTracker” [9] reports on the speed of the recognition. Apart from giving
<component name="audioFileDataSource"
type="edu.cmu.sphinx.frontend.util.AudioFileDataSource"/>
<component name="dataBlocker"
type="edu.cmu.sphinx.frontend.DataBlocker"/>
<component name="preemphasizer"
type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower"
type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower"/>
<component name="fft"
type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>
<component name="melFilterBank"
type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank"/>
<component name="dct"
type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>
<component name="liveCMN"
type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>
<component name="featureExtraction"
type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>
<component name="logMath" type="edu.cmu.sphinx.util.LogMath">
<property name="logBase" value="1.0001"/>
<property name="useAddTable" value="true"/>
</component>
Figure 3.17: Front end pipeline elements
the actual elapsed time, it also reports the amount of time used in relation to the length
of the audio input file.
This chapter gave a detailed description of the configuration file that specifies the ASR
techniques used by the Sphinx system in our application. This configuration represents
the inner workings of the alignment application we designed, which is described in
Chapter 4.
<component name="accuracyTracker"
type="edu.cmu.sphinx.instrumentation.BestPathAccuracyTracker">
<property name="recognizer" value="${recognizer}" />
<property name="showAlignedResults" value="false" />
<property name="showRawResults" value="false" />
</component>
<component name="memoryTracker"
type="edu.cmu.sphinx.instrumentation.MemoryTracker">
<property name="recognizer" value="${recognizer}" />
<property name="showSummary" value="true" />
<property name="showDetails" value="false" />
</component>
<component name="speedTracker"
type="edu.cmu.sphinx.instrumentation.SpeedTracker">
<property name="recognizer" value="${recognizer}" />
<property name="frontend" value="${frontend}" />
<property name="showSummary" value="true" />
<property name="showDetails" value="false" />
</component>
Figure 3.18: Example of monitors
Chapter 4
Our Application
This chapter describes in detail the plugin system we created. We first briefly discuss a
high-level view of our application and explain the reasoning behind our high-level choices.
Then we explain each component more elaborately, and provide the reader with a
number of best practices, i.e. how to get our system up and running. We end the chapter
with an explanation of the possible output formats of our application, namely the .srt
and EPUB file formats.
4.1
High-Level View
Our application is written entirely in the Java™ language, and consists of three separate
components, see Figure 4.1.
• The first component is the Main component. This is where the main functionality
of our application is located: it parses the command line, chooses the plugin, and
loads it into the application (as the plugin is located in a different component, see
below). It also contains the testing framework.
• The second component is the Plugin Knowledge component, which contains all the
functionality one needs to implement the actual plugin. It provides the user with
two possible output formats, namely a standard subtitle file (.srt file format) and
an EPUB file. This component receives the audio and text input from the main
component, and passes it along to the ASR plugin component.
• The third component is where the ASR plugin is actually located. We refer to this
component as the (ASR) Plugin component, since it contains the ASR system. In
our application we use the Sphinx-4 plugin, see Chapter 3 for more information.
We decided to split our application into these three components for two reasons. The
first reason is that, with this split, the only part that needs any knowledge of the ASR
plugin that is used is the third component. The other two components
Main Component -> Plugin Knowledge Component <- ASR Plugin Component
Figure 4.1: High-level view of our application
have no knowledge at all about the inner workings of the ASR system, and can thus easily
be altered by someone who just wants some small tweaks to the main output, or the tests
output, and has no idea how the ASR component works. The second reason is to keep
the addition of a new plugin to the application as easy as possible. If someone wants to
change the ASR system that is used by our application, they only need to provide a link
from the second component to the plugin component (see Section 4.2). By splitting up
the first and second components, they do not need to work out all the extra classes used
by the first component, which have no impact on the ASR plugin whatsoever, and they
can start working even when provided with only the second component.
The arrows in Figure 4.1 represent which component has access to which other component. The first and third components only need access to the second one. This component
is added to the libraries of the first and third components to provide said access.
We will now discuss these components in more detail in the following sections. As our
application is developed in Java, the components correspond to separate projects, which
is why we will also refer to them as ‘projects’. The first component is called the
‘pluginsystem’ project, the second component is called the ‘pluginsystem.plugin’ project,
and the third component is called the ‘pluginsystem.plugins.sphinxlongaudioaligner’
project in our Java application.
4.2
Components 2 & 3: Projects pluginsystem.plugin
& pluginsystem.plugins.sphinxlongaudioaligner
Figure 4.2 contains all the classes in the second project, and the one class from the plugin
project that links these projects together, namely the SphinxLongSpeechRecognizer
class. The interactions between the classes are also shown.
<<interface>> SpeechRecognizer
  +align(audioFile:String, textFile:String): void
  +setConfiguration(configuration:HashMap<String,String>): void
  +addSpeechRecognizedListener(listener:SpeechRecognizedListener): void
  +removeSpeechRecognizedListener(listener:SpeechRecognizedListener): void
AbstractSpeechRecognizer
  +align(audioFile:String, textFile:String): void
<<interface>> SpeechRecognizedListener
  +wordRecognized(word:String, start:long, duration:long): void
  +endOfSpeechReached(): void
SphinxLongSpeechRecognizer
  +align(audioFile:String, textFile:String): void
FileHelpers
SrtOutputListener
EPubOutputListener
Figure 4.2: UML scheme of the pluginsystem.plugin project, including the class that links the
plugin to our application, namely SphinxLongSpeechRecognizer
The second project contains all the classes and interfaces needed to implement one's own
plugin for our system. Its most important class is the AbstractSpeechRecognizer, which
implements the SpeechRecognizer interface and must be extended by a class in the third
project (the plugin project).
The only abstract method in AbstractSpeechRecognizer is the align(String audioFile, String textFile) method. This method receives a reference to the text input file and
the audio input file that need to be synchronized. How this is done is obviously highly
dependent on which ASR system is used, which is why this method must be implemented
by a class in the plugin project.
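To illustrate what such a plugin class might look like, the sketch below extends AbstractSpeechRecognizer; the package names and the hypothetical MyAsrEngine calls are placeholders and do not come from our code base.

// Hypothetical plugin sketch; package and engine names are placeholders.
package pluginsystem.plugins.myasr;

import pluginsystem.plugins.AbstractSpeechRecognizer; // assumed location of the abstract class

public class MySpeechRecognizer extends AbstractSpeechRecognizer {
    @Override
    public void align(String audioFile, String textFile) {
        // 1. Hand the audio file and the text file to the wrapped ASR system
        //    (e.g. new MyAsrEngine(audioFile, textFile) - placeholder call).
        // 2. For every word the engine aligns, notify the registered listeners with
        //    the word, its start time and its duration, so that the SRT or EPUB
        //    listener can write it out.
        // 3. When the end of the audio is reached, signal the listeners so they can
        //    close their output.
    }
}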
When the SphinxLongSpeechRecognizer class of the ASR plugin recognizes, or aligns,
a word, it calls on the insertMissingWords(ArrayList<String> inputText, String word,
long startTime, long duration, long previousEndTime) method. Even though we provide
the Sphinx ASR system with the entire spoken text, it does not necessarily recognize all
words of that text. Therefore, we created this insertMissingWords method to check against
the input text file and insert possible missing words with their corresponding start times
and durations. The start time of a missing word is set to the end time of the previous word.
The duration is calculated by taking the time between the end time of the previous
word and the end time of the next recognized word, and dividing it by the total number
of letters in the missing words and the next recognized word.1 Each missing word then
gets a duration based on the length of the word.
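The interpolation described above can be sketched as follows; this is an illustration of the idea under the stated assumptions (the gap split proportionally to word length), not the actual insertMissingWords implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: the time gap between the end of the previous recognized word and the end of
// the next recognized word is split proportionally to the number of letters per word.
public final class MissingWordTimingSketch {

    static List<String> insertMissingWords(List<String> missingWords,
                                           String nextRecognizedWord,
                                           long previousEndMs,
                                           long nextEndMs) {
        int totalLetters = nextRecognizedWord.length();
        for (String w : missingWords) {
            totalLetters += w.length();
        }
        double msPerLetter = (double) (nextEndMs - previousEndMs) / totalLetters;

        List<String> timed = new ArrayList<>();
        long start = previousEndMs; // a missing word starts where the previous word ended
        for (String w : missingWords) {
            long duration = Math.round(w.length() * msPerLetter);
            timed.add(w + " start=" + start + "ms duration=" + duration + "ms");
            start += duration;
        }
        return timed;
    }

    public static void main(String[] args) {
        // "een huis" missing between a word ending at 1000 ms and "vol" ending at 1900 ms
        for (String s : insertMissingWords(Arrays.asList("een", "huis"), "vol", 1000, 1900)) {
            System.out.println(s);
        }
    }
}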
Once each word has been assigned a start time and a duration, we pass this on to the
wordRecognized(String word, long start, long duration) method of the specified listener.
The listener will then output each word with its designated times according to the desired
output format, e.g. an .srt file format.
New listeners for new output formats can be easily created. They only have to implement the SpeechRecognizedListener interface, which consists of just two methods. The
first method, as described above, specifies what action needs to be taken when a word is
aligned; the second method specifies what needs to happen when the end of the audio file
is reached, e.g. close the output file.
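As an illustration, a self-contained sketch of such a listener is given below. The local WordListener interface simply mirrors the two methods of SpeechRecognizedListener described above; a real listener would implement the project's own interface instead.

import java.io.PrintWriter;

public class SimpleSrtListenerSketch {

    interface WordListener { // local stand-in mirroring SpeechRecognizedListener
        void wordRecognized(String word, long startMs, long durationMs);
        void endOfSpeechReached();
    }

    static class SrtWriter implements WordListener {
        private final PrintWriter out;
        private int counter = 1;

        SrtWriter(PrintWriter out) { this.out = out; }

        @Override
        public void wordRecognized(String word, long startMs, long durationMs) {
            out.println(counter++);
            out.println(format(startMs) + " --> " + format(startMs + durationMs));
            out.println(word);
            out.println();
        }

        @Override
        public void endOfSpeechReached() { out.close(); }

        // hh:mm:ss,mmm as used in .srt files
        private static String format(long ms) {
            return String.format("%02d:%02d:%02d,%03d",
                    ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000);
        }
    }

    public static void main(String[] args) {
        SrtWriter writer = new SrtWriter(new PrintWriter(System.out, true));
        writer.wordRecognized("muis", 440, 840); // produces the timings of Figure 5.3
        writer.endOfSpeechReached();
    }
}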
The FileHelpers class provides some useful methods for reading a text file. For
example, the removePunctuation() method specifies how the ASR plugin should react to
different punctuation symbols. We created this method because Sphinx-4, like most ASR
systems, cannot handle punctuation and removes it from the input text. However, some
symbols are better replaced with a space character, or are not recognised by Sphinx at all,
and in those cases this method comes in handy.
Important note: To define which ASR plugin must be used, the user needs to alter
(or add) the “pluginsystem.plugins.SpeechRecognizer” file, which can be found in the
“META-INF.services” folder of the plugin source code. The file is named after the
interface that is implemented. The first line in that file must refer to the class
that implements this interface in the plugin project (the third project); e.g. in our application, to use the Sphinx-4 plugin, the first line reads ‘pluginsystem.plugins.sphinx.SphinxLongSpeechRecognizer’. See Section 4.3 for more information.
1
As vowels tend to have a ‘longer’ pronunciation than consonants [51], it would be more accurate to
take the number of vowels in each word into account when calculating its duration. However, even for
the same vowel, pronunciation lengths may vary greatly, e.g. “I’m” versus “hiccup” for the vowel i. Then
there is also the problem of double vowels: are they pronounced as one sound, as in the word
“boat”, or separately, as in the word “coefficients”? How to handle vowels is clearly highly dependent
on the word and its language, and we therefore decided not to discriminate between vowels and consonants.
4.3
Component 1: Project pluginsystem
Figure 4.3 contains the classes in the first project, and their interactions. The MainProgram class is the main class in our application. It processes the command line arguments and loads the plugin, using the SpeechRecognizerService and ClassLoader
classes.
SpeechRecognizerService, MainProgram, FolderURLClassLoader, SystemClassLoader, TextConverter, TestAccuracy
Figure 4.3: UML scheme of the pluginsystem project
Loading the ASR plugin consists of two steps. The first is to add all the required
resources to the class path, e.g. the .jar file of the plugin, the used ASR system, required
references, etc. Our application automatically adds all .jar files located in the “plugins” directory to the class path. This is done using the addURL method of Java's default
ClassLoader, the URLClassLoader. We use reflection to call this method, as
it has the modifier protected. An alternative approach could have been to create and
use our own ClassLoader class that extends URLClassLoader and makes the addURL method
public2 . The functionality to locate and add files to the class path can be found in the
SystemClassLoader class.
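The reflective call can be sketched as follows, assuming a JVM on which the system class loader is a URLClassLoader (true for Java 8 and earlier, the environment this application targets); on newer JVMs the alternative with a custom class loader would be required.

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: add every .jar in the "plugins" directory to the class path by invoking the
// protected URLClassLoader.addURL method through reflection.
public final class PluginClassPathSketch {
    public static void addPluginJars(File pluginDir) throws Exception {
        URLClassLoader loader = (URLClassLoader) ClassLoader.getSystemClassLoader();
        Method addUrl = URLClassLoader.class.getDeclaredMethod("addURL", URL.class);
        addUrl.setAccessible(true); // addURL is protected
        File[] jars = pluginDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars == null) return;
        for (File jar : jars) {
            addUrl.invoke(loader, jar.toURI().toURL());
        }
    }

    public static void main(String[] args) throws Exception {
        addPluginJars(new File("plugins"));
    }
}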
The second and final step is to find, and create an instance of, the class that implements the SpeechRecognizer interface. To achieve this, the concept of services and
service providers is used. A service can be defined as a well-known set of interfaces, while
a service provider is a specific implementation of a service. The Java SE Development Kit
comes with a simple service provider loading facility located in the java.util.ServiceLoader class.
The ServiceLoader class requires that all service providers are identified by placing
one or more provider configuration files in the resource directory “META-INF/services”.
As mentioned before, at the end of Section 4.2, the name of the file should correspond
2
Since we use the default ClassLoader class, all referenced resources will be added to the class path
as well.
with the fully-qualified binary name of the service type. The content consists of one or more
lines, where each line is the fully-qualified binary name of a service provider. As stated
before, this means that in our case, every plugin should have a provider-configuration
file called “pluginsystem.plugins.SpeechRecognizer” with, as content, the fully-qualified
binary name of the class that implements this interface.
In order to improve the maintainability of our code, we added an abstraction level
around the ServiceLoader object in class SpeechRecognizerService. This class has a
static function called getSpeechRecognizer which returns the SpeechRecognizer object
that should be used by our system. Internally, this class uses the ServiceLoader class as
described above.
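A minimal version of such a wrapper is sketched below. The nested interface is a local stand-in for the project's SpeechRecognizer interface, used only to keep the sketch self-contained; the real SpeechRecognizerService works against the project interface and the provider file named pluginsystem.plugins.SpeechRecognizer.

import java.util.ServiceLoader;

public final class SpeechRecognizerServiceSketch {

    public interface SpeechRecognizer { // stand-in for the project's interface
        void align(String audioFile, String textFile);
    }

    // Return the first SpeechRecognizer provider found via the META-INF/services mechanism.
    public static SpeechRecognizer getSpeechRecognizer() {
        ServiceLoader<SpeechRecognizer> loader = ServiceLoader.load(SpeechRecognizer.class);
        for (SpeechRecognizer recognizer : loader) {
            return recognizer; // first provider listed in the configuration file
        }
        throw new IllegalStateException("no SpeechRecognizer provider found on the class path");
    }
}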
The TextConverter and TestAccuracy classes are used to perform the accuracy tests
on the received output. The tests can be run using the commands in Figure 4.4. The
<<jarfile>> in the commands is the path to the .jar file that contains the runnable
program code.
$> java -jar <<jarfile>> TEST CONVERT "narration_helper.txt"
"PlayLane_manual_SRTfile"
$> java -jar <<jarfile>> TEST "PlayLane_manual_SRTfile converted.srt"
"output_SRTfile"
Figure 4.4: The commands used for testing accuracy
The PlayLane company uses two files to specify timings for each word. The first file
(the “narration helper.txt” file) contains a label for each word (see the first column in
Table 4.1); the second file is an .srt file, which specifies a word label for each timing (see
the second column in Table 4.1).
Because of this internally used file format, which specifies which word needs to be highlighted at
which time, we first need to convert these PlayLane files to a regular .srt file, so we can
easily compare this manual transcription with our automatically generated transcription.
This is done by the TEST CONVERT command, as shown in Figure 4.4.
Once the PlayLane .srt file contains the corresponding words, we can compare our
automatically generated transcription with their manual transcription. This is done by
the TEST command in Figure 4.4. The accuracy results are discussed in Section 5.2 of
Chapter 5.
How all the classes of our application interact across the different projects can be
seen in Figure 4.5.
8 : 11 : en      151
8 : 12 : een     00:01:45,100 --> 00:01:45,620
8 : 13 : huis    8:13
8 : 14 : vol     152
8 : 15 : vuur    00:01:45,620 --> 00:01:46,060
8 : 16 : .       8:14
8 : 17 : Bas     153
8 : 18 : kijkt   00:01:46,060 --> 00:01:46,560
8 : 19 : zijn    8:15
                 154
                 00:01:48,120 --> 00:01:48,480
                 8:17
                 155
                 00:01:48,850 --> 00:01:49,100
                 8:18
Table 4.1: Example excerpt of the “narration helper.txt” file (on the left) and its corresponding
.srt file (on the right) for the book “De luie stoel”
4.4 Best Practices for Automatic Alignment
4.4.1 How to add a new Plugin to our System
1. If a user decides to use a different ASR system with our application than the already
provided Sphinx-4, the first thing they need to do is add the second project to the
libraries of their plugin. Or, in case their plugin does not provide access to its source
code, create a new Java project and add the ASR plugin and the second project to
that new Java project’s libraries.
2. Then they need to create a new Java class (or alter an already existing class) to
extend the AbstractSpeechRecognizer class, and implement the align method,
as explained in Section 4.2. This method will call on the ASR functions from the
plugin.
3. Add the “pluginsystem.plugins.SpeechRecognizer” file (without extension!), to the
“META-INF.services” folder of the plugin source code. Add the name of the class
created in step 2 above, including the packages it is part of, to the first line of this
file (see Section 4.2 for more information on this file).
4. Build the plugin project (this will create a .jar file for your project) and copy the
generated content from the “dist” folder3 to the “plugins” folder of the first project.
3
The “dist” folder should now contain the .jar file for the plugin project and a folder called “lib”, which contains all the libraries needed for this ASR plugin.
SpeechRecognizerService, MainProgram, TestAccuracy, FolderURLClassLoader, SystemClassLoader, TextConverter, <<interface>> SpeechRecognizer, FileHelpers, <<abstract>> AbstractSpeechRecognizer, <<interface>> SpeechRecognizedListener, EPubOutputListener, SphinxLongSpeechRecognizer, SrtOutputListener
Figure 4.5: UML scheme of the entire application
Doing this will ensure that, when running the application, it has access to all the
libraries its projects might need. Also, remove from the “plugins” folder the .jar files of
plugins that should not be used when running the application.
4.4.2
How to run the Application
5. Firstly, the input audio file needs to conform to a number of characteristics: it needs
to be monophonic, have a sampling rate of 16 kHz, and each sample must be encoded
in 16 bits, little endian. We use a small tool called SoX to achieve this [58].
$> sox "inputfile" -c 1 -r 16000 -b 16 --endian little "outputfile.wav"
This tool is also useful to cut long audio files into smaller chunks (an audio file length
of around 30 minutes is preferable to create a good alignment).
6. The input text file that contains the text that needs to be aligned with the input
audio file, should preferably be in a simple text format, such as .txt. It does, however, need
to be encoded in UTF-8. This is usually already the case, but it can easily be
verified and applied in source code editors, such as Notepad++ [30]. This is needed
to correctly interpret the special characters that might be present in the text, such
as quotes, accented letters, etc.
7. If this is done, the alignment can be started with the following command:
$> java -Xmx1000m -jar <<jarfile>> -a "audio_input.wav" -t "text_input.txt"
--config configFile="config_files/configAlignerEN.xml";outputFormat="SRT"
where <<jarfile>> is the path to the .jar-file that contains the runnable program
code. We use the -Xmx1000m setting to ensure our system has access to about a
gigabyte of memory; more detailed information about Sphinx’s memory use can be
found in Section 5.2.2 of Chapter 5. The “audio input.wav” and “text input.txt”
files contain the audio and text that need to be aligned with each other.
The “config files/configAlignerEN.xml” refers to the configuration file that Sphinx
needs to align English audio and text. This can be changed to the Dutch configuration file (“configAlignerNL.xml”). Both files are already included in our system,
but as they are Sphinx-specific we opted to leave their reference in the command
in case the user decides to use a different ASR plugin. If, however, it is decided
to stick with the Sphinx plugin, the main class can easily be altered to make the
command to run the alignment task cleaner and briefer.
The output format in the command is set to produce an .srt file, but an EPUB
file can be created by setting the “outputFormat” value to “EPUB”.
When running the application, it will first output the parameters received from the
command line, as they are added to the configuration (see Figure 4.6).
Adding configuration: key=inputAudioFile,
value=IO_files/Ridder muis/ridder muis 0.wav
Adding configuration: key=inputTextFile,
value=IO_files/Ridder muis/ridder muis input.txt
Adding configuration: key=configFile, value=config_files/configAlignerNL.xml
Adding configuration: key=outputFormat, value=SRT
Figure 4.6: Example of configuration output for book “Ridder Muis”
Then it will output all remarks and warnings made by the ASR plugin (remember the
warning loglevel explained in Section 3.3.1 and the monitors in Section 3.3.6).
Lastly, it will output the names and paths of the specified text and audio input files,
as well as the name and path of the output file. This provides the user with the possibility
to easily check which files were used if perhaps the result is not as expected, or if one
runs a whole batch of transcription commands in a row. An example of such an output
can be found in Figure 4.7.
Used Files:
Input Text File: IO_files/Ridder muis/ridder muis input.txt
Input Audio File: IO_files/Ridder muis/ridder muis 0.wav
Output File: IO_files/Ridder muis/ridder muis 0.srt
Figure 4.7: Example of path and file names for book “Ridder Muis”
Note that the name of the output file, independent of which output format was chosen,
will be the same as the name of the audio file, with the proper output format extension.
4.5
Output Formats
After presenting our application in the sections above, the output format possibilities are
briefly described in this section. Our application supports two output formats,
namely the .srt and EPUB file formats. As mentioned above, other formats can
easily be added to the application.
4.5.1
.srt File Format
The .srt file format is a standard subtitle file format, and has a fairly straightforward
form, as can be seen in Figure 4.8. The ‘subtitles’ are numbered separately, and contain
the start and stop times of each subtitle, separated by an --> arrow.
4.5.2
EPUB Format
The EPUB file format can be opened and viewed by a number of devices and tools, and
provides a simple, widely-used standard format for reading books. It is a container format
and has the possibility of containing audio and the corresponding speech-text alignment.
An EPUB output file consists of two folders: the META-INF and OEBPS folder. The
META-INF folder contains an XML file, called “container.xml”, which directs the device
processor to where to find the meta data information.
The OEBPS (Open eBook Publication Structure) folder contains all the book’s contents, e.g. text, audio, alignment specifications. There is a “content.opf” file which
specifies all the meta-data of the book, such as author, language, etc. There are also a
37
00:00:21,705 --> 00:00:22,160
chapter
38
00:00:24,690 --> 00:00:25,400
1
39
00:00:26,420 --> 00:00:27,230
loomings
40
00:00:29,220 --> 00:00:29,480
call
41
00:00:29,480 --> 00:00:29,630
me
42
00:00:29,630 --> 00:00:30,420
ishmael
Figure 4.8: Example of an .srt file content; taken from the .srt output for “Moby Dick”
.xhtml and a .smil file. The first contains the book’s textual contents, the latter contains
the alignment specifications between the audio file and textual contents in the .xhtml file.
Figure 4.9 shows part of the EPUB output file for book “Moby Dick”. It is displayed
by the Readium application for the Chrome web browser.
Figure 4.9: Example part of an EPUB file, generated by our application
Chapter 5
Results and Evaluation
In this chapter we present the test results we achieved with the application we described in
the previous chapter. We first provide some more information about the test files we were
able to use, in Section 5.1. We then, in Section 5.2, discuss the results we achieved when
running our application on those test files, comparing them to the baseline alignment
provided to us by the Playlane company, as described extensively in Section 5.1. In
Sections 5.3, 5.4 and 5.6, we investigate the effects the pronunciation dictionary,
the input accuracy and the acoustic model, respectively, have on the alignment results. As is
explained in Section 5.7, during the writing of this dissertation, CMU released a new
Sphinx version. We briefly discuss the results we achieved when using this newer, though
unstable, version. We conclude this chapter with Section 5.8, in which we present some
impressions we gathered when manually checking the alignment results for English audio
and text. The conclusions we draw from these test results, and ideas for future
work, are presented in the next chapter.
5.1
Test Files
We were provided with a number of books by the Playlane company, which we were able
to use to verify our application with. Each book contained a complete textual version of
the spoken book, which we used as our text input file, an audio file containing the book
read at normal pace, and one read at a slow pace, a subtitle file containing word-per-word timings using labels for each audio file, and a “narration helper” file (as discussed
in Section 4.3). They are all Dutch books.
The subtitle files that were provided by the Playlane company were manually made by
the employees of Playlane. They listen to the audio track and manually set the timings
for each word.
To verify the accuracy of our application we needed books that already had a word-per-word transcription, so we could compare that transcription with the one we generated
using the ASR plugin. We are aware that, due to human errors, the alignments provided
by the Playlane company might not be perfect. However, they do provide a decent baseline
to compare our achieved accuracy to, and are therefore regarded as the ground truth for
our alignment. These books are listed below:
• Avontuur in de woestijn
• De jongen die wolf riep
• De muzikanten van Bremen
• De luie stoel
• Een hut in het bos
• Het voetbaltoneel
• Luna gaat op paardenkamp
• Pier op het feest
• Ridder muis
• Spik en Spek: Een lek in de boot
All these books are read by S. and V., who are both female, and have Dutch as their
native language. Only two books, namely “De luie stoel” and “Het voetbaltoneel” are
read by S., the others are read by V.
Figure 5.1: Chart containing the size of the input text file, and length of the input audio files
for both normal and slow pace
In Figure 5.1, the number of words for each book is shown, as well as the length of
the audio files, both slow and normal pace versions, of each book. Figure 5.2 shows the
Figure 5.2: Percentage of extra audio length for the books read at slow pace, compared to the
normal pace audio file length
percentage of extra audio length for the slow pace audio files, compared to the normal
pace files. For example, the audio file length of the slow pace version of “De jongen
die wolf riep” has more than tripled compared to the normal pace version; its value in
Figure 5.2 is over 300%, and it is, coincidentally, also the book with the biggest audio
file length difference between slow and normal pace.
5.2 Results
5.2.1 Evaluation Metrics and Formulas
Note that when we, from here on, refer to the ‘mean start and/or stop time difference’ we
mean to say the average time difference for each word’s start and/or stop time between the
manually transcribed .srt file provided by the Playlane company, and the automatically
generated output created by our application. Figure 5.3 shows an example of a word with
its start and stop times. This means that the word “Muis” will start being highlighted
when the audio file that is playing reaches 440 milliseconds, and it will stop being
highlighted when the audio file reaches one second and 280 milliseconds. Thus its start
time is 440 milliseconds and its stop time is one second and 280 milliseconds.
1
00:00:00,440 --> 00:00:01,280
Muis
Figure 5.3: Example of a word and its start and stop times, in .srt file format
1
00:00:00,357 --> 00:00:01,284
Muis
Figure 5.4: Example of a word and its start and stop times, in .srt file format
Say, e.g., that the example in Figure 5.3 is a part of the alignment baseline provided by
Playlane, and that the example in Figure 5.4 is a part of the alignment automatically generated by our application. To measure the average deviation our automatically generated
alignment displays compared to the Playlane alignment, we take, for example, the start
time for the word “Muis” in Figure 5.3 and subtract the start time for the same word in
Figure 5.4. We then take the absolute value of the result of this subtraction, which in our
example would be 83, and sum these values over every word that appears in both
the automatic alignment result and the Playlane alignment. Finally, we divide this sum by
the number of words used for the summation, which provides us with the average start time
difference. The following formula is a mathematical representation of these calculations.
\[
\mathrm{meanStartTimeDifference} = \frac{1}{W}\sum_{i=1}^{W} \left| \mathrm{PlaylaneStartTime}(w_i) - \mathrm{autoStartTime}(w_i) \right|
\]
The same process is repeated to calculate the mean stop time difference.
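Expressed as code, the metric could be computed as follows, assuming both alignments are available as arrays of start times in milliseconds, ordered by word; this is only a sketch of the calculation, not our test code.

public final class MeanTimeDifferenceSketch {
    // Mean absolute start time difference between the manual (Playlane) timings and
    // the automatically generated timings, both given per word in milliseconds.
    static double meanStartTimeDifference(long[] playlaneStartMs, long[] autoStartMs) {
        long sum = 0;
        for (int i = 0; i < playlaneStartMs.length; i++) {
            sum += Math.abs(playlaneStartMs[i] - autoStartMs[i]);
        }
        return (double) sum / playlaneStartMs.length;
    }

    public static void main(String[] args) {
        // the "Muis" example from Figures 5.3 and 5.4: |440 - 357| = 83 ms
        System.out.println(meanStartTimeDifference(new long[]{440}, new long[]{357}));
    }
}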
5.2.2
Memory Usage and Processing Time
We will first show the memory usage and processing time needed by Sphinx to perform
the alignment task. As can be seen in Table 5.1, the memory use to align a text can run
up to over half a gigabyte, which is why we provide our application with approximately
a gigabyte of free memory. What is also notable is how little processing time Sphinx
needs, as it never exceeds a tenth of the total audio length (x RT1 ), see Figure 5.5.
These memory usage and processing times were achieved on a computer running a
32-bit operating system (Windows 8.1 Pro) on an x64-based processor, with 4 Gigabytes
of RAM, of which 3.24 Gigabytes were usable. It has an Intel(R) Core(TM)2 Duo T8300
processor with a clock rate of 2.40 GHz.
1
“RT” stands for real time, which refers to the total audio length. If, e.g., something took 0.05 x RT,
it took five percent of the original audio length to perform the task.
book title                 pace    memory usage (Mb)   total audio length (ms)   processing time (ms)   speed (x RT)
Avontuur in de woestijn    normal  629.92              69 008                    3 436                  0.05
Avontuur in de woestijn    slow    485.32              121 900                   6 560                  0.05
De jongen die wolf riep    normal  407.31              30 212                    2 067                  0.07
De jongen die wolf riep    slow    652.41              100 858                   5 726                  0.06
De muzikanten van Bremen   normal  504.74              29 586                    1 899                  0.06
De muzikanten van Bremen   slow    443.77              75 898                    5 044                  0.07
Een hut in het bos         normal  437.98              51 495                    2 689                  0.05
Een hut in het bos         slow    649.06              126 312                   6 914                  0.05
Luna gaat op paardenkamp   normal  469.10              80 559                    3 867                  0.05
Luna gaat op paardenkamp   slow    544.66              113 051                   4 972                  0.04
Pier naar het feest        normal  542.80              16 920                    869                    0.05
Pier naar het feest        slow    421.15              25 383                    1 478                  0.06
Ridder Muis                normal  689.61              161 045                   7 795                  0.05
Ridder Muis                slow    621.18              225 933                   13 438                 0.06
Spik en Spek               normal  569.48              29 479                    1 619                  0.05
Spik en Spek               slow    587.93              42 890                    2 353                  0.05
De luie stoel              normal  539.33              78 909                    4 494                  0.06
De luie stoel              slow    519.35              146 694                   7 344                  0.05
Het voetbaltoneel          normal  548.95              92 311                    5 255                  0.06
Het voetbaltoneel          slow    533.85              164 832                   8 318                  0.05
Table 5.1: Memory usage, processing times and speed of Sphinx for several alignment tasks, on
both normal and slow pace audio
5.2.3
First Results
We ran the application on both audio versions of each of the aforementioned books, to
create the automatically generated transcriptions, and then ran the TEST CONVERT and
TEST commands to verify the transcription’s similarity to the manually transcribed file,
see Sections 4.3 and 4.4 for more information on the used commands.
The difference between the transcriptions is measured in milliseconds, word-per-word.
For each word the difference between both transcriptions’ start times and stop times is
calculated separately, and we take the mean over all the words that appear in both files.
We decided to calculate the average of start and stop times for each word separately when
we discovered, after careful manual inspection of the very first results, that Sphinx-4 has
the tendency to allow more pause at the front of a word than at the end. In other words,
Figure 5.5: Processing times for the automatic alignment performed on normal pace book
it has the tendency to start highlighting a word in the pause before it is spoken, but stops
the highlighting of the word more neatly after it is said, see Section 5.5.
Figure 5.6 shows the average difference of the start and stop times for each word, for the
books read at normal pace, between the files provided by Playlane and the automatically
generated transcription provided by our application.
The acceptable average time differences for normal pace audio are shown in Figure 5.7,
together with the absolute maximum time difference. We consider all time differences less
than one second to be acceptable. The maximum time differences lie between one and
ten seconds for all six books, and can be explained by the long pauses at the end and
beginning of new paragraphs in the books. The books “De jongen die wolf riep”, “De
muzikanten van Bremen”, “De luie stoel” and “Het voetbaltoneel” deviate too far from
the manual transcription times to be usable for an application that generates automatic
synchronization between audio files and text files.
There are six out of eight books read by V. that have timings that are synchronized
with on average less than one second of difference between the output from our system
and the one provided by Playlane. We take a closer look at each word’s time deviation in
Section 5.5 for one of these books, namely “Ridder Muis”. There is no apparent reason
why the other two books have a worse alignment accuracy; there are no general differences
between those two books and the other six. We even took a look at the original audio
file format, which we altered to .wav for Sphinx, in case this had any influence on the
alignment. Table 5.2 shows the original audio file formats for each book. There appears
Figure 5.6: The mean start and stop time differences between the automatically generated
alignment and the Playlane timings, for the books read at normal pace
Figure 5.7: The acceptable mean start and stop time differences between the automatically
generated alignment and the PlayLane timings, for the books read at normal pace
to be no link between whether the input file was an .mp3 file or a .wav file originally, and
the accuracy of the alignment results. There appears to be no way to determine whether, when
running the alignment task for a specific book, Sphinx-4 will return acceptable alignment
results. However, we output the number of missing words at the end, i.e. the words at
the end of the input text that Sphinx did not include in the alignment, and there appears
to be a correlation between this number and the alignment accuracy, see Table 5.2. This
might be useful to decide whether the alignment result will be accurate enough to be
considered for use.
book title                         audio file format   missing words at end   total words in text
Avontuur in de woestijn            mp3                 0                      1254
De jongen die wolf riep            wav                 78                     565
De muzikanten van Bremen           wav                 216                    594
De luie stoel                      wav                 392                    1131
Een hut in het bos                 wav                 0                      961
Het voetbaltoneel                  mp3                 312                    1286
Luna gaat op paardenkamp           mp3                 0                      1853
Pier naar het feest                mp3                 0                      243
Ridder Muis                        wav                 0                      2520
Spik en Spek: Een lek in de boot   wav                 0                      449
Table 5.2: Table showing possibilities that might help determine whether the alignment result
will be accurate
The two books read by S. have the highest average start and stop time difference,
which is why we have decided to train the acoustic model more on her voice. More
explanation on this training, and the accomplished results, can be found in Section 5.6.
The results we got for the slow readings of the books are disastrous, see Figure 5.8.
Apart from the book “Luna gaat op paardenkamp”, all have an average start and stop
time deviation of at least 20 seconds. There seems to be no obvious cause why one
book has a larger average time deviation than the other, as can be seen in Figures B.1
and B.2 in Appendix B. It appears these bad results can only be explained by Sphinx’s
apparent difficulty with handling pauses in audio, especially long pauses. We considered
that perhaps the reason that the book “Luna gaat op paardenkamp” performs so well on
Figure 5.8: The mean start and stop time differences between the automatically generated
alignment and the Playlane timings, for the books read at slow pace
its slowly read version is that the slow version is only 140% longer than the normal
pace version, as seen in Figure 5.2. However, the book “Ridder Muis” has the same percentage
of audio file length difference as “Luna gaat op paardenkamp”, which would mean that
“Ridder Muis” ought to give acceptable results as well, which is obviously not the case.
5.3
Meddling with the Pronunciation Dictionaries
The first idea we had to increase the accuracy of the synchronisation timings was to
add the words from the books that are missing from the pronunciation dictionary. Sphinx helps
us by warning us when this occurs; e.g. when running the application for the book “De jongen
die wolf riep”, we got a warning as part of the command line output, see Figure 5.9.
18:44:22.254 WARNING dictionary (AllWordDictionary) Missing word: slaapschaap
Figure 5.9: Example of a warning for a missing word
Table 5.3 shows which books had missing words, which words those were, and how
often they appeared in the input text of the book.
book titles                 #words missing   missing word     #times the word appears   total #words in book
Avontuur in de woestijn     1                zorahs           5                         1254
De jongen die wolf riep     1                slaapschaap      1                         565
De muzikanten van Bremen    1                kuuuuuu          1                         594
Luna gaat op paardenkamp    1                controle         1                         1853
De luie stoel               1                babien           39                        1131
Het voetbaltoneel           2                fwiet, tuinhok   2                         1286
Table 5.3: Table containing information on words missing from the pronunciation dictionary
Figure 5.10: The mean start time differences between the automatically generated alignment
using the dictionaries with words missing, and using the dictionaries with missing words added,
for books read at slow and normal pace
Figure 5.10 shows the difference, in milliseconds, between the mean start time deviation
obtained when using the dictionary with missing words (this is the dictionary
we used for the previous tests) and the one obtained when using the dictionary to which we added the missing
words from Table 5.3 and their pronunciations. We show this difference for the mean start
Figure 5.11: The mean start and stop, and maximum time differences between the automatically
generated alignment and the PlayLane timings for the normal pace “Ridder Muis” book, using
a dictionary that is missing the word “muis”
times of both the audio read at normal pace and at slow pace. The mean stop time difference
is nearly the same as the mean start time difference, and thus adds no extra value to the
graph.
It is clear from Figure 5.10 that, when a word is missing from the pronunciation dictionary, adding this word does provide a better synchronisation, especially when that word
is used often throughout the book’s text. As can be seen, the mean start time difference
for the book “De luie stoel”, which had the highest number of appearances of the missing word in its
text, has improved by almost five seconds for the normal pace version, and by over 20 seconds for
the slow pace version of the read book.
As a reference, we also performed this test on the book “Ridder Muis”, by deleting
the word “muis” from the pronunciation dictionary and comparing the timing results.
The word “muis” appears 123 times in the book contents, which contain 2501 words in
total. The average start and stop time differences can be seen in Figure 5.11, as well as
the maximum time difference between the automatically generated synchronisations and
the PlayLane timings.
As can be seen from both previous results (Figures 5.10 and 5.11), it is important to
add missing words to the pronunciation dictionary, especially if the missing word occurs
frequently in the book’s text.
5.4
Accuracy of the Input Text
After running the tests to verify the increased accuracy achieved by adding missing words
to the pronunciation dictionary, we thought it might be interesting to know how well our
ASR system performs when there are words missing from the input text. We therefore
decided to remove the word “muis” from the “Ridder Muis” input text. The results can be
found in Figure 5.12, and clearly show that the most accurate synchronisation results are
to be had when the input text file represents the actually spoken text as well as possible.
Figure 5.12: The mean start and stop, and maximum time differences between the automatically
generated alignment and the PlayLane timings for the normal pace “Ridder Muis” book, with
the word “muis” missing from the input text
Comparing Figures 5.11 and 5.12 shows us that the correctness of the input text has
a higher influence on the accuracy of the synchronisation result than the content of the
pronunciation dictionary, as using a lacking dictionary gives a mean start time difference
of around 500 milliseconds, while using a lacking input text gives a mean start time
difference of around 7000 milliseconds. It is thus more important to make sure to pass on
an accurate input text than to make sure the pronunciation dictionary contains all the
words in that input text.
5.5
A Detailed Word-per-Word Time Analysis
We now take a closer look at the synchronisation results for a specific book, namely “Ridder
Muis”. Figure 5.13 shows the start and stop time deviation for each word separately. The
negative values mean that the start or stop time in our application’s result starts earlier
than the time for the same word in the Playlane file. Positive values indicate that the
time from our application was later than the time from the same word in the Playlane
synchronisation. The horizontal axis in the figure has one word as its basic unit, thus,
every value on the horizontal axis represents the timing deviation for one word.
Figure 5.13: Time deviations for each word between the automatically generated alignment and
the Playlane timings, for the normal pace “Ridder Muis” book
As can be seen from Figure 5.13, there is a much higher proportion of negative values
than positive values. This indicates Sphinx’s preference to start a word in the pause before
the word is actually spoken. This preference can also be noted from the fact that the dark
line (indicating the deviation on start times) is overall more prominent on the graph, indicating
generally higher values than the stop time deviations.
There are a number of outliers, in the negative as well as the positive range, which can
usually be attributed to errors in the input file. The first outlier, in the positive deviation
range, happens at the text “Maar dan...” in the book contents. We cannot readily explain
why Sphinx struggles to align these words, except that both words are pronounced very
slowly and with a high level of anticipation. The first negative outlier happens on
the text “Draak slaakt een diepe zucht.”, where Sphinx quickly passes over the first four
words and pauses on “zucht” until that word is said, for no apparent reason. The biggest
outlier happens on the text “Het geluid komt uit haar eigen buik.”, and we believe it is
caused by the long (3490 milliseconds) pause between the previous sentence and this one.
But as can be seen from the figure, the alignment always rectifies itself after each
outlier, and goes back to, on average, one second of time difference.
5.6
Training the Acoustic Model
5.6.1
How to Train an Acoustic Model
To adapt an acoustic model by training it on certain data, all one needs is some audio
data and the corresponding transcriptions. The user will want to divide the audio data
into sentence segments, e.g. using the tool SoX [58] we mentioned before, or the Audacity
software [3], which easily allows the user to separate an audio file into several smaller parts.
Due to its visualisation of the audio, it is easy to see when sentences start or stop. For
more information about the necessary characteristics of the .wav files and the mentioned
tools we used, see Appendix D.
These smaller audio files should have the following names: <<filename>>_0xx.wav;
where <<filename>> refers to, for example, the original audio or book title, to easily
group and recognize which files contain which data, and the _0xx part should be different
for each file. Each audio file must have a unique file name.
Next, we created a <<filename>>.fileids file, which contains the name of each
audio file we wanted to use for the training (without its extension), see Figure 5.14.
<<filename>>_001
<<filename>>_002
<<filename>>_003
<<filename>>_004
<<filename>>_0xx
Figure 5.14: Example contents of a .fileids file
The final file we needed, to perform acoustic model training, is the <<filename>>.transcription file, which contains the transcription of each audio file mentioned in the
.fileids file. Figure 5.15 shows an example content of such a file. The transcription of
each audio fragment must be inserted between <s> and </s>, followed by the file name
of the corresponding audio fragment between parentheses.
<s>
<s>
<s>
<s>
<s>
tot het weer ruzie is </s> (<<filename>>_001)
die zijn weer vrienden </s> (<<filename>>_002)
die is dan weer weg </s> (<<filename>>_003)
en met de kras </s> (<<filename>>_004)
die is dan weer dun </s> (<<filename>>_0xx)
Figure 5.15: Example contents of a .transcription file
After the creation of these files, training the acoustic model can be done by following
the steps below.
1. We created a new folder to insert all the files, e.g. “adapted model”.
2. We created a new folder, called “bin”, inside folder “adapted model” and inserted
the binary files of Sphinxbase [12] and SphinxTrain [13] in this “bin” folder.
3. We created another folder inside the “adapted model” folder, called “original”. We
inserted the acoustic model we wished to alter, i.e. the ‘original’ acoustic model.
(The pronunciation dictionary need not be added here.)
4. We then created another folder inside the “adapted model” folder, called “adapted”.
We inserted the original acoustic model in this folder as well. This folder will
ultimately contain the adapted acoustic model.
5. Then we inserted the following files to the “adapted model” folder:
• <<filename>>.fileids;
• <<filename>>.transcription;
• the several .wav files; and,
• the pronunciation dictionary (e.g., the “celex.dic” file, which is the pronunciation dictionary we use for our application). Note that for the pronunciation
dictionary to be usable for training the acoustic model, it only needs to contain
all the words in the transcription file.
6. The internal folder structure should now correspond to Figure 5.16.
We could then start with the actual adaptation of the acoustic model. For this, we
needed to open a command prompt in the “adapted model” folder.
1. We first needed to create the feature files (.mfc files) from the .wav files, using the
following command:
$> bin/sphinx_fe -argfile original/feat.params -samprate 16000 -c
<<filename>>.fileids -di . -do . -ei wav -eo mfc -mswav yes
Now, for each .wav file, there should be a corresponding .mfc file in the “adapted model” folder.
2. The next step is to create some statistics for the adaptation of the acoustic model.
We used the tool bw and ran the following command (we wrote the options on separate lines for readability purposes only):
$> bin/bw
-hmmdir original
-moddeffn original/mdef
-ts2cbfn .cont.
-feat 1s_c_d_dd
-cmn current
-agc none
-dictfn <<celex.dic>>
-ctlfn <<filename>>.fileids
-lsnfn <<filename>>.transcription
-accumdir .
We ran this command with the “celex.dic” pronunciation dictionary, but any dictionary can be used. The dictionary that is needed depends on the language of
the acoustic model and the data it will be trained on.
3. Next, we needed to perform a maximum-likelihood linear regression (MLLR) transformation. This is a small adaptation to the acoustic model, and is needed when
the amount of available data is limited.
$> bin/mllr_solve
-meanfn original/means
-varfn original/variances
-outmllrfn mllr_matrix
-accumdir .
4. Then we needed to update the acoustic model, using maximum a posteriori (MAP)
adaptation:
$> bin/map_adapt
-meanfn original/means
-varfn original/variances
-mixwfn original/mixture_weights
-tmatfn original/transition_matrices
-accumdir .
-mapmeanfn adapted/means
-mapvarfn adapted/variances
-mapmixwfn adapted/mixture_weights
-maptmatfn adapted/transition_matrices
5. The “adapted” folder will now contain the adapted acoustic model. This can be
verified by checking the date of modification for the “means”, “variances”, “mixture weights” and “transition matrices” files.
For more information on how to adapt an acoustic model, see [60].
|--adapted_model
|
|--adapted
| |--feat.params
| |--mdef
| |--means
| |--mixture_weights
| |--noisedict
| |--transition_matrices
| |--variances
|
|--bin
| |-- .exe files
| |-- .dll files
|
|--original
| |--feat.params
| |--mdef
| |--means
| |--mixture_weights
| |--noisedict
| |--transition_matrices
| |--variances
|
|-- pronunciation file, e.g. celex.dic
|--<<filename>>0xx.wav files
|--<<filename>>.fileids
|--<<filename>>.transcription
Figure 5.16: Example structure of the “adapted model” folder
5.6.2
Results with Different Acoustic Models
As mentioned before, we wanted to train our acoustic model on S.’s voice, as it seemed
to give the worst results in the alignment task. We trained the acoustic model on a book
called “Wolf heeft jeuk”, which is also read by S. but is not part of the test data. We
trained it on the last chapter of the book, which contained 13 sentences with a total of 66
words, covering 24 seconds of audio. We also trained the original acoustic model on a
part of the book “De luie stoel”. We used 23 sentences with a total of 95 words, covering
29 seconds of audio from this book. The alignment results we achieved when using the
trained acoustic models can be seen in Figure 5.17 for the books read at normal pace,
and in Figure 5.18 for the slow pace.
Figure 5.17 shows some definite improvements for both books read by S., which is
Figure 5.17: Mean start time difference of each normal pace book, using the original acoustic
model, the acoustic model trained on “Wolf heeft jeuk”, or the acoustic model trained on “De
luie stoel”
as we expected, though the mean time difference is still almost 20 seconds for “De luie
stoel” and over 40 seconds for “Het voetbaltoneel” for the alignment results we achieved
using the “Wolf heeft jeuk” acoustic model. The average time difference for the book “De
luie stoel” is only around 300 milliseconds when we perform the alignment task using the
acoustic model trained on “De luie stoel”; and also the alignment for “Het voetbaltoneel”
has improved in comparison to when we use the model trained on “Wolf heeft jeuk”, with
only around 30 seconds of average time difference instead of 40.
This means the alignment results with the acoustic model trained on “Wolf heeft jeuk”
are still not acceptable, but show promising results for the acoustic model if it were to
be further trained on that book. The alignment results achieved by using the acoustic
model trained on book “De luie stoel” are near perfect for that book, and also provide
an improvement for the book “Het voetbaltoneel”. This allows us to believe that further
training an acoustic model on S.’s voice will achieve much improved alignment results for
books read by S..
Of the books that are read by V., some have around the same accuracy with the newly
trained acoustic models as with the old one, and others have a better accuracy. But, as four
Figure 5.18: Mean start time difference of each slow pace book, using the original acoustic
model, the acoustic model trained on “Wolf heeft jeuk”, or the acoustic model trained on “De
luie stoel”
out of eight books have a worse accuracy, we can conclude that, in general, the acoustic
model trained on S.’s voice has a negative influence on the accuracy for the books read by V.
As can be seen from Figure 5.18, the newly trained acoustic model has even worse
alignment accuracy for the books read at slow pace. For example, the book “Ridder
Muis” has a mean time difference of 450 seconds (7.5 minutes) with the trained model.
Considering the results we achieved on the books read by S. at normal pace, by training
the acoustic model on her voice, it might be a good idea to train an acoustic model on one
of these slow pace versions of the audio books, and see if Sphinx can then better recognize
pauses between words.
5.7
The Sphinx-4.5 ASR Plugin
During the writing of this dissertation, CMU released a new update to their Sphinx
project. This Sphinx-4.5 was released in February and is a pre-alpha release, but we
decided to have a look at its alignment abilities for future reference. The results we
achieved can be found in Figures 5.19 and 5.20 for the books read at normal pace, and in
Figure 5.21 for the books read at slow pace.
Figure 5.19: The mean start time differences (in milliseconds) for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at normal pace
Figure 5.20: The acceptable mean start time differences (in milliseconds) for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at normal pace
Figure 5.21: The mean start time differences (in milliseconds) for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at slow pace
The configuration settings can be found in Appendix C. However, since Sphinx-4.5 uses
these values internally as defaults, the configuration file does not need to be passed on the
command line when running the application with this plugin. The paths to the acoustic
model and the pronunciation dictionary do need to be specified in the command.
As can be seen in Figure 5.19, the pre-alpha release achieves worse accuracy than
the Sphinx-4 plugin on four books, namely “De jongen die wolf riep” and “De muzikanten
van Bremen” (both read by V.), and “De luie stoel” and “Het voetbaltoneel” (read by S.).
The other six books, which already had good transcription accuracy with the Sphinx-4
plugin, now have even better accuracy, as seen in Figure 5.20.
What we can conclude from Figure 5.21 is that the Sphinx-4.5 ASR plugin provides
overall better alignment accuracy than the Sphinx-4 plugin for the books read at a slow
pace. For five books, the mean time difference is even less than one second.
5.7.1 Sphinx-4.5 with Different Acoustic Models
We now also compare the results for the alignment task when we use the acoustic models
we trained in the previous section. The results can be found in Figures 5.22, 5.23 and 5.24.
Figure 5.22: The mean start time differences (in milliseconds) for the Sphinx-4.5 plugin, for the books read at normal pace, using the three different acoustic models
Figure 5.23: The mean start time differences (in milliseconds) for the Sphinx-4.5 plugin, for the six books read at normal pace that usually achieve acceptable accuracy, using the three different acoustic models
As can be seen from Figures 5.22 and 5.23, Sphinx-4.5 also achieves better alignment
accuracy with the trained acoustic models for the books read by S. Moreover, for the books
shown in Figure 5.23, Sphinx-4.5 performs nearly as well with the acoustic models trained
on S.’s voice as with the original acoustic model (disregarding the outlier for “Luna gaat
op paardenkamp” with the acoustic model trained on “Wolf heeft jeuk”).
Figure 5.24: The mean start time differences (in milliseconds) for the Sphinx-4.5 plugin, for the books read at slow pace, using the three different acoustic models
The effects of the trained acoustic models on the books read at a slow pace can be
seen in Figure 5.24, and they are very irregular. For some books a trained acoustic model
performs better, for other books it performs worse than the original acoustic model. In
general, we can say that the acoustic model trained on “Wolf heeft jeuk” performs worst
of the three models. There are no apparent similarities between the alignment accuracy
results for the Sphinx-4.5 plugin and those for the Sphinx-4 plugin (see Figure 5.18).
5.8 Alignment Results for English Text and Audio
We have also tried our application and ASR plugin on English text and audio, as English
is a better-researched language, owing to the larger amount of available English audio and text.
It is difficult to find a word-per-word alignment baseline to compare our results against,
since such baselines are often created manually, which is highly work- and time-intensive,
and they are therefore rarely made freely available.

However, we were able to perform the alignment task on a number of English books,
such as “The curious case of Benjamin Button” by F. Scott Fitzgerald and “Moby Dick”
by Herman Melville. Both books are in the public domain, as their copyright has expired.
When inspecting the generated EPUB files, we concluded that, for both British and
American voices, the alignment results are near perfect.
These generated EPUB files can be found online at http://1drv.ms/1k0f258.
Chapter 6
Conclusions and Future Work
The goal of this dissertation was to investigate whether performing an alignment task
automatically, instead of manually, lies within the realm of the possible. To this end, we
created a software application that allows its user to simply switch between different ASR
systems through the use of plugins. We provide extra flexibility by offering two different
output formats (a general subtitle file and an EPUB file), and by making the creation of
a new output format as simple as possible. To support speech-text alignment in the EPUB
format, we extended the existing EPUB library with a media-overlay option.
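As an illustration of the plugin approach, the following sketch shows the general shape such an interchangeable ASR component can take. The names AsrAlignerPlugin and AlignedWord are hypothetical; this is not the actual interface of our application described in Chapter 4, it only conveys the idea that each ASR system is hidden behind one small interface whose word timings the subtitle and EPUB media-overlay writers consume.

    import java.io.File;
    import java.util.List;

    // Hypothetical sketch of the plugin idea (not the actual interface of our
    // application): every ASR system is wrapped behind the same small interface,
    // so the rest of the pipeline, including the subtitle and EPUB media-overlay
    // writers, never needs to know which recognizer produced the timings.
    interface AsrAlignerPlugin {
        // Returns one timed entry per word of the input text, aligned to the audio file.
        List<AlignedWord> align(File audioFile, String inputText) throws Exception;
    }

    // Hypothetical value class describing one aligned word with its start and
    // stop times in milliseconds, as consumed by the output writers.
    final class AlignedWord {
        final String word;
        final long startMs;
        final long stopMs;

        AlignedWord(String word, long startMs, long stopMs) {
            this.word = word;
            this.startMs = startMs;
            this.stopMs = stopMs;
        }
    }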
From the results in the previous chapter, obtained with the CMU Sphinx ASR plugin, we
conclude that it is indeed possible to automatically generate an alignment of audio and
text that is accurate enough for practical use (e.g., our test results show, on average, less
than one second of difference between the automatic alignment and a pre-existing baseline).
However, there is still work to be done, especially for under-resourced languages such
as Dutch. We achieved positive results when training the acoustic model on (less than
60 seconds of) audio data that corresponded to the person or type of book we wanted
to increase alignment accuracy for. Our first suggestion for future work is therefore to
further train the acoustic model for Dutch, especially when one has a clearly defined type
of alignment task to perform. For example, the PlayLane company works with a set of
around 20 voice actors. Based on the results we achieved when training the acoustic
model, we believe that training an acoustic model for each of these actors would greatly
increase the accuracy of alignment tasks on audio fragments read by these people, when
the corresponding acoustic model is used. Considering that it can take days to manually
align an audiobook, this small training effort appears highly beneficial, given the gain in
time one can achieve by automatically generating an accurate alignment. A trained model
can also achieve accurate results on multiple books, meaning that it is not necessary to
train a new acoustic model for every new alignment task.
As can clearly be concluded from the alignment results we achieved for the books
read at a slow pace, Sphinx has a number of definite issues with aligning pauses and
silences. Sphinx might, for example, claim a word starts in the pause before it is actually
spoken, or fail to recognize small pauses between words. It can also misrecognize pauses as
either longer or shorter than they actually are. We therefore propose a further examination
of why Sphinx exhibits these problems. It is highly likely that the data originally used to
build and train the acoustic model consisted mainly of adults’ speech, which tends to keep
a fast pace. This means that these difficulties might also be alleviated by training the
acoustic model on audio containing a large amount of silences and pauses. It is, however,
also possible that the problem occurs in the front-end processing of the audio data; taking
a closer look at how Sphinx operates might help discover why silences form such a big issue.
We mentioned before, in Section 5.2.3, that the number of words at the end of the
input file that Sphinx did not include in the alignment provides a fair indication of the
alignment’s accuracy. It might be interesting to investigate whether this is caused by a
single word that Sphinx has difficulty aligning, which then causes the other words to be
misaligned in turn, or whether words drift progressively further out of alignment until there
is no more audio left while some input text remains unaligned.
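As an illustration of such an investigation, a minimal sketch could simply report the first word whose start time deviates from the baseline by more than a chosen threshold; the method and array names below are hypothetical.

    // Sketch of the proposed investigation: report the index of the first word whose
    // start time deviates from the baseline by more than a chosen threshold, to see
    // whether the words dropped at the end of a book trace back to a single problem
    // word or to a gradual drift. Arrays hold start times in milliseconds; index i
    // refers to the i-th word of the input text. Illustrative only.
    final class DriftLocator {
        static int firstDriftingWordIndex(long[] generatedStartsMs, long[] baselineStartsMs, long thresholdMs) {
            int n = Math.min(generatedStartsMs.length, baselineStartsMs.length);
            for (int i = 0; i < n; i++) {
                if (Math.abs(generatedStartsMs[i] - baselineStartsMs[i]) > thresholdMs) {
                    return i; // first word exceeding the threshold
                }
            }
            return -1; // no word exceeds the threshold
        }
    }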
We also note that the accuracy of the input text and the coverage of the pronunciation
dictionary highly influence the accuracy of the alignment output. From our tests, we can
conclude that it is best not to have words missing from the input text or the pronunciation
dictionary. There is a clear need for a more robust system, with fewer unexplainable outlying
results. We propose to increase the robustness of our application by comparing the
alignment results created by two or more different ASR plugins; results that overlap within
a certain error range can be considered ‘correct’. This approach is based on the approach
followed in [19].
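A minimal sketch of this idea, under the simplifying assumptions that both plugins return the same word sequence and that only start times are compared, could look as follows; the names are hypothetical.

    // Illustrative sketch of the proposed robustness check: run two ASR plugins on
    // the same book and mark the words for which both alignments agree within a
    // tolerance; only those words are considered 'correct'. Index i refers to the
    // i-th word of the input text; arrays hold start times in milliseconds.
    final class AgreementFilter {
        static boolean[] agreementMask(long[] startsPluginA, long[] startsPluginB, long toleranceMs) {
            int n = Math.min(startsPluginA.length, startsPluginB.length);
            boolean[] agreed = new boolean[n];
            for (int i = 0; i < n; i++) {
                agreed[i] = Math.abs(startsPluginA[i] - startsPluginB[i]) <= toleranceMs;
            }
            return agreed;
        }
    }

Words for which the two plugins disagree would still need further handling, for example by falling back to the plugin that is known to be more accurate for that type of book.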
It is our belief that the system we designed provides a flexible approach to speech-text
alignment and, as it can be adapted to the user’s preferred ASR system, might benefit
users who previously performed the alignment task manually.
Appendix A
Configuration File Used for
Recognizing Dutch
<?xml version="1.0" encoding="UTF-8"?>
<config>
<!-- ************************************************** -->
<!-- Global Properties -->
<!-- ************************************************** -->
<property name="logLevel" value="WARNING"/>
<property name="absoluteBeamWidth" value="-1"/>
<property name="relativeBeamWidth" value="1E-300"/>
<property name="wordInsertionProbability" value="1.0"/>
<property name="languageWeight" value="10"/>
<property name="addOOVBranch" value="true"/>
<property name="frontend" value="epFrontEnd"/>
<property name="recognizer" value="recognizer"/>
<property name="showCreations" value="false"/>
<property name="outOfGrammarProbability" value="1E-26" />
<property name="phoneInsertionProbability" value="1E-140" />
<component name="recognizer"
type="edu.cmu.sphinx.recognizer.Recognizer">
<property name="decoder" value="decoder"/>
<propertylist name="monitors">
<item>accuracyTracker </item>
<item>speedTracker </item>
<item>memoryTracker </item>
</propertylist>
</component>
<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
<property name="searchManager" value="searchManager"/>
</component>
<component name="searchManager"
type="edu.cmu.sphinx.decoder.search.AlignerSearchManager">
<property name="logMath" value="logMath"/>
<property name="linguist" value="aflatLinguist"/>
<property name="pruner" value="trivialPruner"/>
<property name="scorer" value="threadedScorer"/>
<property name="activeListFactory" value="activeList"/>
</component>
<component name="activeList"
type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
<property name="logMath" value="logMath"/>
<property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
<property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>
<component name="trivialPruner" type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>
<component name="threadedScorer"
type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
<property name="frontend" value="${frontend}"/>
</component>
<component name="aflatLinguist" type="edu.cmu.sphinx.linguist.aflat.AFlatLinguist">
<property name="logMath" value="logMath" />
<property name="grammar" value="AlignerGrammar" />
<property name="acousticModel" value="wsj" />
<property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
<property name="languageWeight" value="${languageWeight}" />
<property name="unitManager" value="unitManager" />
<property name="addOutOfGrammarBranch" value="${addOOVBranch}" />
<property name="phoneLoopAcousticModel" value="WSJ" />
<property name="outOfGrammarProbability" value="${outOfGrammarProbability}" />
<property name="phoneInsertionProbability" value="${phoneInsertionProbability}"/>
<property name="dumpGStates" value="true" />
</component>
<component name="AlignerGrammar"
type="edu.cmu.sphinx.linguist.language.grammar.AlignerGrammar">
<property name="dictionary" value="dictionary" />
<property name="logMath" value="logMath" />
<property name="addSilenceWords" value="true" />
<property name="allowLoopsAndBackwardJumps" value="allowLoopsAndBackwardJumps"/>
<property name="selfLoopProbability" value="selfLoopProbability" />
<property name="backwardTransitionProbability"
value="backwardTransitionProbability"/>
</component>
<!-- ******************* -->
<!-- DICTIONARY SETTINGS -->
<!-- ******************* -->
<component name="dictionary"
type="edu.cmu.sphinx.linguist.dictionary.AllWordDictionary">
<property name="dictionaryPath" value="resource:/nl/dict/celex.dic"/>
<property name="fillerPath" value="resource:/nl/noisedict"/>
<property name="dictionaryLanguage" value="NL"/>
<property name="addSilEndingPronunciation" value="true"/>
<property name="wordReplacement" value="<sil>"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="wsj"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
<property name="loader" value="wsjLoader"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="wsjLoader"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
<property name="logMath" value="logMath"/>
<property name="unitManager" value="unitManager"/>
<property name="location" value="resource:/nl"/>
</component>
<component name="unitManager" type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>
<!-- additions start-->
<component name="WSJ"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
<property name="loader" value="WSJLOADER" />
<property name="unitManager" value="UNITMANAGER" />
</component>
<component name="WSJLOADER"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
<property name="logMath" value="logMath" />
<property name="unitManager" value="UNITMANAGER" />
<property name="location"
value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz"/>
</component>
<component name="UNITMANAGER" type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>
<!-- additions end -->
<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>audioFileDataSource </item>
<item>dataBlocker </item>
<item>preemphasizer </item>
<item>windower </item>
<item>fft </item>
<item>melFilterBank </item>
<item>dct </item>
<item>liveCMN </item>
<item>featureExtraction </item>
</propertylist>
</component>
<component name="audioFileDataSource"
type="edu.cmu.sphinx.frontend.util.AudioFileDataSource"/>
<component name="dataBlocker" type="edu.cmu.sphinx.frontend.DataBlocker"/>
<component name="speechClassifier"
type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier"/>
<component name="nonSpeechDataFilter"
type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>
<component name="speechMarker" type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker"/>
<component name="preemphasizer" type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower"
type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower"/>
<component name="fft"
type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>
<component name="melFilterBank"
type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank"/>
<component name="dct"
type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>
<component name="liveCMN" type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>
<component name="featureExtraction"
type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>
<component name="logMath" type="edu.cmu.sphinx.util.LogMath">
<property name="logBase" value="1.0001"/>
<property name="useAddTable" value="true"/>
</component>
<!-- ******************************************************* -->
<!-- monitors -->
<!-- ******************************************************* -->
<component name="accuracyTracker"
type="edu.cmu.sphinx.instrumentation.BestPathAccuracyTracker">
<property name="recognizer" value="${recognizer}" />
<property name="showAlignedResults" value="false" />
<property name="showRawResults" value="false" />
</component>
<component name="memoryTracker" type="edu.cmu.sphinx.instrumentation.MemoryTracker">
<property name="recognizer" value="${recognizer}" />
<property name="showSummary" value="true" />
<property name="showDetails" value="false" />
</component>
<component name="speedTracker" type="edu.cmu.sphinx.instrumentation.SpeedTracker">
<property name="recognizer" value="${recognizer}" />
<property name="frontend" value="${frontend}" />
<property name="showSummary" value="true" />
<property name="showDetails" value="false" />
</component>
</config>
Appendix B
Slow Pace Audio Alignment Results
As explained in Section 5.2.3, there appears to be no obvious cause for why one book has
a larger average time difference than another. To visualize this, we have sorted the
alignment results by several criteria.
Figure B.1 shows the alignment accuracies for the books, sorted from smallest input
text size to largest text size.
Figure B.1: The mean start and stop time differences (in milliseconds) between the automatically generated alignment and the PlayLane timings, for the books read at slow pace, sorted by input text size
Figure B.2 shows the alignment accuracies for each book, sorted from smallest audio
length to largest audio length.
Figure B.2: The mean start and stop time differences (in milliseconds) between the automatically generated alignment and the PlayLane timings, for the books read at slow pace, sorted by audio length
As the accuracy results show no pattern in either figure, we conclude that, due to the
long pauses in the audio files, Sphinx-4 cannot provide accurate alignment results for the
books read at slow pace.
Appendix C
Configuration File Used by
Sphinx-4.5 for Dutch Audio
<?xml version="1.0" encoding="UTF-8"?>
<config>
<property name="logLevel" value="WARNING"/>
<property name="absoluteBeamWidth" value="50000"/>
<property name="relativeBeamWidth" value="1e-80"/>
<property name="absoluteWordBeamWidth" value="1000"/>
<property name="relativeWordBeamWidth" value="1e-60"/>
<property name="wordInsertionProbability" value="0.1"/>
<property name="silenceInsertionProbability" value="0.1"/>
<property name="fillerInsertionProbability" value="1e-2"/>
<property name="languageWeight" value="12.0"/>
<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
<property name="decoder" value="decoder"/>
</component>
<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
<property name="searchManager" value="wordPruningSearchManager"/>
</component>
<component name="simpleSearchManager"
type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
<property name="linguist" value="flatLinguist"/>
<property name="pruner" value="trivialPruner"/>
<property name="scorer" value="threadedScorer"/>
<property name="activeListFactory" value="activeList"/>
</component>
<component name="wordPruningSearchManager"
type="edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager">
<property name="linguist" value="lexTreeLinguist"/>
<property name="pruner" value="trivialPruner"/>
<property name="scorer" value="threadedScorer"/>
<property name="activeListManager" value="activeListManager"/>
<property name="growSkipInterval" value="0"/>
<property name="buildWordLattice" value="true"/>
<property name="keepAllTokens" value="true"/>
<property name="acousticLookaheadFrames" value="1.7"/>
<property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>
<component name="activeList"
type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
<property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
<property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>
<component name="activeListManager"
type="edu.cmu.sphinx.decoder.search.SimpleActiveListManager">
<propertylist name="activeListFactories">
<item>standardActiveListFactory</item>
<item>wordActiveListFactory</item>
<item>wordActiveListFactory</item>
<item>standardActiveListFactory</item>
<item>standardActiveListFactory</item>
<item>standardActiveListFactory</item>
</propertylist>
</component>
<component name="standardActiveListFactory"
type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
<property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
<property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>
<component name="wordActiveListFactory"
type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
<property name="absoluteBeamWidth" value="${absoluteWordBeamWidth}"/>
<property name="relativeBeamWidth" value="${relativeWordBeamWidth}"/>
</component>
<component name="trivialPruner"
type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>
<component name="threadedScorer"
type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
<property name="frontend" value="liveFrontEnd"/>
</component>
<component name="flatLinguist"
type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
<property name="grammar" value="jsgfGrammar"/>
<property name="acousticModel" value="acousticModel"/>
<property name="wordInsertionProbability"
value="${wordInsertionProbability}"/>
<property name="silenceInsertionProbability"
value="${silenceInsertionProbability}"/>
<property name="languageWeight" value="${languageWeight}"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="lexTreeLinguist"
type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
<property name="acousticModel" value="acousticModel"/>
<property name="languageModel" value="simpleNGramModel"/>
<property name="dictionary" value="dictionary"/>
<property name="addFillerWords" value="true"/>
<property name="generateUnitStates" value="false"/>
<property name="wantUnigramSmear" value="true"/>
<property name="unigramSmearWeight" value="1"/>
<property name="wordInsertionProbability"
value="${wordInsertionProbability}"/>
<property name="silenceInsertionProbability"
value="${silenceInsertionProbability}"/>
<property name="fillerInsertionProbability"
value="${fillerInsertionProbability}"/>
<property name="languageWeight" value="${languageWeight}"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="simpleNGramModel"
type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
<property name="location" value=""/>
<property name="dictionary" value="dictionary"/>
<property name="maxDepth" value="3"/>
<property name="unigramWeight" value=".7"/>
</component>
<component name="largeTrigramModel"
type="edu.cmu.sphinx.linguist.language.ngram.large.LargeTrigramModel">
<property name="location" value=""/>
<property name="unigramWeight" value=".5"/>
<property name="maxDepth" value="3"/>
<property name="dictionary" value="dictionary"/>
</component>
<component name="alignerGrammar"
type="edu.cmu.sphinx.linguist.language.grammar.AlignerGrammar">
<property name="dictionary" value="dictionary"/>
<property name="addSilenceWords" value="true"/>
</component>
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsgf.JSGFGrammar">
<property name="dictionary" value="dictionary"/>
<property name="grammarLocation" value=""/>
<property name="grammarName" value=""/>
<property name="addSilenceWords" value="true"/>
</component>
<component name="grXmlGrammar" type="edu.cmu.sphinx.jsgf.GrXMLGrammar">
<property name="dictionary" value="dictionary"/>
<property name="grammarLocation" value=""/>
<property name="grammarName" value=""/>
<property name="addSilenceWords" value="true"/>
</component>
<component name="dictionary"
type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
<property name="dictionaryPath" value="file:models/nl/dict/celex.dic"/>
<property name="fillerPath" value="file:models/nl/noisedict"/>
<property name="addSilEndingPronunciation" value="false"/>
<property name="allowMissingWords" value="false"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="acousticModel"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
<property name="loader" value="acousticModelLoader"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="acousticModelLoader"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
<property name="unitManager" value="unitManager"/>
<property name="location" value="file:models/nl"/>
</component>
<component name="unitManager"
type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>
<component name="liveFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>dataSource </item>
<item>dataBlocker </item>
<item>speechClassifier </item>
<item>speechMarker </item>
<item>nonSpeechDataFilter </item>
<item>preemphasizer </item>
<item>windower </item>
<item>fft </item>
<item>autoCepstrum </item>
<item>liveCMN </item>
<item>featureExtraction </item>
<item>featureTransform </item>
</propertylist>
</component>
<component name="batchFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>dataSource </item>
<item>dataBlocker </item>
<item>preemphasizer </item>
<item>windower </item>
<item>fft </item>
<item>autoCepstrum </item>
<item>liveCMN </item>
<item>featureExtraction </item>
<item>featureTransform </item>
</propertylist>
</component>
<component name="dataSource"
type="edu.cmu.sphinx.frontend.util.StreamDataSource"/>
<component name="dataBlocker" type="edu.cmu.sphinx.frontend.DataBlocker"/>
<component name="speechClassifier"
type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
<property name="threshold" value="13" />
</component>
<component name="nonSpeechDataFilter"
type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>
<component name="speechMarker"
type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker" >
<property name="speechTrailer" value="50"/>
</component>
<component name="preemphasizer"
type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower"
type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
</component>
<component name="fft"
type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform">
</component>
<component name="autoCepstrum"
type="edu.cmu.sphinx.frontend.AutoCepstrum">
<property name="loader" value="acousticModelLoader"/>
</component>
<component name="batchCMN"
type="edu.cmu.sphinx.frontend.feature.BatchCMN"/>
<component name="liveCMN"
type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>
<component name="featureExtraction"
type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>
<component name="featureTransform"
type="edu.cmu.sphinx.frontend.feature.FeatureTransform">
<property name="loader" value="acousticModelLoader"/>
</component>
<component name="confidenceScorer"
type="edu.cmu.sphinx.result.MAPConfidenceScorer">
<property name="languageWeight" value="${languageWeight}"/>
</component>
</config>
Appendix D
Specifications for the .wav Files Used
for Training the Acoustic Model
The most important component for training acoustic models is, of course, the audio data. When we use
SphinxTrain, as explained in Section 5.6.1, the audio data needs to conform to the .wav file format. For
optimal training, it is best for each audio file to contain only one sentence.
It is important to note that the audio files we plan to use for training must have the same characteristics
as those used to build the acoustic model, just as the audio files we perform an alignment task on must.
The acoustic models provided by Sphinx typically expect a sampling rate of 16 kHz, 16 bits per sample,
and one channel (monophonic audio).
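As a quick sanity check, these characteristics can also be verified programmatically before training. The sketch below uses the standard javax.sound.sampled API and is merely illustrative; it is not part of SphinxTrain or of our application.

    import java.io.File;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;

    // Illustrative check (not part of SphinxTrain): verify that a .wav file has the
    // characteristics expected by the Sphinx acoustic models before using it for
    // training: 16 kHz sampling rate, 16 bits per sample, one channel.
    final class WavCheck {
        public static void main(String[] args) throws Exception {
            AudioFormat fmt = AudioSystem.getAudioFileFormat(new File(args[0])).getFormat();
            boolean ok = fmt.getSampleRate() == 16000f
                    && fmt.getSampleSizeInBits() == 16
                    && fmt.getChannels() == 1;
            System.out.println(args[0] + (ok
                    ? " is suitable for training"
                    : " does not match 16 kHz / 16 bit / mono"));
        }
    }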
D.1 SoX Tool
There are several possibilities for generating these audio files. For example, we can create the audio files
ourselves by recording our own voice reading the texts and saving the recordings as .wav files. This can be
done with the aforementioned SoX tool. SoX contains a command line tool called rec, which provides
the ability to record input from a microphone and directly save it to a .wav file with the required
characteristics:
$> rec -r 16000 -e signed-integer -b 16 -c 1 <<filename>>_0xx.wav
In the example above, speech is recorded into a .wav file, using the default input device (e.g., the
microphone), with a sampling rate of 16 kHz (-r 16000), using 16 bits per sample (-e signed-integer
-b 16), and consisting of one channel (-c 1).
The tool also offers plenty of extra features: the silence option, for example, can be used to
indicate that data should only be written to a file when audio is detected above a certain volume, and that
recording may be stopped after a specified number of seconds of silence. It is also possible to automatically
start recording to a new file after a specified number of seconds of silence. This is especially useful when
we want to record large amounts of text and do not wish to rerun the command for each sentence.
See [59] for an extended overview of the possibilities of the rec tool.
An extra advantage of the tool is that it is a command line tool, which means it can easily be used in
combination with scripting. For example, in the following command (found on the Sphinx wiki [11]), the
first 20 lines of the “arctic20.txt” file are shown to the user, one sentence at a time, and the rec
command is started to record the corresponding speech. The user can move on to the next sentence by
stopping the rec command, e.g. with CTRL+C.
for i in `seq 1 20`; do
  fn=`printf arctic_%04d $i`;
  read sent; echo $sent;
  rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null;
done < arctic20.txt
D.2 Audacity Software
Of course, we can also use existing audio files to train an acoustic model. These files will need to be
divided by sentence, and saved and converted to the correct file format. A perfect transcription is
also needed.
To divide and convert audio we can again use the SoX tool, using the silence option if preferred.
It is also possible to use a graphical tool such as Audacity [3].
Audacity is a free, open-source, cross-platform software package for recording and manipulating
sound. Considering there are typically pauses between sentences, it is often very easy to distinguish
sentences in a graphical display. In Audacity, it is possible to export parts of an audio file by simply
selecting the part of the audio signal, see Figures D.1 and D.2.
Figure D.1: How to select a sentence using Audacity
Figure D.2: How to export a sentence using Audacity

The Audacity software contains several functions, e.g., recording audio, noise cancelling, band filters,
etc. Note that it can sometimes be useful to allow a little noise in the training files, so the corresponding
acoustic model will have less trouble aligning audio that was recorded in the same noisy environment.
Bibliography
[1] Defense Advanced Research Projects Agency (DARPA)'s Effective, Affordable Reusable Speech-to-text (EARS) Kickoff Meeting, Vienna, VA, May 9-10 2002.
[2] Defense Advanced Research Projects Agency (DARPA)'s Effective, Affordable Reusable Speech-to-text (EARS) Conference, Boston, MA, May 21-22 2003.
[3] Audacity. Audacity software tool. http://audacity.sourceforge.net/.
[4] S. Axelrod, V. Goel, R. Gopinath, P. Olsen, and K. Visweswariah. Discriminative Estimation of
Subspace Constrained Gaussian Mixture Models for Speech Recognition. IEEE Transactions on
Audio, Speech, and Language Processing, 15(1):172–189, January 2007.
[5] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe. Global Optimization of a Neural Network-Hidden Markov Model Hybrid. In International Joint Conference on Neural Networks (IJCNN-91-Seattle), volume 2, pages 789–794, July 1991.
[6] J. Bilmes. Lecture 2: Automatic Speech Recognition. http://melodi.ee.washington.edu/~bilmes/ee516/lecs/lec2_scribe.pdf, 2005.
[7] H. Bourlard and C.J. Wellekens. Links Between Markov Models and Multilayer Perceptrons. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 12(12):1167–1178, December 1990.
[8] Carnegie Mellon University. CMU DICT. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
[9] Carnegie Mellon University. CMU Sphinx Documentation. http://cmusphinx.sourceforge.net/
doc/sphinx4/.
[10] Carnegie Mellon University. CMU Sphinx Forum. http://cmusphinx.sourceforge.net/wiki/
communicate/.
[11] Carnegie Mellon University. CMU Sphinx Wiki. http://cmusphinx.sourceforge.net/wiki/.
[12] Carnegie Mellon University. CMU SphinxBase. http://sourceforge.net/projects/cmusphinx/
files/sphinxbase/0.8/.
[13] Carnegie Mellon University. CMU SphinxTrain. http://sourceforge.net/projects/cmusphinx/
files/sphinxtrain/1.0.8/.
[14] L. Carrio, C. Duarte, R. Lopes, M. Rodrigues, and N. Guimares. Building rich user interfaces for
digital talking books. In Robert. J.K., Q. Limbourg, and J. Vanderdonckt, editors, Computer-Aided
Design of User Interfaces IV, pages 335–348. Springer Netherlands, 2005.
[15] S. Cassidy. Chapter 9. Feature Extraction for ASR. http://web.science.mq.edu.au/~cassidy/
comp449/html/ch09s02.html.
[16] CGN. Corpus gesproken nederlands. http://lands.let.ru.nl/cgn/ehome.htm.
[17] DAISY Consortium. Digital Accessible Information SYstem. http://www.daisy.org.
[18] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-Dependent Pre-Trained Deep Neural Networks
for Large Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language
Processing, 20(1):30–42, January 2012.
[19] B. De Meester, R. Verborgh, P. Pauwels, W. De Neve, E. Mannens, and R. Van de Walle. Improving
Multimedia Analysis Through Semantic Integration of Services. In 4th FTRA International Conference on Advanced IT, Engineering and Management, Abstracts, page 2. Future Technology Research
Association (FTRA), 2014.
[20] M. De Wachter, M. Matton, K. Demuynck, P. Wambacq, R. Cools, and D. Van Compernolle.
Template-Based Continuous Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1377–1390, May 2007.
[21] H. Dewey-Hagborg. Speech Recognition. http://www.deweyhagborg.com/learningBitByBit/speech.ppt.
[22] EEL6586. A simple coder to work in real-time between two pcs. http://plaza.ufl.edu/hsuzh/#report.
[23] D. Eldridge. Have you heard? audiobooks are booming. BookBusiness your source for publishing
intelligence, 17(2):20–25, April 2014.
[24] M. Franzini, K.-F. Lee, and A. Waibel. Connectionist Viterbi Training: A New Hybrid Method
for Continuous Speech Recognition. In International Conference on Acoustics, Speech, and Signal
Processing, 1990. ICASSP-90., volume 1, pages 425–428, April 1990.
[25] J. R. Glass. Challenges For Spoken Dialogue Systems. In Proceedings of 1999 IEEE ASRU Workshop,
1999.
[26] N. Gupta, G. Tur, D. Hakkani-Tur, S. Bangalore, G. Riccardi, and M. Gilbert. The AT&T Spoken
Language Understanding System. IEEE Transactions on Audio, Speech, and Language Processing,
14(1):213–222, January 2006.
[27] P. Haffner, M. Franzini, and A. Waibel. Integrating Time Alignment and Neural Networks for High
Performance Continuous Speech Recognition. In International Conference on Acoustics, Speech, and
Signal Processing, 1991. ICASSP-91., volume 1, pages 105–108, April 1991.
[28] H. Hermansky. Perceptual Linear Predictive (PLP) Analysis of Speech. The Journal of the Acoustical
Society of America, 87(4):1738–1752, May 1990.
[29] G. Hinton, L. Deng, D. Yu, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen,
T. Sainath, G. Dahl, and B. Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech
Recognition. IEEE Signal Processing Magazine, November 2012.
[30] D. Ho. Notepad++ Editor. http://notepad-plus-plus.org/.
[31] X. Huang, Y. Ariki, and M. Jack. Hidden Markov Models for Speech Recognition. Columbia University Press, New York, NY, USA, 1990.
[32] X. Huang, J. Baker, and R. Reddy. A Historical Perspective of Speech Recognition. Communication
ACM, 57(1):94–103, 2014.
[33] M.-Y. Hwang and X. Huang. Shared-Distribution Hidden Markov Models for Speech Recognition.
IEEE Transactions on Speech and Audio Processing, 1(4):414–420, October 1993.
[34] H. Jiang, X. Li, and C. Liu. Large Margin Hidden Markov Models for Speech Recognition. IEEE
Transactions on Audio, Speech, and Language Processing, 14(5):1584–1595, September 2006.
[35] A. Katsamanis, M.P. Black, P. G. Georgiou, L. Goldstein, and S. Narayanan. Sailalign: Robust
long speech-text alignment. In Proc. of Workshop on New Tools and Methods for Very-Large Scale
Phonetics Research, January 2011.
[36] S. M. Katz. Estimation of Probabilities from Sparse Data for the Language Model Component
of a Speech Recognizer. In IEEE Transactions on Acoustics, Speech and Signal Processing, pages
400–401, 1987.
[37] C. Kim and R. M. Stern. Feature Extraction for Robust Speech Recognition using a Power-Law
Nonlinearity and Power-Bias Subtraction, 2009.
[38] D. H. Klatt. Readings in Speech Recognition. chapter Review of the ARPA Speech Understanding
Project, pages 554–575. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
[39] Kyoto University. Julius LVCSR. http://julius.sourceforge.jp/en_index.php.
[40] L. Lamel and J.-L. Gauvain. Speech Processing for Audio Indexing. In B. Nordström and A. Ranta,
editors, Advances in Natural Language Processing, volume 5221 of Lecture Notes in Computer Science, pages 4–15. Springer Berlin Heidelberg, 2008.
[41] P. Lamere, P. Kwok, W. Walker, E. Gouvêa, R. Singh, B. Raj, and P. Wolf. Design of the CMU
Sphinx-4 Decoder. In 8th European Conference on Speech Communication and Technology (Eurospeech), 2003.
[42] E. Levin. Word Recognition Using Hidden Control Neural Architecture. In International Conference
on Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., volume 1, pages 433–436, April
1990.
[43] H. Lin, J. Bilmes, D. Vergyri, and K. Kirchhoff. OOV Detection by Joint Word/Phone Lattice
Alignment. In IEEE Workshop on Automatic Speech Recognition Understanding, 2007. ASRU.,
pages 478–483, December 2007.
[44] Mississippi State University. ISIP ASR System. http://www.isip.piconepress.com/projects/
speech/.
[45] N. Morgan and H. Bourlard. Continuous Speech Recognition Using Multilayer Perceptrons with
Hidden Markov Models. In International Conference on Acoustics, Speech, and Signal Processing,
1990. ICASSP-90., volume 1, pages 413–416, April 1990.
[46] National Institute of Standards and Technology (NIST). The History of Automatic Speech Recognition Evaluations at NIST. http://www.itl.nist.gov/iad/mig/publications/ASRhistory/.
[47] L.T. Niles and H.F. Silverman. Combining Hidden Markov Model and Neural Network Classifiers. In
International Conference on Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., volume 1,
pages 417–420, April 1990.
[48] Shmyryov NV. Free speech database voxforge.org. http://translate.google.ca/translate?
js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.dialog-21.ru%
2Fdialog2008%2Fmaterials%2Fhtml%2F90.htm&sl=ru&tl=en.
[49] G. Oppy and D. Dowe. The turing test. http://plato.stanford.edu/entries/turing-test/.
[50] D. O’Shaughnessy. Speech Communications: Human and Machine. Institute of Electrical and
Electronics Engineers, 2000.
[51] D. O'Shaughnessy. Invited Paper: Automatic Speech Recognition: History, Methods and Challenges. Pattern Recognition, 41(10):2965–2979, 2008.
[52] L. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.
Proceedings of the IEEE, 77(2):257–286, February 1989.
[53] L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine,
1986.
[54] T. Schlippe. Pronunciation Modeling. http://csl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/MMMK-PP12-PronunciationModeling-SS2012.pdf, 2012.
[55] K. Schutte and J. Glass. Speech Recognition with Localized Time-Frequency Pattern Detectors.
In IEEE Workshop on Automatic Speech Recognition Understanding, 2007. ASRU., pages 341–346,
December 2007.
[56] A. Serralheiro, I. Trancoso, T. Chambel, L. Carrio, and N. Guimares. Towards a repository of digital
talking books. In Eurospeech, 2003.
[57] R. Solera-Ureña, D. Martín-Iglesias, A. Gallardo-Antolín, C. Peláez-Moreno, and F. Díaz-de-María. Robust ASR Using Support Vector Machines. Speech Communication, 49(4):253–267, April 2007.
[58] SoX. SoX Sound eXchange. http://sox.sourceforge.net/.
[59] SoX. SoX Sound eXchange Options. http://sox.sourceforge.net/sox.html.
[60] CMU Sphinx. Adapting the Default Acoustic Model. http://cmusphinx.sourceforge.net/wiki/
tutorialadapt.
[61] J. Tebelskis, A. Waibel, B. Petek, and O. Schmidbauer. Continuous Speech Recognition Using
Linked Predictive Neural Networks. In International Conference on Acoustics, Speech, and Signal
Processing, 1991. ICASSP-91., volume 1, pages 61–64, April 1991.
[62] The International Engineering Consortium. Speech-Enabled Interactive Voice Response Systems.
http://www.uky.edu/~jclark/mas355/SPEECH.PDF.
[63] L. Tóth, B. Tarján, G. Sárosi, and P. Mihajlik. Speech recognition experiments with audiobooks. Acta Cybernetica, 19(4):695–713, January 2010.
[64] E. Trentin and M. Gori. A Survey of Hybrid ANN/HMM Models for Automatic Speech Recognition.
Neural Computing, 37(14):91 – 126, 2001.
[65] K. P. Truong and D. A. van Leeuwen. Automatic Discrimination Between Laughter and Speech.
Speech Communication, 49(2):144–158, 2007.
[66] C.J. Van Heerden, F. De Wet, and M.H. Davel. Automatic alignment of audiobooks in afrikaans.
In PRASA 2012, CSIR International Convention Centre, Pretoria. PRASA, November 2012.
[67] A. Waibel. Modular Construction of Time-Delay Neural Networks for Speech Recognition. Neural
Computing, 1(1):39–46, March 1989.
[68] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K.J. Lang. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3):328–339, March 1989.
[69] Z.-Y. Yan, Q. Huo, and J. Xu. A Scalable Approach to Using DNN-Derived Features in GMM-HMM
Based Acoustic Modeling for LVCSR. In International Speech Communication Association, August
2013.
[70] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason,
D. Povey, V. Valtchev, and P. Woodland. The HTK Book. Cambridge University Engineering
Department, 2006.
[71] D. Yu, F. Seide, and G. Li. Conversational Speech Transcription Using Context-Dependent Deep
Neural Networks. In ICML, June 2012.
List of Figures
1.1 The distribution of phone recognition accuracy as a function of the speaker on the MTBA corpus; figure taken from [66]
2.1 An overview of historical progress on machine speech recognition performance; figure taken from [46]
2.2 System diagram of a speech recognizer based on statistical models, including training and decoding processes; figure adapted from [40]
2.3 LPC speech production scheme
2.4 MFCC feature extraction procedure; figure adapted from [6]
2.5 PLP feature extraction procedure; figure adapted from [6]
2.6 Possible pronunciations of the word ‘and’; figure adapted from [6]
2.7 Steps in pronunciation modelling; figure adapted from [54]
2.8 Links between pronunciation dictionary, audio and text; figure adapted from [54]
2.9 Hidden Markov Model; figure adapted from [70]
3.1 High-level architecture of CMU Sphinx-4; figure adapted from [41]
3.2 High-level design of CMU Sphinx front end; figure adapted from the Sphinx documentation [9]
3.3 Basic flow chart of how the components of Sphinx-4 fit together; figure adapted from [21]
3.4 Global properties
3.5 Recognizer and Decoder components
3.6 ActiveList component
3.7 Pruner and Scorer configurations
3.8 Linguist component
3.9 Grammar component
3.10 Dictionary configuration
3.11 Acoustic model configuration
3.12 Additions
3.13 Front end configuration
3.14 The relation between original data size, window size and window shift; figure adapted from the Sphinx documentation [9]
3.15 A mel-filter bank; figure adapted from the Sphinx documentation [9]
3.16 Layout of the returned features; figure adapted from the Sphinx documentation [9]
3.17 Front end pipeline elements
3.18 Example of monitors
4.1 High-level view of our application
4.2 UML scheme of the pluginsystem.plugin project, including the class that links the plugin to our application, namely SphinxLongSpeechRecognizer
4.3 UML scheme of the pluginsystem project
4.4 The commands used for testing accuracy
4.5 UML scheme of the entire application
4.6 Example of configuration output for book “Ridder Muis”
4.7 Example of path and file names for book “Ridder Muis”
4.8 Example of an .srt file content; taken from the .srt output for “Moby Dick”
4.9 Example part of an EPUB file, generated by our application
5.1 Chart containing the size of the input text file, and length of the input audio files for both normal and slow pace
5.2 Percentage of extra audio length for the books read at slow pace, compared to the normal pace audio file length
5.3 Example of a word and its start and stop times, in .srt file format
5.4 Example of a word and its start and stop times, in .srt file format
5.5 Processing times for the automatic alignment performed on normal pace books
5.6 The mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at normal pace
5.7 The acceptable mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at normal pace
5.8 The mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at slow pace
5.9 Example of a warning for a missing word
5.10 The mean start time differences between the automatically generated alignment using the dictionaries with words missing, and using the dictionaries with missing words added, for books read at slow and normal pace
5.11 The mean start and stop, and maximum time differences between the automatically generated alignment and the PlayLane timings for the normal pace “Ridder Muis” book, using a dictionary that is missing the word “muis”
5.12 The mean start and stop, and maximum time differences between the automatically generated alignment and the PlayLane timings for the normal pace “Ridder Muis” book, with the word “muis” missing from the input text
5.13 Time deviations for each word between the automatically generated alignment and the PlayLane timings, for the normal pace “Ridder Muis” book
5.14 Example contents of a .fileids file
5.15 Example contents of a .transcription file
5.16 Example structure of the “adapted model” folder
5.17 Mean start time difference of each normal pace book, using the original acoustic model, the acoustic model trained on “Wolf heeft jeuk”, or the acoustic model trained on “De luie stoel”
5.18 Mean start time difference of each slow pace book, using the original acoustic model, the acoustic model trained on “Wolf heeft jeuk”, or the acoustic model trained on “De luie stoel”
5.19 The mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at normal pace
5.20 The acceptable mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at normal pace
5.21 The mean start time differences for both the Sphinx-4 and the Sphinx-4.5 plugin, for the books read at slow pace
5.22 The mean start time differences for the Sphinx-4.5 plugin, for the books read at normal pace, using the three different acoustic models
5.23 The mean start time differences for the Sphinx-4.5 plugin, for the six books read at normal pace that usually achieve acceptable accuracy, using the three different acoustic models
5.24 The mean start time differences for the Sphinx-4.5 plugin, for the books read at slow pace, using the three different acoustic models
B.1 The mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at slow pace, sorted by input text size
B.2 The mean start and stop time differences between the automatically generated alignment and the PlayLane timings, for the books read at slow pace, sorted by audio length
D.1 How to select a sentence using Audacity
D.2 How to export a sentence using Audacity