
New Tools and Methods for Very-Large-Scale Phonetics Research
1. Introduction
The field of phonetics has experienced two revolutions in the last century: the advent of the
sound spectrograph in the 1950s and the application of computers beginning in the 1970s.
Today, advances in digital multimedia, networking and mass storage are promising a third
revolution: a movement from the study of small, individual datasets to the analysis of published
corpora that are several orders of magnitude larger.
Peterson & Barney’s influential 1952 study of American English vowels was based on
measurements from a total of less than 30 minutes of speech. Many phonetic studies have been
based on the TIMIT corpus, originally published in 1991, which contains just over 300 minutes of
speech. Since then, much larger speech corpora have been published for use in technology
development: LDC collections of transcribed conversational telephone speech in English now
total more than 300,000 minutes, for example. And many even larger collections are now
becoming accessible, from sources such as oral histories, audio books, political debates and
speeches, podcasts, and so on.
These new bodies of data are badly needed, to enable the field of phonetics to develop
and test hypotheses across languages and across the many types of individual, social and
contextual variation. Allied fields such as sociolinguistics and psycholinguistics ought to benefit
even more. However, in contrast to speech technology research, speech science has so far taken
very little advantage of this opportunity, because access to these resources for phonetics
research requires tools and methods that are now incomplete, untested, and inaccessible to most
researchers.
Transcripts in ordinary orthography, typically inaccurate or incomplete in various ways,
must be turned into detailed and accurate phonetic transcripts that are time-aligned with the
digital recordings. And information about speakers, contexts, and content must be integrated with
phonetic and acoustic information, within collections involving tens of thousands of speakers and
billions of phonetic segments, and across collections with differing sorts of metadata that may be
stored in complex and incompatible formats. Our research aims to solve these problems by
integrating, adapting and improving techniques developed in speech technology research and
database research.
The most important technique is forced alignment of digital audio with phonetic
representations derived from orthographic transcripts, using HMM methods developed for speech
recognition technology. Our preliminary results, described below, convince us that this approach
will work. However, forced-alignment techniques must be improved and validated for robust
application to phonetics research. There are three basic challenges to be met: orthographic
ambiguity; pronunciation variation; and imperfect transcripts (especially the omission of
disfluencies). Reliable confidence measures must be developed, so as to allow regions of bad
alignment to be identified and eliminated or fixed. Researchers need an easy way to get a
believable picture of the distribution of errors in their aligned data, so as to estimate confidence
intervals, and also to determine the extent of any bias that may be introduced. And in addition to
solving these problems for English, we need to show how to apply the same techniques to a
range of other languages, with different phonetic problems, and with orthographies that (as in the
case of Arabic and Chinese) may be more phonetically ambiguous than English.
In addition to more robust forced alignment, researchers also need improved techniques
for dealing with the results. Existing LDC speech corpora involve tens of thousands of speakers,
hundreds of millions of words, and billions of phonetic segments. Other sources of transcribed
audio are collectively even larger. Different corpora, even from the same source, typically have
differing sorts of metadata, and may be laid out in quite different ways. Manual or automatic
annotation of syntactic, semantic or pragmatic categories may be added to some parts of some
data sets.
Researchers need a coherent model of these varied, complex, and multidimensional
databases, with methods to retrieve relevant subsets in a suitably combinatoric way. Approaches
to these problems were developed at LDC under NSF awards 9983258, “Multidimensional
Exploration of Linguistic Databases”, and 0317826, “Querying linguistic databases”; with key
ideas documented in Bird and Liberman (2001); and we propose to adapt and improve the results
for the needs of phonetics research.
The proposed research will help the field of phonetics to enter a new era: conducting
research using very large speech corpora, in the range from hundreds of hours to hundreds of
thousands of hours. It will also enhance research in other language-related fields, not only within
linguistics proper, but also in neighboring disciplines such as psycholinguistics, sociolinguistics
and linguistic anthropology. And this effort to enable new kinds of research also brings up a
number of research problems that are interesting in their own right, as we will explain.
2. Forced Alignment
Analysis of large speech corpora is crucial for understanding variation in speech (Keating et
al., 1994; Johnson, 2004). Understanding variation in speech is not only a fundamental goal of
phonetics, but it is also important for studies of language change (Labov, 1994), language
acquisition (Pierrehumbert, 2003), psycholinguistics (Jurafsky, 2003), and speech technology
(Benzeghiba et al., 2007). In addition, large speech corpora provide rich sources of data to study
prosody (Grabe et al., 2005; Chu et al., 2006), disfluency (Shriberg, 1996; Stouten et al., 2006),
and discourse (Hastie et al., 2002).
The ability to use speech corpora for phonetics research depends on the availability of
phonetic segmentation and transcriptions. In the last twenty years, many large speech corpora
have been collected; however, only a small portion of them have come with phonetic
segmentation and transcriptions, including: TIMIT (Garofolo et al., 1993), Switchboard (Godfrey &
Holliman, 1997), the Buckeye natural speech corpus (Pitt et al., 2007), the Corpus of
Spontaneous Japanese (http://www.kokken.go.jp/katsudo/seika/corpus/public/), and the Spoken
Dutch Corpus (http://lands.let.kun.nl/cgn/ehome.htm). Manual phonetic segmentation is time-consuming and expensive (Van Bael et al., 2007); it takes about 400 times real time (Switchboard
Transcription Project, 1999) or 30 seconds per phoneme (1800 phonemes for 15 hours) (Leung
and Zue, 1984). Furthermore, manual segmentation is somewhat inconsistent, with much less
than perfect inter-annotator agreement (Cucchiarini, 1993).
Forced alignment has been widely used for automatic phonetic segmentation in speech
recognition and corpus-based concatenative speech synthesis. This task requires two inputs:
recorded audio and (usually) word transcriptions. The transcribed words are mapped into a phone
sequence in advance by using a pronouncing dictionary, or grapheme to phoneme rules. Phone
boundaries are determined based on the acoustic models via computer algorithms such as Viterbi
search (Wightman and Talkin, 1997) and Dynamic Time Warping (Wagner, 1981).
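As a concrete illustration of the word-to-phone mapping step, the sketch below expands a short transcript into a phone sequence using a toy, CMUdict-style pronouncing dictionary; the dictionary entries and the choice of a single variant per word are simplifying assumptions, since a real aligner keeps all variants available to the decoder.

```python
# Minimal sketch (toy dictionary entries): expand an orthographic transcript
# into a phone sequence using a CMUdict-style pronouncing dictionary.
PRON_DICT = {
    "the":      [["DH", "AH0"], ["DH", "IY0"]],       # multiple variants
    "court":    [["K", "AO1", "R", "T"]],
    "will":     [["W", "IH1", "L"]],
    "hear":     [["HH", "IY1", "R"]],
    "argument": [["AA1", "R", "G", "Y", "AH0", "M", "AH0", "N", "T"]],
}

def transcript_to_phones(words, variant=0):
    """Map each word to one pronunciation.  A real aligner keeps all
    variants in a lattice and lets Viterbi decoding pick the best path."""
    phones = []
    for w in words:
        entries = PRON_DICT.get(w.lower())
        if entries is None:
            raise KeyError(f"out-of-vocabulary word: {w}")
        phones.extend(entries[variant % len(entries)])
    return phones

print(transcript_to_phones("The Court will hear argument".split()))
```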
The most frequently used approach for forced alignment is to build a Hidden Markov
Model (HMM) based phonetic recognizer. The speech signal is analyzed as a successive set of
frames (e.g., every 3 - 10 ms). The alignment of frames with phonemes is determined via the
Viterbi algorithm, which finds the most likely sequence of hidden states (in practice each phone
has 3-5 states) given the observed data and the acoustic model represented by the HMMs. The
acoustic features used for training HMMs are normally cepstral coefficients such as MFCCs
(Davis and Mermelstein, 1980) and PLPs (Hermansky, 1990). A common practice involves
training single Gaussian HMMs first and then extending these HMMs to more Gaussians
(Gaussian Mixture Models (GMMs)). The reported performances of state-of-the-art HMM-based
forced alignment systems range from 80%-90% agreement (of all boundaries) within 20 ms
compared to manual segmentation on TIMIT (Hosom, 2000). Human labelers have an average
agreement of 93% within 20 ms, with a maximum of 96% within 20 ms for highly-trained
specialists (Hosom, 2000).
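The following sketch shows the core of such an alignment in miniature: a Viterbi search that forces a sequence of frames through a fixed, left-to-right phone sequence. It is deliberately simplified (one state per phone, hand-specified frame log-likelihoods, uniform transition penalties) rather than a model of any particular HMM toolkit.

```python
import numpy as np

def viterbi_align(loglik, self_loop=-0.1, advance=-0.1):
    """loglik[t, j] = log P(frame t | phone j).  Returns, for each frame,
    the index of the phone it is aligned to, under a strict left-to-right
    topology (start in the first phone, end in the last)."""
    T, J = loglik.shape
    delta = np.full((T, J), -np.inf)
    back = np.zeros((T, J), dtype=int)
    delta[0, 0] = loglik[0, 0]                       # must start in phone 0
    for t in range(1, T):
        for j in range(J):
            stay = delta[t - 1, j] + self_loop
            move = delta[t - 1, j - 1] + advance if j > 0 else -np.inf
            if stay >= move:
                delta[t, j], back[t, j] = stay, j
            else:
                delta[t, j], back[t, j] = move, j - 1
            delta[t, j] += loglik[t, j]
    path = [J - 1]                                   # must end in last phone
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return list(reversed(path))

# 8 frames, 3 phones; phone boundaries fall where the path advances.
loglik = np.log(np.array([[.8,.1,.1],[.7,.2,.1],[.2,.7,.1],[.1,.8,.1],
                          [.1,.7,.2],[.1,.2,.7],[.1,.1,.8],[.1,.1,.8]]))
print(viterbi_align(loglik))   # e.g. [0, 0, 1, 1, 1, 2, 2, 2]
```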
In forced alignment, unlike in automatic speech recognition, monophone (context-independent) HMMs are more commonly used than triphone (context-dependent) HMMs. Ljolje et
al. (1997) provide a theoretical explanation as to why triphone models tend to be less precise in
automatic segmentation. In the triphone model, the HMMs do not need to discriminate between
the target phone and the context; the spectral movement characteristics are better modeled, but
phone boundary accuracy is sacrificed. Toledano et al. (2003) compare monophone and triphone
models for forced alignment under different criteria and show in their experiments that
monophone models outperform triphone models for medium tolerances (15-30 ms difference from manual segmentation). However, monophone models underperform for small tolerances (5-10
ms) and large tolerances (>35 ms).
Many researchers have tried to improve forced alignment accuracy. Hosom (2000)
uses acoustic-phonetic information (phonetic transitions, acoustic-level features, and distinctive
phonetic features) in addition to PLPs. This study shows that the phonetic transition information
provides the greatest relative improvement in performance. The acoustic-level features, such as impulse detection, intensity discrimination, and voicing features, provide the next-greatest
improvement, and the use of distinctive features (manner, place, and height) may increase or
decrease performance, depending on the corpus used for evaluation. Toledano et al. (2003)
propose a statistical correction procedure to compensate for the systematic errors produced by
context-dependent HMMs. The procedure comprises two steps: a training phase, where
some statistical averages are estimated; and a boundary correction phase, where the phone
boundaries are moved according to the estimated averages. The procedure has been shown to
correct segmentations produced by context-dependent HMMs; therefore, the results are more
accurate than those obtained by context-independent and context-dependent HMMs alone. There
are also studies in the literature that attempt to improve forced alignment by using a different
model than HMMs. Lee (2006) employs a multilayer perceptron (MLP) to refine the phone
boundaries provided by HMM-based alignment; Keshet et al. (2005) describe a new paradigm for
alignment based on Support Vector Machines (SVMs).
Although forced alignment works well on read speech and short sentences, the alignment of
long and spontaneous speech remains a great challenge (Osuga et al., 2001; Toth, 2004).
Spontaneous speech contains filled pauses, disfluencies, errors, repairs, and deletions that do
not normally occur in read speech and are often omitted in transcripts. Moreover, pronunciations
in spontaneous speech are much more variable than in read speech.
Researchers have attempted to improve recognition of spontaneous speech (Furui, 2005)
by: using better models of pronunciation variation (Strik & Cucchiarini, 1998; Saraclar et al.,
2000); using prosodic information (Wang, 2001, Shriberg & Stolcke, 2004); and improving
language models (Stolcke & Shriberg, 1996; Johnson et al., 2004).
With respect to pronunciation models, Riley et al. (1999) use statistical decision trees to
generate alternate word pronunciations in spontaneous speech. Bates et al. (2007) present a
phonetic-feature-based prediction model of pronunciation variation. Their study shows that
feature-based models are more efficient than phone-based models; they require fewer
parameters to predict variation and give smaller distance and perplexity values when comparing
predictions to the hand-labeled reference. Saraclar et al. (2000) propose a new method of
accommodating nonstandard pronunciations: rather than allowing a phoneme to be realized as
one of a few alternate phones, the HMM states of the phoneme’s model are allowed to share
Gaussian mixture components with the HMM states of the model(s) of the alternate realization(s).
Prosody and language models have largely been used in combination to improve automatic recognition of spontaneous speech. Liu et al. (2006) describe a metadata (sentence boundaries,
pause fillers, and disfluencies) detection system; it combines information from different types of
textual knowledge sources with information from a prosodic classifier. Huang and Renals (2007)
incorporate syllable-based prosodic features into language models. Their experiment shows that
exploiting prosody in language modeling significantly reduces perplexity and marginally reduces
word error rate.
In contrast to automatic recognition, little effort has been made to reduce forced alignment
errors for spontaneous speech. Automatic phonetic transcription procedures tend to focus on the
accuracy of the phonetic labels generated rather than the accuracy of the boundaries of the
labels. Van Bael et al. (2007) show that in order to approximate the quality of the manually
verified phonetic transcriptions in the Spoken Dutch corpus, one only needs an orthographic
transcription, a canonical lexicon, a small sample of manually verified phonetic transcriptions,
software for the implementation of decision trees, and a standard continuous speech recognizer.
Chang et al. (2000) developed an automatic transcription system that does not use word-level
transcripts. Instead, special-purpose neural networks are built to classify each 10 ms frame of
speech in terms of articulatory-acoustic-based phonetic features; the features are subsequently
mapped to phonetic labels using multilayer perceptron (MLP) networks. The phonetic labels
generated by this system are 80% concordant with the labels produced by human transcribers.
Toth (2004) presents a model for segmenting long recordings into smaller utterances. This
approach estimates prosodic phrase break locations and places words around breaks (based on
length and break probabilities for each word).
Forced alignment assumes that the orthographic transcription is correct and accurate.
However, transcribing spontaneous speech is difficult. Disfluencies are often missed in the
transcription process (Lickley & Bard, 1996). Instructions to attend carefully to disfluencies
increase bias to report them but not accuracy in locating them (Martin & Strange, 1968). Forced
alignment also assumes that our word-to-phoneme mapping generates a path that contains the
correct pronunciation – but of course, natural speech is highly variable.
The obvious approach is to use language models to postulate additional disfluencies that
may have been omitted in the transcript, and to use models of pronunciation variation to enrich
the lattice of pronunciation alternatives for words in context; and then to use the usual HMM
Viterbi decoding to choose the best path given the acoustic data. Most of the research on related
topics is aimed at improving speech recognition rather than improving phonetic alignments, but
the results suggest that these approaches, properly used, will not only give better alignments, but
also provide valid information about the distribution of phonetic variants. For example, Fox (2006)
demonstrated that a forced alignment technique worked well in studying the distribution of s-deletion in Spanish, using LDC corpora of conversational telephone speech and radio news broadcasts. She was also able to get reliable estimates of the distribution of the durations of non-deleted /s/ segments.
A critical component of any such research is estimation of the distribution of errors, whether
in disambiguating alternative pronunciations, correcting the transcription of disfluencies, or
determining the boundaries of segments. Since human annotators also disagree about these
matters, it’s crucial to compare the distribution of human/human differences as well as the
distribution of human/machine differences. And in both cases, the mean squared (or absolute-value) error often matters less than the bias. If we want to estimate (for example) the average
duration of a certain vowel segment, or the average ratio of durations between vowels and
following voiced vs. voiceless consonants, the amount of noise in the measurement of individual
instances matters less than the bias of the noise, since as the volume of data increases, our
confidence intervals will steadily shrink – and the whole point of this enterprise is to increase the
available volume of data by several orders of magnitude.
Fox (2006) found this kind of noise reduction, just as we would hope, so that overall
parameter estimates from forced alignment converged with the overall parameter estimates from
human annotation. We will need to develop standard procedures for checking this in new
applications. Since a sample of human annotations is a critical and expensive part of this
process, a crucial step will be to define the minimal sample of such annotations required to
achieve a given level of confidence in the result.
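The kind of check described here can be scripted directly once a sample of manual boundaries is available. The sketch below, on simulated boundary times, separates the bias (mean signed error) from the mean absolute error and bootstraps a confidence interval for the bias; the simulated 4 ms bias and 15 ms noise level are arbitrary illustration values, not measured ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated paired boundaries: in practice `manual` comes from the human
# sample and `auto` from forced alignment of the same material.
manual = rng.uniform(0, 60, size=500)
auto = manual + rng.normal(loc=0.004, scale=0.015, size=500)   # 4 ms bias, 15 ms noise

err = auto - manual
print(f"bias (mean signed error): {err.mean() * 1000:.1f} ms")
print(f"mean absolute error:      {np.abs(err).mean() * 1000:.1f} ms")

# Bootstrap a 95% confidence interval for the bias.
boot = np.array([rng.choice(err, size=err.size, replace=True).mean()
                 for _ in range(2000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for bias: [{lo * 1000:.1f}, {hi * 1000:.1f}] ms")
```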
3. Preliminary Results
3.1. The Penn Phonetics Lab Forced Aligner
The U.S. Supreme Court began recording its oral arguments in the early 1950s; some 9,000
hours of recording are stored in the National Archives. The transcripts do not identify the
speaking turns of individual Justices but refer to them all as “The Court”. As part of a project to
make this material available online in aligned digital form, we have developed techniques for
identifying speakers and aligning entire (hour-long) transcripts with the digitized audio (Yuan &
Liberman, 2008). The Penn Phonetics Lab Forced Aligner was developed from this project.
Seventy-nine arguments of the SCOTUS corpus were transcribed, speaker identified, and
manually word-aligned by the OYEZ project (http://www.oyez.org). Silence and noise segments in
these arguments were also annotated. A total of 25.5 hours of speaker turns were extracted from
the arguments and used for our training data; one argument was set aside for testing purposes.
Silences were separately extracted and randomly added to the beginning and end of each turn.
Our acoustic models are GMM-based, monophone HMMs. Each HMM state has 32 Gaussian
Mixture components on 39 PLP coefficients (12 cepstral coefficients plus energy, and Delta and
Acceleration). The models were trained using the HTK toolkit (http://htk.eng.cam.ac.uk) and the
CMU American English Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict).
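For readers unfamiliar with the 39-dimensional observation vectors mentioned above, the sketch below shows how delta and acceleration coefficients are appended to a 13-dimensional base (12 cepstra plus energy) using the standard regression formula; the base features here are random placeholders rather than actual PLP cepstra, and the regression window size is an assumption.

```python
import numpy as np

def deltas(feats, window=2):
    """Regression deltas over +/- `window` frames (feats is frames x dims)."""
    T, D = feats.shape
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    num = sum(w * (padded[window + w: window + w + T]
                   - padded[window - w: window - w + T])
              for w in range(1, window + 1))
    return num / (2 * sum(w * w for w in range(1, window + 1)))

# Base features: 12 cepstral coefficients + energy per frame (placeholders).
base = np.random.randn(200, 13).astype(np.float32)
d = deltas(base)          # delta coefficients
a = deltas(d)             # acceleration coefficients
obs = np.hstack([base, d, a])      # 39-dimensional observation vectors
print(obs.shape)                   # (200, 39)
```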
We tested the forced aligner on both TIMIT (the training set data) and the Buckeye corpus
(the data of speaker s14). TIMIT is read speech and the audio files are short (a few seconds each). The
Buckeye corpus is spontaneous interview speech and the audio files are nine minutes long on
average. Table 1 lists the average absolute difference between the automatic and manually
labeled phone boundaries; it also lists the percentage of agreement within 25 ms (the length of
the analysis window used by the aligner) between forced alignment and manual segmentation.
Table 1. Performance of the PPL Forced Aligner on TIMIT and Buckeye.

                                          TIMIT      Buckeye
  Average absolute difference             12.5 ms    21.2 ms
  Percentage of agreement within 25 ms    87.6%      79.2%
We also tested the aligner on hour-long audio files - i.e., alignment of entire hour-long
recordings without cutting them into smaller pieces - using the British National Corpus (BNC) and
the SCOTUS corpus. The spoken part of the BNC corpus consists of informal conversations
recorded by volunteers. The conversations contain a large amount of background noise, speech
overlaps, etc. To help our forced aligner better handle the BNC data, we combined the CMU
pronouncing dictionary with the Oxford Advanced Learner's Dictionary (http://ota.ahds.ac.uk/headers/0710.xml), which is a British English pronouncing dictionary. We
also retrained the silence and noise model using data from the BNC corpus. We manually
checked the word alignments on a 50-minute recording, and 78.6% of the words in the recording
were aligned accurately. The argument in the SCOTUS corpus that was set aside for testing in
our study is 58 minutes long and manually word-aligned. The performance of the aligner on this
argument is shown by Figure 1, where a boxplot of alignment errors (absolute differences from
manual segmentation) in every minute from the beginning to the end of the recording is drawn.
We can see that the alignment is consistently good throughout the entire recording.
Figure 1. Alignment errors in every minute in a 58-minute recording.
Possible reasons for why our forced aligner can handle long and spontaneous speech well
include: the high quality of the training data; the fact that the training data is large enough to train
robust monophone GMM models; and the robustness of the silence and noise models.
3.2. Phonetics research using very large speech corpora and forced alignment
We have used large speech corpora to investigate speech and language phenomena such
as speaking rate (Yuan et al., 2006), speech overlap (Yuan et al., 2007), stress (Yuan et al.,
2008), duration (Yuan, 2008), and tone sandhi (Chen & Yuan, 2007). We will now summarize our
study on English word stress, which shows how we can revisit classic phonetic and phonological
problems from the perspective of utilizing very large speech corpora.
Studies on the acoustic correlates of word-level stress have demonstrated contradictory
results regarding the importance of the acoustic correlates (see a review in Okobi, 2006). Most of
the studies are based on small amounts of laboratory speech. By contrast, the acoustics of
secondary stress - especially the three-way distinction of primary-stress, secondary-stress, and
reduced vowels - has not been widely studied.
We investigated the pitch and duration of vowels from different lexical stress classes in the
SCOTUS corpus. The vowels were automatically segmented using the Penn Phonetics Lab
Forced Aligner, including 157,138 primary-stress vowels, 10,368 secondary-stress vowels, and
116,229 reduced vowels. The durations of the vowels were calculated from forced alignment. The
F0 was extracted using Praat (http://www.praat.org) and converted to a semitone scale. The base
frequency used for calculating semitones was Justice-dependent and was defined as the 10th
percentile of all F0 values for that Justice. A simple linear regression was applied to the pitch
contour of each turn, and the regression residuals were used for the pitch analysis. Using the
regression residuals instead of the real pitch values normalized the global downtrend of the pitch
contours and captured the local pitch behavior of the vowels.
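A sketch of this normalization on made-up F0 values is given below: conversion to semitones relative to a low-percentile base, a per-turn linear fit, and use of the residuals. In our study the base was the 10th percentile of all F0 values for a given Justice, whereas here it is computed from a single simulated turn.

```python
import numpy as np

rng = np.random.default_rng(1)

# One simulated speaking turn with a declining F0 contour (placeholder data).
times = np.linspace(0, 5, 250)
f0_hz = 180 - 8 * times + rng.normal(0, 5, times.size)

base_hz = np.percentile(f0_hz, 10)            # speaker-dependent base frequency
semitones = 12 * np.log2(f0_hz / base_hz)     # convert F0 to semitones

slope, intercept = np.polyfit(times, semitones, 1)     # per-turn linear fit
residuals = semitones - (slope * times + intercept)    # local pitch behavior
print(residuals[:5])
```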
Figure 2 shows the F0 contours of the vowels. We segmented each vowel into four equal
parts and averaged the pitches within each quarter. The F0 contour of the primary-stress vowels stayed well above zero, which represented the pitch regression line. The contours of secondary-stress and reduced vowels were very similar; both were below the regression line.
Figure 2. F0 contours of primary-stress (‘1’), secondary-stress (‘2’) and reduced (‘0’) vowels.
The histograms in Figure 3 (below) show the frequency distributions of vowel duration for the
three stress classes. Interestingly, the secondary-stress vowels were more similar to the primary-stress vowels in terms of duration. The reduced vowels were much shorter than these two types
of vowels.
Figure 3. Duration densities of primary-stress (‘1’), secondary-stress (‘2’) and reduced (‘0’)
vowels.
3.3. Improving automatic phonetic alignments
Forced alignment is a powerful tool for utilizing very large speech corpora in phonetics
research, but as we noted, it has several obvious problems: orthographic ambiguity,
pronunciation variation, and imperfect transcripts. The general approach in all cases is to add
alternative paths to the “language model” (which in the simplest case is just a simple sequence of
expected phonetic segments), with estimates of the a priori probability of the alternatives, and let
the Viterbi decoding choose the best option. In some cases, it may also be helpful to add
additional acoustically-based features – perhaps based on some decision-specific machine
learning – designed to discriminate among the alternatives. We and others have gotten
promising results with such techniques, and we’re confident that with some improvements, they
will deal adequately with the problems, as well as adding information about the distribution of
phonetic variants in speech.
Pretonic schwa deletion (e.g., suppose -> [sp]ose) presents a typical challenge of this
type (Hooper, 1978; Patterson et al., 2003; Davidson, 2006). Editing the pronouncing dictionary
may solve the problem, but it is time-consuming and error-prone. We propose a different
approach: using a “tee model” for schwa in forced alignment. A “tee-model” has a direct transition
from the entry to the exit node in the HMM; therefore, a phone with a “tee-model” can have “zero”
length. The “tee-model” has mainly been used for handling possible inter-word silence. In a pilot
experiment, we trained a “tee-model” for schwa and used the model to identify schwa elision
(“zero” length from alignment) in the SCOTUS corpus.
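The essence of a tee model is a nonzero entry-to-exit transition. The sketch below writes out such a transition matrix for a three-emitting-state phone HMM in the usual HTK-style layout; the probabilities are illustrative placeholders, not trained values.

```python
import numpy as np

# Transition matrix of a 3-emitting-state HMM in HTK's 5-state layout
# (row/column 0 = entry, 4 = exit).  A "tee model" simply has a nonzero
# entry->exit probability (trans[0, 4]), so Viterbi decoding may skip the
# phone entirely, giving it "zero" length in the alignment.
trans = np.array([
    [0.0, 0.7, 0.0, 0.0, 0.3],   # entry: 0.3 chance of skipping the schwa
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0],   # exit row (non-emitting)
])
assert np.allclose(trans[:4].sum(axis=1), 1.0)
```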
We asked a phonetics student to examine all the tokens of the word suppose in the corpus
(ninety-nine total) and manually identify whether there was a schwa in the word by listening to the
sound and looking at the spectrogram. The agreement between the forced alignment procedure
and the manual procedure was 88.9% (88/99). Of the 24 tokens identified as ‘no schwa’ (schwa elision) by the student, 22 (91.7%) were correctly identified by the aligner; of the 75 tokens identified as having a schwa by the student, 66 (88%) were correctly identified by the
aligner. Figure 4 (below) illustrates two examples from the forced alignment results. We can see
that the forced aligner correctly identifies a schwa in the first word and a schwa elision in the
second word, although the word suppose does not have a pronunciation variant with schwa
elision in the pronouncing dictionary.
Figure 4. Identifying schwa elision through forced alignment and “tee-model”.
4. Research Plans
We plan to improve the Penn Phonetics Lab (PPL) Forced Aligner in two respects: 1) its
segmentation accuracy; and 2) its robustness to conversational speech and long recordings. We
will further explore techniques for modeling phonetic variation and recognizing untranscribed
disfluencies, and for marking regions of unreliable alignment. In addition, we will extend this
system to other speech genres (e.g., child directed speech) and more languages, including
Mandarin Chinese, Arabic, Vietnamese, Hindi/Urdu, French, and Portuguese. We will apply these
techniques to the LDC’s very large speech corpora, and explore how to integrate the resulting
automated annotations into a database system that is convenient for phonetic search and
retrieval, as per the techniques developed in previous NSF-funded projects at LDC. In addition to
using these results in our own research, we will publish both the annotations and the search
software for use by the research community at large, in order to learn as much as possible about
the issues that arise in applying this new approach.
4.1. Analysis of segmentation errors
To evaluate the performance of forced alignment and analyze segmentation errors, we
will develop phonetic segmentation data that can serve as a gold-standard benchmark. This
dataset will be created in a uniform format for all our target languages. Manual phonetic
segmentation is very time consuming, so we will randomly select representative utterances from
the spoken corpora. Half an hour of benchmark data will be created for English, and lesser amounts
for other languages, consistent with estimating confidence intervals for a range of phonetic
measurements. These datasets will be published through LDC by the end of the second year of
the project.
Error analysis provides information about where and how the system should be
improved; it allows us to estimate whether the deviations from human annotation introduce any
bias; and in addition, it may yield insights of its own. Thus Greenberg and Chang (2000)
conducted a diagnostic evaluation of eight Switchboard-corpus recognition systems, and found
that syllabic structure, prosodic stress and speaking rate were important factors in accounting for
recognition performance.
We performed a preliminary error analysis on alignment of the TIMIT corpus using the PPL
aligner. As shown in Table 2, we found that the signed alignment errors between different phone
classes have different patterns. There is no bias towards either phone class for the boundaries
between Nasals and Glides (no matter which is first, -0.002s vs. 0.006s); however, there is a
significant bias towards Stops for the boundaries between Stops and Glides (no matter which is
first, -0.01s vs. 0.015s). There is no bias for the boundaries between Vowels and Glides (-0.002
s), but there is a significant bias towards Vowels for the boundaries between Glides and Vowels
(0.013s). We will undertake further analyses to reveal how the error patterns are related to phone
characteristics, coarticulation, and syllable structure. We will then use the information to improve
forced alignment.
Table 2. Average signed errors (in seconds) for boundaries between broad phone classes; rows give the first phone class and columns the second (Affricate, Fricative, Glide, /h/, Nasal, Stop, Vowel).
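Computing such a breakdown is straightforward once automatic and manual boundaries have been paired. The sketch below groups signed boundary errors by the broad classes of the phones on either side of each boundary; the handful of records shown are placeholders for the paired boundary data.

```python
import pandas as pd

# Each record pairs a forced-alignment boundary with the manual gold standard:
# left phone class, right phone class, automatic time (s), manual time (s).
rows = [
    ("Stop",  "Glide", 12.310, 12.322),
    ("Glide", "Vowel", 12.455, 12.441),
    ("Nasal", "Glide",  3.108,  3.110),
    ("Stop",  "Glide",  7.894,  7.901),
]
df = pd.DataFrame(rows, columns=["left", "right", "auto", "manual"])
df["signed_error"] = df["auto"] - df["manual"]   # positive = aligner is late

print(df.groupby(["left", "right"])["signed_error"].agg(["mean", "count"]))
```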
4.2. Analysis of phonetic variation
A key issue in forced alignment is the inherent variation of human speech. Phonetic variation
has been extensively studied (Keating et al., 1994; Bell et al., 2003; Johnson, 2004). Based upon
our review of the literature, we will conduct studies to investigate the phonetic variation in the
TIMIT, Buckeye and SCOTUS corpora from the perspective of forced alignment. Such studies are
essential to improving forced alignment and improving our understanding of the mechanisms of
speech production. Our study on pretonic schwa deletion is a good example of how to integrate
phonetic variation analysis and forced alignment.
In this research, we will investigate how to better handle phonetic variation (i.e., deletion,
reduction, and insertion) for the purpose of forced alignment. One possible experiment on vowel
reduction involves building a system in which all English reduced vowels are the same phoneme;
this special phoneme would have triphone (context dependent) models instead of a monophone
model.
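One way to set up such an experiment, sketched below under the assumption of a CMUdict-style lexicon with stress digits, is to rewrite every vowel carrying stress digit 0 as a single reduced-vowel symbol (arbitrarily called AX0 here) before training; the lexicon entries shown are toy examples.

```python
# Collapse all unstressed (stress digit 0) vowels in a CMUdict-style lexicon
# onto one reduced-vowel symbol, so a single shared model covers them all.
LEXICON = {
    "suppose": ["S", "AH0", "P", "OW1", "Z"],
    "sofa":    ["S", "OW1", "F", "AH0"],
    "about":   ["AH0", "B", "AW1", "T"],
}

def merge_reduced(pron):
    """Replace any vowel whose stress digit is 0 with the symbol AX0."""
    return ["AX0" if p[-1] == "0" and p[:-1].isalpha() else p for p in pron]

merged = {word: merge_reduced(pron) for word, pron in LEXICON.items()}
print(merged["suppose"])   # ['S', 'AX0', 'P', 'OW1', 'Z']
```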
Figure 5. Forced alignment errors by speakers.
We will also conduct error analyses in terms of speaker variation using the TIMIT, Buckeye,
and SCOTUS corpora. Toledano et al. (2003) demonstrate that the use of speaker adaptation
techniques increases segmentation precision. Figure 5 (above) shows the absolute phone
alignment errors using the PPL aligner on individual speakers in the TIMIT corpus. We can see
that a small number of speakers deviate greatly from the others. The deviations may be due to
factors such as speaking rate, disfluency, dialect, and speaker characteristics of pronunciation
and prosody. In addition, as shown above, we found that speakers vary to different degrees on
different phones (Yuan and Liberman, 2008). We aim to understand how speaker variation affects
forced alignment and to improve forced alignment through better adaptation to speaker variation.
A remaining challenge is the robust identification of regions where (for whatever reason) the
automatic process has gone badly wrong, so that the resulting data should be disregarded. The
use of likelihood scores or other confidence measures is the obvious approach, but the history of
such confidence measures in automatic speech recognition is mixed at best. We hypothesize that
this is mainly because the language model is weighted more heavily than the acoustic model in
automatic speech recognition. In forced alignment, however, the word sequence is (mostly or
completely) given, so the language model plays much less of a role in computing the likelihood
scores, which are therefore more reliable and useful.
We have used likelihood scores to study speaker variation on individual phones. In
particular, we asked which phones are more acoustically variable among speakers. We again
turned to the SCOTUS data. To identify the speaking turns of individual Justices, we trained
Justice-dependent, GMM-based, monophone HMMs. We computed the acoustic variation among
Justices on individual phones using two methods. First, we directly computed the distances of the
GMM models of the Justices on each phone. A natural measure between two distributions is the
Kullback-Leibler divergence (Kullback, 1968); however, it cannot be analytically computed in the
case of a GMM. We adopted a dissimilarity measure proposed in Goldberger and Aronowitz
(2005), which is an accurate and efficiently computed approximation of the KL-divergence. Next,
we computed the acoustic distance from likelihood scores. Let L(/p/i, Mj) denote the average
likelihood score of the /p/ phones of speaker i when the phones are force-aligned using speaker
j’s model Mj. L(/p/i, Mj) measures the distance of the phone /p/ from speaker i to speaker j.
Figure 6 (below) shows that the correlation between the KL-divergence measure and the
likelihood score distance measure is very high (r = 0.88). This result suggests that likelihood
scores can be used as reliable measurements of acoustic and phonetic variations. The likelihood
scores are particularly useful when there are not enough data to train GMM models. For example,
to evaluate the sentence pronunciation of a foreign language learner, we can use the likelihood
score obtained from alignment of the sentence against acoustic models trained on the standard
accent speech.
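For readers who want to reproduce this kind of comparison, the sketch below estimates the KL divergence between two GMMs by Monte Carlo sampling (a simpler alternative to the Goldberger and Aronowitz (2005) approximation used in our study) and computes a likelihood-based distance analogous to L(/p/i, Mj); the frame data are synthetic stand-ins for per-phone PLP frames from two speakers.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Synthetic stand-ins for the per-phone 39-dimensional frames of speakers A and B.
frames_a = rng.normal(0.0, 1.0, size=(2000, 39))
frames_b = rng.normal(0.3, 1.1, size=(2000, 39))

gmm_a = GaussianMixture(n_components=8, covariance_type="diag",
                        random_state=0).fit(frames_a)
gmm_b = GaussianMixture(n_components=8, covariance_type="diag",
                        random_state=0).fit(frames_b)

# Monte Carlo estimate of KL(A || B): sample from A, average the log ratio.
samples, _ = gmm_a.sample(5000)
kl_ab = np.mean(gmm_a.score_samples(samples) - gmm_b.score_samples(samples))

# Likelihood-distance analogue of L(/p/_i, M_j): mean per-frame log-likelihood
# of speaker A's frames under speaker B's model.
lik_dist = gmm_b.score(frames_a)
print(f"MC KL(A||B) = {kl_ab:.2f},  L(A, M_B) = {lik_dist:.2f}")
```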
Figure 6. Correlation between KL divergence and likelihood score distance
4.3. Integration of phonetic models
To our understanding, part of the reason why the integration of phonetic knowledge has not
significantly improved the accuracy of speech recognition is the strong impact of the language
model in automatic speech recognition (ASR) procedures. Since the word sequence is provided
in forced alignment, the application of phonetic knowledge is more likely to be successful here.
The proposed research will attempt to improve forced alignment by incorporating well-established
phonetic models. Specifically, we will explore the phone-transition model (Hertz, 1991), the π-gesture model (Byrd & Saltzman, 2003), and the landmark model (Stevens, 2002).
Hertz (1991) presents a phone-transition model that treats formant transitions as
independent units between phones, rather than incorporating the transitions into the phones as in
more conventional models. The phone-transition model can be easily implemented in the HMM
framework. However, several questions require investigation. For example, can (some)
transitions be clustered with (some) reduced vowels in their acoustic models? The π-gesture
model of Byrd and Saltzman (2003) suggests that boundary-related durational patterning can
result from prosodic gestures or π-gestures, which stretch or shrink the local temporal fabric of an
utterance. We propose to incorporate the π-gesture model into the forced alignment procedure
through the rescoring of alignment lattices (Jennequin & Gauvain, 2007). Stevens (2002)
proposes a model for lexical access based on acoustic landmarks and distinctive features.
Landmark-based speech recognition has advanced in recent years (Hasegawa-Johnson et al.,
2005). We will adopt a two-step procedure to apply the landmark model in forced alignment. In
the first step, segment boundaries will be obtained by the HMM-based PPL forced aligner. In the
second step, the boundaries will be refined through landmark detection, using the framework
proposed in Juneja and Espy-Wilson (2008).
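As a toy illustration of the second step, the sketch below nudges an HMM-derived boundary to the frame of greatest spectral change within a small window; this is only a crude stand-in for landmark detection and is not the Juneja and Espy-Wilson (2008) framework itself.

```python
import numpy as np

FRAME_SEC = 0.010     # assume 10 ms analysis frames

def refine_boundary(spectra, boundary_frame, window_frames=3):
    """spectra: frames x frequency-bins.  Move the boundary to the frame of
    largest frame-to-frame spectral change within +/- window_frames."""
    lo = max(1, boundary_frame - window_frames)
    hi = min(len(spectra) - 1, boundary_frame + window_frames)
    change = [np.sum((spectra[t] - spectra[t - 1]) ** 2)
              for t in range(lo, hi + 1)]
    return lo + int(np.argmax(change))

spectra = np.random.rand(100, 40)              # placeholder spectrogram
print(refine_boundary(spectra, boundary_frame=50) * FRAME_SEC, "s")
```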
4.4. Incorporating prosody and language models
The transcriptions of long and spontaneous speech are usually imperfect. Spontaneous
speech contains filled pauses, disfluencies, errors, repairs, and deletions that are often missed in
the transcription process. Recordings of long and spontaneous speech usually contain
background noises, speech overlaps, and very long non-speech segments. These factors make
the alignment of long and spontaneous speech a great challenge. We aim to improve the
robustness of the PPL aligner to long and spontaneous speech in two aspects: 1) improve the
acoustic models of silences, noises, and filled pauses; and 2) introduce constraints from prosody
and language model into forced alignment. We will use the Buckeye, SCOTUS, and BNC corpora
for this part of the research.
In our experiments on alignment of the BNC corpus - which consists of casual and long
speech in a natural setting - we found that erroneous alignments can be reduced by adapting the
silence and noise models of the PPL aligner to the BNC data. We will further explore the
importance of the non-speech models in forced alignment of long and casual speech. We will also
investigate ways to improve the acoustic models for better handling filled pauses. Schramm et al.
(2003) created many pronunciation variants for a filled pause through a data-driven lexical
modeling technique. The new model outperforms the single-pronunciation filled pause model in
recognition of highly spontaneous medical speech. Stouten et al. (2006) argue that a better way
to cope with pause fillers in speech recognition is to introduce a specialized filled pause detector
(as a preprocessor) and supply the output of that detector to the general decoder. We will explore
these two approaches for our purpose of improving forced alignment of long and casual speech.
A common practice in forced alignment is to insert a “tee-model” phone, called sp (short
pauses), after each word in the transcription for handling possible inter-word silence. Since a
“tee-model” has a direct transition from the entry to the exit node, sp can be skipped during forced
alignment. In this way, a forced aligner can “spot” and segment pauses in the speech, which are
usually not transcribed. In casual and long speech, such pauses can be extremely long and filled
with background noises. In this case, the sp-insertion approach could cause severe problems.
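For concreteness, the sketch below builds the phone sequence with an optional sp token after every word, using toy pronunciations; because sp is a tee model, the decoder may skip any of these tokens, which is how untranscribed pauses are spotted.

```python
# Toy pronunciations; a real system would use the full pronouncing dictionary.
TOY_DICT = {
    "how": ["HH", "AW1"], "are": ["AA1", "R"],
    "you": ["Y", "UW1"], "doing": ["D", "UW1", "IH0", "NG"],
}

def phones_with_optional_sp(words):
    """Append a skippable short-pause phone (sp) after each word."""
    seq = []
    for w in words:
        seq.extend(TOY_DICT[w])
        seq.append("sp")          # tee model: may be skipped at alignment time
    return seq

print(phones_with_optional_sp("how are you doing".split()))
```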
In our study on the BNC corpus, we found that there are often many sp segments mistakenly
inserted by the aligner in regions where the word boundaries were not correctly aligned. We
believe that these types of errors can be reduced by introducing constraints on the occurrences of
sp from both a language model and a prosodic model. For example, pauses are unlikely to occur between the words of very common phrases such as “How are you doing?”. On the other hand, if
there is a single word between two pauses in speech, the word is likely to be lengthened; hence,
it should have longer duration and particular F0 characteristics.
Another type of error we have seen from the BNC corpus is that some words are extremely
long in the alignment results. This usually occurs when there is long speech-like background
noise surrounding the words. This type of error can be reduced by introducing constraints on
word or phone duration.
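A simple post-hoc version of such a constraint is sketched below: aligned words whose duration exceeds a rough per-phone ceiling are flagged for inspection or exclusion. The 0.5-second ceiling is an arbitrary assumption, not a tuned value.

```python
MAX_SEC_PER_PHONE = 0.5    # rough ceiling (assumption, not a tuned threshold)

def flag_overlong(aligned_words):
    """aligned_words: list of (word, n_phones, start_sec, end_sec).
    Return the words whose aligned duration is implausibly long."""
    flagged = []
    for word, n_phones, start, end in aligned_words:
        if (end - start) > n_phones * MAX_SEC_PER_PHONE:
            flagged.append((word, start, end))
    return flagged

example = [("well", 3, 10.2, 10.5), ("right", 3, 10.5, 14.9)]
print(flag_overlong(example))   # [('right', 10.5, 14.9)]
```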
4.5. Extension to other speech genres and more languages
The CHILDES corpus (http://childes.psy.cmu.edu/) contains audio/video data and transcripts
collected from conversations between young children and their playmates and caretakers. It has
been a great resource for studying language acquisition and phonetic variation. We propose to
conduct forced alignment on the child directed speech data in this corpus in order to make the
data more usable for phonetic research. Kirchhoff and Schimmel (2005) trained automatic speech
recognizers on infant directed (ID) and adult directed speech (AD), respectively, and tested the
recognizers on both ID and AD speech. They found that matched conditions produced better
results than mismatched conditions, and that the relative degradation of ID-trained recognizers on
AD speech was significantly less severe than in the reverse case. We will conduct a similar study
for forced alignment by comparing the aligner trained on the SCOTUS corpus and on the
CHILDES corpus.
We will also extend the PPL aligner to more languages. Our first target languages are
Mandarin Chinese and Vietnamese, both of which are tonal languages. Many researchers have
tried to improve automatic speech recognition for tonal languages by incorporating tonal models
or by utilizing different acoustic units such as syllables and initials/finals (Fu et al. 1996, Vu et al.,
2005, Lei, 2006). This research will not investigate how to apply advanced tonal models or
acoustic unit modeling in forced alignment. Instead, we will build simple tone-dependent and
tone-independent models based on monophones, syllables, and initials/finals; we will then
choose the one that performs the best. The data we will use to build a Mandarin Chinese aligner
is the Hub-4 Mandarin Broadcast News speech (LDC98S73) and the transcripts (LDC98T24).
There is no easily accessible large speech corpus in Vietnamese. However, Giang Nguyen, a
graduate student in our lab, has collected more than 30 hours of interview speech from Vietnam
for her dissertation research; the recordings are currently being transcribed. We will use this
dataset for training a Vietnamese forced aligner.
We will also attempt to extend the forced aligner to other languages - including Arabic, Hindi,
Urdu, French and Portuguese - to provide a more general tool for conducting phonetics research
using very large corpora.
4.6. Dissemination of the research
We will disseminate the research using methods that include journal publications; open-source toolkits and web-based applications; and tutorials, workshops, and courses.
The research findings and methodology innovations will be published in archival journals.
We are currently preparing two journal articles, “Forced alignment as a tool and methodology for
phonetics research” (to submit to the Journal of International Phonetic Association), and “English
stress and vowel reduction revisited: from the perspective of corpus phonetics” (to submit to the
Journal of Phonetics). Other target journals include Speech Communication, the Journal of the
Acoustical Society of America, IEEE Transactions on Audio, Speech and Language Processing,
International Journal of Speech Technology, etc.
We have already released the current version of the PPL aligner to many different sites,
including NYU, Oxford, Stanford, UIUC, and the University of Chicago. We have built a freely
accessible forced alignment online processing system, residing at http://www.ling.upenn.edu/phonetics/align.html. We will publish new versions of the aligner
annually at no cost, through LDC and the phonetics lab website. To ensure long-term use of the
aligner, we will produce a permanent free-standing tutorial, covering the training and use of the
aligner, and the integration of its output. We will also provide a collection of related Python
scripts. We will also develop web-based applications that integrate forced alignment, database
query, and phonetics research. For example, we have built a web-based search engine for
searching phones and words in the SCOTUS corpus, where the search results are word-aligned
speaking turns (http://165.123.213.123:8180/PhoneticDatabaseSearchEngine/).
We propose the organization of a workshop on the use of very large corpora in speech
research during the first year of the project. The purposes of the workshop will be: 1) to introduce
these techniques to those in the speech-research community who are not familiar with them; and
2) to promote phonetics research using very large corpora with forced alignment as both a tool
and methodology. The workshop will also provide an opportunity for us to test the aligner on
different datasets from the workshop participants, and to seek research collaborations.
At the University of Pennsylvania we have been teaching a course on “corpus phonetics”,
covering relevant Python and Praat scripting, database access, statistical analysis in R, etc. We
plan to teach a similar course at the Linguistic Society of America's 2011 Summer Institute.
4.7. Time schedule
Year 1 (June 2009 – May 2010):
1. Improve segmentation accuracy through error analyses and the incorporation of phonetic
knowledge.
2. Integrate the resulting annotations with database search and retrieval technology.
3. Organize a workshop on forced alignment for phonetics research.
Year 2 (June 2010 – May 2011):
1. Improve the aligner’s robustness to long and spontaneous speech.
2. Expand the aligner to child directed speech and to other languages, including Mandarin
Chinese, Arabic, Hindi, Urdu, Vietnamese, French, and Portuguese.
Year 3 (June 2011 – May 2012):
1. Publish the PPL forced aligner and the data sets used for training the aligner on the web
and through LDC.
2. Investigate new ways to use forced alignment as a methodology for phonetics research.