
Comparison of software for transcription of
speech data
NSA2003-030
SAMUDRAVIJAYA K
Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005
chief@tifr.res.in
ABSTRACT
Transcribed speech data is the raw material for
development of speech recognition and speech
synthesis systems. Appropriate software tools are
necessary for facilitating the process of annotation of
large amounts of speech data. Several speech analysis tools, both commercial and public domain, also
permit transcription of speech data. Here, we present a
comparison of a few prominent tools in the context of
creating speech databases of Indian languages.
Specifically, a comparison is made in terms of features
such as facility for sequential as well as multi-level
labeling, ease of use, extensibility/adaptability, platform
independence, documentation and support.
INTRODUCTION
Annotated speech data is the raw material for
development of speech recognition and speech
synthesis systems. Acoustic-phonetic study of the
speech sounds of a language is essential for
determining the parameters of speech synthesis
systems following an articulatory or parametric approach.
Concatenative synthesizers require an inventory of
labeled segments that are generally excised from
transcribed and time-aligned spoken utterances. Such a
database is also necessary for training continuous
speech recognition systems.
Creation of databases of spoken Indian languages has
attracted the attention of the Government of India and of private bodies. The government has initiated projects for the creation of databases for voice recognition as well as for speech recognition and text-to-speech systems in six Indian languages. The recent initiative of several organizations to form a Linguistic Data Consortium for Indian Languages is another step in this direction.
Development of a speech database involves three steps: (a) identifying text to be spoken that is appropriate for the end use, (b) recording the speech in digital format, and (c) annotating the speech data. This paper deals with software tools that facilitate the last step.
Transcription of a large amount of speech data is laborious and time-consuming, especially if alignment of the transcription with the time waveform is also necessary. In addition to such labeling in terms of segmental units such as phonemes or words, prosodic annotation is necessary for the synthesis of natural-sounding speech.
Appropriate software tools are necessary for facilitating
the annotation process. In fact, a good software tool
should permit annotation at a hierarchy of levels. Several
speech analysis tools, both commercial and public domain, also permit transcription of speech data. Here,
we present a comparison of a few prominent tools in the
context of creating speech databases of Indian
languages. Specifically, a comparison is made in terms
of features such as facility for sequential as well as
hierarchical labeling, ease of use, extensibility,
adaptability, platform independence, documentation and
support. One may also refer to a special issue of Speech
Communication on the topic 'Speech Annotation and
Corpus Tools' [1].
LINGUISTIC ANNOTATION
The term linguistic annotation covers any descriptive or
analytic notations applied to raw language data. The
added notations may include information of various
kinds: multi-tier transcription of speech in terms of units
such as acoustic-phonetic features, syllables, words, etc.; syntactic and semantic analysis; paralinguistic information (stress, speaking rate); and non-linguistic information (speaker's gender, age, voice quality, emotions, dialect, room acoustics, additive noise, channel effects). The web site of the Linguistic Data Consortium [2] describes tools and formats for creating
and managing linguistic annotations. A brief description
of various tools for annotation is also given at that site.
TYPES OF ANNOTATION
In the current work, we focus on the annotation of spoken language. Since spoken language involves a time waveform, the data may not only be annotated in terms of various linguistic events and segments, but the temporal locations of such linguistic entities may also be marked. In other words, the linguistic events and segments may be aligned with the time waveform. Such a segmented speech database is necessary for training acoustic models for speech recognition as well as for speech synthesis using the concatenative approach.
A given speech signal can be annotated in terms of many units: acoustic-phonetic features and events, and segments such as phones, syllables, words and phrases. These units are, in general, related in a hierarchical manner; consequently, a multi-tier annotation scheme is desirable. In addition to linguistic annotation, paralinguistic and non-linguistic information can also be included in the database. While the time intervals of linguistic units are generally non-overlapping, this need not be the case for non-linguistic information. For example, two speakers may speak simultaneously during a dialogue. An ideal transcription scheme should accommodate such cases in addition to multi-tier annotation.
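As a purely illustrative picture of such a scheme, the following Python sketch models an annotation as a set of named tiers of labeled time intervals, where linguistic tiers keep their intervals non-overlapping while a non-linguistic tier (for example, one marking who is speaking) is allowed to contain overlapping intervals; the class and tier names are invented for the example.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Interval:
        start: float  # seconds from the beginning of the recording
        end: float
        label: str

    @dataclass
    class Tier:
        name: str                    # e.g. "phone", "word" or "speaker"
        allow_overlap: bool = False  # True for non-linguistic tiers
        intervals: List[Interval] = field(default_factory=list)

        def add(self, start: float, end: float, label: str) -> None:
            if not self.allow_overlap:
                # Linguistic tiers: reject an interval that overlaps an existing one.
                for iv in self.intervals:
                    if start < iv.end and iv.start < end:
                        raise ValueError(f"overlap with {iv.label!r} on tier {self.name!r}")
            self.intervals.append(Interval(start, end, label))

    # Two speakers talking at the same time can be represented on an overlapping tier.
    speaker = Tier("speaker", allow_overlap=True)
    speaker.add(0.0, 2.5, "speaker-A")
    speaker.add(2.0, 4.0, "speaker-B")  # overlaps speaker-A between 2.0 s and 2.5 s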
TOOLS FOR ANNOTATION
Transcription of large amounts of speech data is labor-intensive and repetitive; nevertheless, it is important to maintain a good, uniform quality of annotation. For transcription of spoken text, human beings not only use speech-specific knowledge, but also benefit from displays of derived information such as energy contours and spectrograms. Therefore, tools that provide the necessary analysis functions together with appropriate display facilities aid the transcription process. Unlike in the past, when special-purpose hardware was necessary for compute-intensive operations, almost all currently available annotation tools are software-based, thanks to the high computational power of prevalent personal computers. In addition to specialized tools, several speech analysis packages, both commercial and public domain, also permit transcription of speech data.
The motivation of this work is to compare the strengths and weaknesses of various software tools that may be used for annotation. This will, hopefully, lead to a set of tools that is most suitable for annotation of spoken Indian languages and can thus become a standard.
ESSENTIAL AND DESIRABLE QUALITIES
It is useful to enumerate the essential and desirable qualities of a good annotation tool so that a fair comparison of tools can be made.
Essential qualities
A good software tool for annotation should be able to handle large quantities of collected data. It should be able to read data recorded and stored in a few standard formats, such as (a) desktop speech: 16-bit PCM at a 16000 Hz sampling frequency, and (b) telephone speech: 8-bit A-law or µ-law at 8000 Hz. Other standard features should include time-synchronized display of the time waveform and spectrogram, the ability to mark and edit segment boundaries, and the ability to attach labels to segments. The tool should be easy and intuitive to use, with adequate documentation, and should be easy to acquire and install.
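As a rough illustration of what reading these standard formats involves, the short Python sketch below loads both kinds of data; the file names are hypothetical, and the A-law decoding relies on the standard-library audioop module, which has been removed from the most recent Python versions, so the sketch is indicative rather than a recommendation of any particular library.

    import array
    import audioop  # standard library; deprecated in Python 3.11 and removed in 3.13
    import wave

    # (a) Desktop speech: 16-bit PCM WAV file sampled at 16000 Hz (hypothetical file name).
    with wave.open("desktop_utterance.wav", "rb") as w:
        assert w.getsampwidth() == 2 and w.getframerate() == 16000
        desktop = array.array("h", w.readframes(w.getnframes()))  # 16-bit samples, native byte order

    # (b) Telephone speech: headerless 8-bit A-law data sampled at 8000 Hz (hypothetical file name).
    with open("telephone_utterance.alaw", "rb") as f:
        alaw = f.read()
    telephone = array.array("h", audioop.alaw2lin(alaw, 2))  # decode to 16-bit linear PCM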
Desirable qualities
In addition to the essential qualities listed above, the following properties make a good tool a better one. Such qualities encourage annotators to adopt a tool as a standard.
Recording: should handle data collected under a variety of recording conditions and stored in various formats, including stereo recordings.
Labeling: should permit not only sequential labeling but
also labeling at multiple levels (e.g., acoustic-phonetic,
syllable, word, prosodic). The tool should not only allow a user to define his/her own set of symbols, but also facilitate the use of IPA symbols. The storage of annotations should
be based on standards such as ASCII, ISCII, Unicode
and XML so that the annotation is portable across tools.
Platform independence: versions of the tool should be
available for multiple operating systems such as Microsoft Windows and GNU/Linux.
Versatility: The software should permit the display of non-standard temporal features, such as pitch contours and the overlay of formant contours on spectrograms. Features such as user-controllable time and frequency resolution of the spectrogram, display of the instantaneous power spectrum, and waveform zoom in/out make a tool much more attractive to use. The value of a software package increases further if its source code is available for customization by users.
Usability: A good user guide (preferably in the form of
online help) enables a novice to get started quickly.
Proper documentation helps a user exploit advanced facilities for efficient annotation. Wide use of the software around the world, a discussion group related to the tool, and email support from its developers all increase the potential of a tool to be adopted as a standard.
Analysis/search: A good tool should have facilities for advanced analysis of the data, yielding features such as spectral smoothing via linear prediction or cepstral analysis, voiced/unvoiced decisions, and built-in pitch and formant extraction and tracking routines. An additional desirable property of a good tool is an in-built search engine that enables a user to search for and extract segments with user-specified properties. This facilitates acoustic-phonetic studies of a language and the discovery of usage patterns in a language, which may come in handy for developing better language models (a minimal sketch of such a search-and-extract step is given after this list of qualities).
Cost: The price of an annotation tool should not be high
so that a large number of users can acquire it and use it.
Freely downloadable public domain software is much more attractive in the Indian context. However, the better documentation and after-sales service of commercial software should not be overlooked.
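As a minimal sketch of the search-and-extract facility described under Analysis/search, the Python code below scans a plain-text label file (assumed here to contain one `start end label' triple per line, a made-up format) and cuts the matching stretches out of a mono, 16-bit PCM WAV file; both file names are hypothetical.

    import array
    import wave

    def find_segments(label_path: str, wanted: str):
        """Yield (start, end) times, in seconds, of segments whose label equals `wanted`."""
        with open(label_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                start, end, label = line.split(maxsplit=2)
                if label.strip() == wanted:
                    yield float(start), float(end)

    def extract(wav_path: str, start: float, end: float) -> array.array:
        """Return the samples of a mono, 16-bit PCM WAV file between `start` and `end` seconds."""
        with wave.open(wav_path, "rb") as w:
            rate = w.getframerate()
            w.setpos(int(start * rate))
            frames = w.readframes(int((end - start) * rate))
        return array.array("h", frames)

    # Example: collect every segment labelled "A" for further acoustic-phonetic analysis.
    segments = [extract("utterance.wav", s, e)
                for s, e in find_segments("utterance.lab", "A")]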
POPULAR SIGNAL ANALYSIS SOFTWARE
Several acoustic signal analysis packages are available in the public domain and have been widely used for audio recording and editing. Because users are familiar with such software, questions are often asked about its suitability for annotation. A few such popular packages have been briefly examined in this context.
Goldwave: Goldwave (www.goldwave.com) is a
versatile sound editor. It claims to have spectrogram
capabilities (not present in the demo version). However, it does not appear to be suitable for segmentation and annotation of speech data, as no such facilities are built into the software.
SpeechStation: This is a good speech analysis package from Sensimetrics Corporation (www.sens.com). The
primary display of SpeechStation2 shows the audio
waveform and its spectrogram. Other available displays
show spectral slices, LPC-based spectrograms, formant
tracks, pitch tracks, 3-D and line waterfall plots, vowel
space plots, and real-time spectrograms. However, no facility for segmentation and annotation of speech data is available.
CSL: Computerized Speech Lab, model 4400
(www.kayelemetrics.com) along with Phonetic Database
(model 4332: special accessory) primarily contains an
annotated and time-aligned speech database (45
languages) for teaching phonetic sciences. Included with
the database is a custom program for using program
tools from CSL/Multi-Speech to fully explore the acoustic
properties. The program also provides listening tools and
the ability to add transcription using IPA symbols. It is
not clear whether sub-phonetic labels (e.g., the closure of /p/) can be used. It does not permit multi-tier labeling.
Transcriber: (www.isip.msstate.edu/projects/speech/software/legacy/transcriber/index.html). This is a graphical user interface tool for speech segmentation and transcription. The tool provides spectrograms and energy plots, speech selection, and audio playback capabilities. It currently handles 16-bit raw data only; data in other formats need to be converted using external tools. Multiple levels of labeling are not possible.
PRAAT AND EMU
Praat and Emu are two public domain software tools that satisfy all the essential qualities and many of the desirable facilities of a good annotation tool. Hence, these two have been examined in some detail. We will first deal with Emu, and then describe Praat, the most suitable software tool.
EMU
The Emu system, from Macquarie University, Australia (http://emu.sourceforge.net), offers consistent access to diverse speech databases, with facilities for easy extraction of statistics, and support for database creation as well. Emulabel, a tool of Emu, permits complex multi-tiered and hierarchical structures, and these can be built using a combination of manual and automatic annotation.
The advantage of Emulabel over Praat is that it permits not only multi-tier labeling but also hierarchical structures and their visual display. The strongest point of Emulabel is its built-in search engine. For example, a
question such as "find all occurrences of phonetic
segment A" lists all the relevant segments within the
database along with their time boundaries, and
optionally extracts such segments for further analysis.
Moreover, a future version (2.0) plans to use the Annotation Graph library from the LDC to support a standard data
model for annotations and make use of shared file
input/output code for different kinds of annotations. In
addition, Emulabel is the tool suggested for editing labels and segments by festvox (a Festival-based concatenative speech synthesis system) [3]. Thus, persons using
festvox are likely to be familiar with Emulabel.
The major drawback of Emulabel is the lack of built-in analysis tools. While the recently released version (1.8) is said to include formant and pitch tracking tools based on Snack, formants and pitch have to be pre-computed and stored for display. This makes it less convenient for novice users to adopt the system. Moreover, the documentation currently available is for version 1.2 and does not cover several important facilities of the latest version, such as the overlay of formant tracks in the spectrogram window. So, the current version of Emulabel is not well suited for wide use by novice annotators, although it has high potential to become a standard in the future.
PRAAT
Praat is a product of the Phonetic Sciences department of the University of Amsterdam [4], and is hence oriented towards acoustic-phonetic studies by phoneticians. Its multiple functionalities include speech analysis, synthesis and manipulation, labeling and segmentation, and listening
experiments. It is public domain software whose source code is openly available, and hence it can be modified by a user. Source code and binary executables for many popular platforms, including Microsoft Windows and Linux, can be downloaded. Praat can handle large data files and can read and write many sound formats. It has a powerful graphical interface and an online manual. As a bonus, learning algorithms and statistical analysis routines have been integrated with Praat. There is a user group on the Internet, and the tool is constantly upgraded in response to users' requests.
Figure 1. Graphical environment of Praat for labeling and segmentation
Labeling and segmentation
Praat can be used for labeling events as well as
segments. It permits multiple levels of labeling. The
labels and boundary information are stored in a text file
in a Praat-specific format. Every labeled segment is associated with its start and end times. This permits
labeling of selected segments, and not necessarily all
the segments in a file. This feature is particularly useful
for segmentation of diphone-like units in running speech
for concatenative speech synthesis. Since the label file
is a plain ASCII file, scripts can be written to convert it to
any other format; thus, label and time information are
portable.
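As an example of such a conversion script, the Python sketch below pulls the (start, end, label) triples out of a TextGrid saved in Praat's default long text format and rewrites them as simple XML. It pools all interval tiers together, ignores point tiers and escaped quotes, and uses hypothetical file names, so it is a starting point rather than a complete converter.

    import re
    import xml.etree.ElementTree as ET

    # One labelled interval in the long TextGrid text format looks like:
    #     xmin = 0.25
    #     xmax = 0.40
    #     text = "b"
    INTERVAL = re.compile(r'xmin = ([0-9.]+)\s+xmax = ([0-9.]+)\s+text = "([^"]*)"')

    with open("utterance.TextGrid", encoding="utf-8") as f:  # note: Praat can also save UTF-16
        grid = f.read()

    root = ET.Element("annotation", source="utterance.TextGrid")
    for start, end, label in INTERVAL.findall(grid):
        ET.SubElement(root, "segment", start=start, end=end).text = label

    ET.ElementTree(root).write("utterance.xml", encoding="utf-8", xml_declaration=True)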
The main strength of Praat is its powerful graphical interface. Figure 1 shows the graphical user interface for labeling and segmentation. The top panel displays the time waveform. The middle panel shows the spectrogram overlaid with four formant tracks; the energy and pitch contours are also shown in the same panel. Below the spectrogram panel are two panels corresponding to labels at the word and acoustic-phonetic levels. The boundaries of segments at both levels have been marked. Any boundary can be shifted by dragging it, and the labels can be edited by clicking on them. In the picture, the label of the segment corresponding to the voice bar of /b/ is highlighted and is ready to be edited. All the panels are time-synchronized and can be zoomed in or out. Any part of the signal can be played back through a speaker.
Pseudocode for labeling a file
The steps involved in labeling a speech file in terms of
two levels (word and acoustic-phonetic) using Praat are
enumerated in a tutorial fashion below.
Invoke Praat: Double-click on the Praat icon in Windows, or type praat & at the command prompt in a Linux environment. Two windows will appear: Praat Objects and Praat Picture; ignore or iconize the Praat Picture window.
Select wave file: Click on `Read from File' in the `Read' menu of the Praat Objects window. This will open a file browser window; select the wave file to be labeled.
Create/Read label file: To create a new label file, click on `Create TextGrid' in the `New' menu. Set the time range as 0.0 to 0.5, set `Tier names' to `word ac-phonetic', leave `Point tiers' blank, and click OK. If this is a continuation of a previous labeling session for the same file, read the existing label file using the `Read' menu.
Display: Highlight both the Sound and TextGrid entries; a new menu will appear on the right side of the Praat Objects window. Click on Edit; a new window will appear with the waveform, the spectrogram and two panels for the two levels of labeling (see Figure 1). The word label panel will be highlighted (yellow background) to indicate that the current labeling will be at the word level.
Play: One can play the signal in the window by pressing `Shift-tab'. To play a segment, mark the segment in the waveform/segment window using the mouse and press `tab'.
Segment and Label: To label at the acoustic-phonetic level, click on that panel. To place a boundary, click on the waveform or segment window, type in the label of the segment (for example, sil) and press `Enter'; a boundary will be placed. Repeat the process to mark the other boundaries. To segment and label at the word level, click on the word label panel. Drag a boundary to correct its placement, and click on a segment and edit its label, if necessary.
Save: Press `Alt-S' to save the label file in ASCII format.
Quit: The program will exit when you press `Alt-Q' in the Praat Objects window.
A desirable feature in Praat
Sometimes, the sequence of labels is known in advance and is available in machine-readable form. For example, if speakers read pre-determined text, the word sequence is known and the corresponding phone sequence can be generated automatically. If Praat permitted reading in such a label sequence, the user would not have to type in the labels; (s)he would only have to mark the boundaries of the segments. This facility was provided in the xwaves+ software from Entropic Laboratory (unfortunately, xwaves+ is no longer available).
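In the absence of such a facility, one workaround is to generate the label file outside Praat. The Python sketch below writes a TextGrid in Praat's long text format containing a known word sequence spread over equal-length intervals, so that, after opening it with `Read from File', the annotator only has to drag the boundaries into place. The word list, duration and file name are invented for the example, and the exact layout should be checked against a TextGrid saved by one's own version of Praat.

    # A known word sequence, total duration and output file name, all invented for the example.
    words = ["this", "is", "a", "test"]
    duration = 2.0                      # seconds
    step = duration / len(words)

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {duration}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        '        name = "word"',
        '        xmin = 0',
        f'        xmax = {duration}',
        f'        intervals: size = {len(words)}',
    ]
    for i, word in enumerate(words):
        lines += [
            f'        intervals [{i + 1}]:',
            f'            xmin = {i * step}',
            f'            xmax = {(i + 1) * step}',
            f'            text = "{word}"',
        ]

    with open("utterance.TextGrid", "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")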
CONCLUSION
Well-designed, flexible software tools can make the task of annotating large amounts of speech data less cumbersome. The features of a few popular sound editing programs as well as speech analysis tools have been examined to assess their suitability for labeling and segmentation of spoken Indian languages. Praat, a public domain software package designed for acoustic-phonetic analysis, appears to be the most suitable for this purpose.
REFERENCES
[1] Speech Communication, vol. 33, nos. 1-2, 2001.
[2] http://www.ldc.upenn.edu/annotation
[3] http://www.festvox.org
[4] http://www.fon.hum.uva.nl/praat