Comparison of software for transcription of speech data

NSA2003-030

SAMUDRAVIJAYA K
Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005
chief@tifr.res.in

ABSTRACT

Transcribed speech data is the raw material for the development of speech recognition and speech synthesis systems. Appropriate software tools are necessary to facilitate the annotation of large amounts of speech data. Several speech analysis tools, commercial as well as public domain, also permit transcription of speech data. Here, we present a comparison of a few prominent tools in the context of creating speech databases of Indian languages. Specifically, a comparison is made in terms of features such as facility for sequential as well as multi-level labeling, ease of use, extensibility/adaptability, platform independence, documentation and support.

INTRODUCTION

Annotated speech data is the raw material for the development of speech recognition and speech synthesis systems. Acoustic-phonetic study of the speech sounds of a language is essential for determining the parameters of speech synthesis systems that follow an articulatory or parametric approach. Concatenative synthesizers require an inventory of labeled segments that are generally excised from transcribed and time-aligned spoken utterances. Such a database is also necessary for training continuous speech recognition systems.

Creation of databases of spoken Indian languages has attracted the attention of the Government of India and private bodies. The government has initiated projects for the creation of databases for Voice Recognition as well as Speech Recognition and Text-to-Speech systems in 6 Indian languages. The recent initiative of several organizations to form a Linguistic Data Consortium for Indian Languages is another step in this direction.

Development of a speech database involves three steps: (a) identifying text to be spoken that is appropriate for the end use, (b) recording speech in digital format and (c) annotation of speech data.
This paper deals with software tools that facilitate the last step. Transcription of large amounts of speech data is laborious and time-consuming, especially if the transcription must also be aligned with the time waveform. In addition to such labeling in terms of segmental units such as phonemes or words, prosodic annotation is also necessary for synthesis of natural sounding speech. Appropriate software tools are necessary for facilitating the annotation process. In fact, a good software tool should permit annotation at a hierarchy of levels.

Several speech analysis tools, commercial as well as public domain, also permit transcription of speech data. Here, we present a comparison of a few prominent tools in the context of creating speech databases of Indian languages. Specifically, a comparison is made in terms of features such as facility for sequential as well as hierarchical labeling, ease of use, extensibility, adaptability, platform independence, documentation and support. One may also refer to a special issue of Speech Communication on the topic 'Speech Annotation and Corpus Tools' [1].

LINGUISTIC ANNOTATION

The term linguistic annotation covers any descriptive or analytic notations applied to raw language data. The added notations may include information of various kinds: multi-tier transcription of speech in terms of units such as acoustic-phonetic features, syllables, words etc.; syntactic and semantic analysis; paralinguistic information (stress, speaking rate); non-linguistic information (speaker's gender, age, voice quality, emotions, dialect, room acoustics, additive noise, channel effects). The internet site of the Linguistic Data Consortium [2] describes tools and formats for creating and managing linguistic annotations. A brief description of various tools for annotation is also given at that site.

Essential qualities

TYPES OF ANNOTATION

In the current work, we will focus on annotation of spoken language.
Since spoken language involves a time waveform, the data may not only be annotated in terms of various linguistic events and segments; the temporal locations of such linguistic entities may also be marked. In other words, the linguistic events and segments may be aligned with the time waveform. Such a segmented speech database is necessary for training acoustic models for speech recognition as well as for speech synthesis using the concatenative approach.

A good software tool for annotation should be able to handle the large quantities of data collected. It should be able to read data recorded and stored in a few standard formats such as (a) desktop speech: 16-bit PCM, 16000 Hz sampling frequency; (b) telephone speech: 8-bit A-law or μ-law, 8000 Hz. Other standard features should include time-synchronized display of the time waveform and spectrogram, the ability to mark and edit segment boundaries, and the ability to attach labels to segments. The tool should be easy and intuitive to use, with adequate documentation. It should be easy to acquire and install.

Desirable qualities

Given speech data can be annotated in terms of many acoustic-phonetic units such as acoustic-phonetic features and events, and segments such as phones, syllables, words, phrases etc. These units, in general, are related in a hierarchical manner; consequently, a multi-tier annotation scheme is desirable. In addition to linguistic annotation, paralinguistic and non-linguistic information can also be included in the database. While the time intervals of linguistic units are generally non-overlapping, this need not be so for non-linguistic information. For example, two speakers may be speaking simultaneously during a dialogue. An ideal transcription scheme should accommodate such cases in addition to multi-tier annotation.

TOOLS FOR ANNOTATION

Transcription of large amounts of speech data is labor-intensive and repetitive. However, it is important to maintain a good, uniform quality of annotation.
For transcription of spoken text, human beings not only use speech-specific knowledge, but also benefit from the display of derived information such as the energy contour and spectrogram. Therefore, tools that provide the necessary analytic functions together with appropriate display facilities aid the transcription process. Unlike in the past, when special purpose hardware was necessary for compute-intensive operations, almost all currently available annotation tools are software based, thanks to the high computational power of prevalent personal computers. In addition to specialized tools, several speech analysis software packages, commercial as well as public domain, also permit transcription of speech data. The motivation of this work is to compare the strengths and weaknesses of various software tools that may be used for annotation. This will, hopefully, lead to a set of tools that are most suitable for annotation of spoken Indian languages and thus become a standard.

ESSENTIAL AND DESIRABLE QUALITIES

It is better to enumerate the essential and desirable qualities of a good annotation tool so that a fair comparison of tools can be made. In addition to the essential qualities listed above, the following properties make a good tool a better one. Such qualities encourage annotators to adopt a tool as a standard.

Recording: The tool should handle data collected under a multiplicity of recording conditions and stored in various storage formats, including stereo recording.

Labeling: The tool should permit not only sequential labeling but also labeling at multiple levels (e.g., acoustic-phonetic, syllable, word, prosodic). It should not only allow a user to define his/her own set of symbols, but also facilitate the use of IPA symbols. The storage of annotation should be based on standards such as ASCII, ISCII, Unicode and XML so that the annotation is portable across tools.
Platform independence: Versions of the tool should be available for use on multiple operating systems such as Microsoft Windows and GNU/Linux.

Versatility: The software should permit the display of non-standard temporal features such as the pitch contour and the overlay of formant contours on spectrograms. Features such as user-controllable time and frequency resolution of the spectrogram, display of the instantaneous power spectrum, and waveform zoom in/out make a tool much more attractive for use. The value of a software tool increases if its source code is available for customization by users.

Usability: A good user guide (preferably in the form of online help) enables a novice to get started quickly. Proper documentation helps a user to utilize advanced facilities for efficient annotation. Wide use of the software around the world, a discussion group related to the tool, and email support from developers increase the potential of a tool to be adopted as a standard.

Analysis/search: A good tool should have facilities for advanced analysis of data, such as spectral smoothing via Linear Prediction or Cepstral analysis, voiced/unvoiced decision, built-in pitch and formant extraction routines, and pitch and formant tracking. An additional desirable property of a good tool is an inbuilt search engine that enables a user to search for and extract segments with user-specified properties. This facilitates acoustic-phonetic studies of a language and discovery of patterns of usage of a language. This may come in handy in developing better language models.

Cost: The price of an annotation tool should not be so high that it prevents a large number of users from acquiring and using it. Freely downloadable public domain software is much more attractive in the Indian context. However, the better documentation and good after-sales service of commercial software should not be overlooked.
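
The multi-level labeling and search qualities listed above can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: the names Segment, Annotation, find and parent are hypothetical and do not correspond to the data model or interface of any tool discussed in this paper, and the time values are invented. It stores time-aligned segments in named tiers, searches one tier for segments with a user-specified property, and relates each hit to the enclosing unit on a higher tier.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """One labeled, time-aligned segment (times in seconds)."""
    start: float
    end: float
    label: str


@dataclass
class Annotation:
    """Hypothetical multi-tier annotation: tier name -> list of segments."""
    tiers: dict = field(default_factory=dict)

    def add(self, tier, start, end, label):
        self.tiers.setdefault(tier, []).append(Segment(start, end, label))

    def find(self, tier, predicate):
        """Return all segments on a tier that satisfy a user-supplied test."""
        return [s for s in self.tiers.get(tier, []) if predicate(s)]

    def parent(self, segment, parent_tier):
        """Return the segment on parent_tier whose interval contains `segment`."""
        for cand in self.tiers.get(parent_tier, []):
            if cand.start <= segment.start and segment.end <= cand.end:
                return cand
        return None


# Invented example utterance with a word tier and a phone tier.
utt = Annotation()
utt.add("word", 0.10, 0.45, "baba")
utt.add("phone", 0.10, 0.16, "b")
utt.add("phone", 0.16, 0.27, "a")
utt.add("phone", 0.27, 0.33, "b")
utt.add("phone", 0.33, 0.45, "a")

# Query: all occurrences of phone "a" that last longer than 100 ms.
hits = utt.find("phone", lambda s: s.label == "a" and s.end - s.start > 0.1)
for seg in hits:
    print(seg, "in word", utt.parent(seg, "word").label)
```

Keeping each tier as an independent list of intervals also accommodates overlapping annotations (for example, two speakers talking simultaneously), since nothing in this structure requires intervals on different tiers to be disjoint.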
PRAAT AND EMU

Praat and Emu are two public domain software tools that satisfy all the essential qualities and many of the desirable qualities of a good software tool for annotation. Hence, these tools have been examined in some detail. We will first deal with Emu, and then describe Praat, the most suitable software tool.

EMU

POPULAR SIGNAL ANALYSIS SOFTWARE

Several acoustic signal analysis software packages are available in the public domain and have been widely used for audio recording and editing. Because users are familiar with such software, they have asked about its suitability for annotation. A few such popular packages have been briefly examined in this context.

Goldwave: Goldwave (www.goldwave.com) is a versatile sound editor. It claims to have spectrogram capabilities (not present in the demo version). However, it does not appear to be suitable for segmentation and annotation of speech data, as no such facilities are built into the software.

Speechstation: This is a good speech analysis software package from Sensimetrics Corporation (www.sens.com). The primary display of SpeechStation2 shows the audio waveform and its spectrogram. Other available displays show spectral slices, LPC-based spectrograms, formant tracks, pitch tracks, 3-D and line waterfall plots, vowel space plots, and real-time spectrograms. However, no facility for segmentation and annotation of speech data is available.

CSL: Computerized Speech Lab, model 4400 (www.kayelemetrics.com), along with the Phonetic Database (model 4332: special accessory), primarily contains an annotated and time-aligned speech database (45 languages) for teaching phonetic sciences. Included with the database is a custom program that uses tools from CSL/Multi-Speech to explore the acoustic properties fully. The program also provides listening tools and the ability to add transcription using IPA symbols. It is not clear whether sub-phonetic labels (e.g., closure of /p/) can be used.
It does not permit multi-tier labeling.

Transcriber: (www.isip.msstate.edu/projects/speech/software/legacy/transcriber/index.html) This is a graphical user interface tool for speech segmentation and speech transcription. The tool provides spectrograms and energy plots, speech selection, and audio playback capabilities. It currently handles 16-bit raw data only; data in other formats need to be converted using external tools. Multiple levels of labeling are not possible.

The Emu system, from Macquarie University, Australia (http://emu.sourceforge.net), offers consistent access to diverse speech databases, with facilities for easy extraction of statistics, and support for database creation as well. Emulabel, a tool of Emu, permits complex multi-tiered and hierarchical structures, and these can be built using a combination of manual and automatic annotation. The advantage of Emulabel over Praat is that it permits not only multi-tier labeling but also hierarchical structures and their visual display. The strongest point of Emulabel is its built-in search engine. For example, a query such as "find all occurrences of phonetic segment A" lists all the relevant segments within the database along with their time boundaries, and optionally extracts such segments for further analysis. Moreover, a future version (2.0) plans to use the Annotation Graph library from the LDC to support a standard data model for annotations and to make use of shared file input/output code for different kinds of annotations. In addition, Emulabel is suggested for editing labels and segments by Festvox (a Festival-based concatenative speech synthesis system) [3]. Thus, persons using Festvox are likely to be familiar with Emulabel.

The major drawback of Emulabel is the lack of built-in analysis tools. While the recently released version (1.8) is said to include formant and pitch tracking tools based on Snack, formants and pitch have to be pre-computed and stored for display.
This makes it less convenient for novice users to adopt the system. Moreover, the documentation currently available is for version 1.2 and does not contain information about several important facilities in the latest version, such as the overlay of formant tracks in the spectrogram window. So, the current version of Emulabel is not well suited for wide usage by novice annotators, although it has high potential to become a standard in future.

PRAAT

Praat is a product of the Phonetic Sciences department of the University of Amsterdam [4] and hence is oriented towards acoustic-phonetic studies by phoneticians. It has multiple functionalities that include speech analysis/synthesis and manipulation, labeling and segmentation, and listening experiments. It is public domain software based on Tcl/Tk and hence can be modified by a user. Source code and binary executable versions for many popular platforms, including Microsoft Windows and Linux, are downloadable. Praat can handle large data files, and can read and write many sound file types. It has a powerful graphical interface and an online manual. As a bonus, learning algorithms and statistical analysis routines have been integrated with Praat. There is a user group on the Internet; the tool is being constantly upgraded in response to users' requests.

Figure 1. Graphical environment of Praat for labeling and segmentation

Labeling and segmentation

Praat can be used for labeling events as well as segments. It permits multiple levels of labeling. The labels and boundary information are stored in a text file in a Praat-specific format. Every labeled segment is associated with its start and end times. This permits labeling of selected segments, and not necessarily all the segments in a file. This feature is particularly useful for segmentation of diphone-like units in running speech for concatenative speech synthesis. Since the label file is a plain ASCII file, scripts can be written to convert it to any other format; thus, label and time information are portable.
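
As an illustration of such a conversion script, the following Python sketch reads the interval tiers of a Praat label file and writes one "start end label" line per segment. It is a minimal sketch under several assumptions: the embedded sample uses the long text TextGrid layout written by current Praat versions (the layout written by the version examined here may differ); the segment times and labels are invented; only interval tiers are handled; labels containing double quotes are not supported; and the tab-separated output layout is a hypothetical target format, not the format of any particular tool.

```python
import re

# Invented sample in the long text TextGrid layout (an assumption, see above).
SAMPLE_TEXTGRID = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "ac-phonetic"
        xmin = 0
        xmax = 0.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.12
            text = "sil"
        intervals [2]:
            xmin = 0.12
            xmax = 0.5
            text = "b"
'''

# Start of an interval tier, capturing the tier name.
TIER_RE = re.compile(r'class = "IntervalTier"\s+name = "([^"]*)"')
# One labeled interval: start time, end time, label text.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s+xmin = ([-\d.eE+]+)\s+xmax = ([-\d.eE+]+)\s+text = "([^"]*)"'
)


def parse_interval_tiers(text):
    """Return {tier_name: [(start, end, label), ...]} for all interval tiers."""
    tiers = {}
    matches = list(TIER_RE.finditer(text))
    for i, m in enumerate(matches):
        # A tier's body runs up to the start of the next interval tier.
        stop = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():stop]
        tiers[m.group(1)] = [
            (float(a), float(b), label)
            for a, b, label in INTERVAL_RE.findall(body)
        ]
    return tiers


def to_label_lines(segments, skip_empty=True):
    """Format segments as tab-separated 'start end label' lines."""
    lines = []
    for start, end, label in segments:
        if skip_empty and not label:
            continue  # unlabeled stretches of the file are dropped
        lines.append(f"{start:.3f}\t{end:.3f}\t{label}")
    return lines


tiers = parse_interval_tiers(SAMPLE_TEXTGRID)
label_lines = to_label_lines(tiers["ac-phonetic"])
print("\n".join(label_lines))
```

Skipping empty labels by default reflects the point made above that only selected segments of a file need to be labeled; intervals left unlabeled are simply omitted from the converted output.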
The main strength of Praat is its powerful graphical interface. Figure 1 shows the graphical user interface for labeling and segmentation. The top panel displays the time waveform. The middle panel has the spectrogram overlaid with four formant tracks. The energy and pitch contours are also shown in the same panel. Below the spectrogram panel are two panels corresponding to labels at the word and acoustic-phonetic levels. The boundaries of segments at both levels have been marked. Any boundary can be shifted by dragging it. The labels can be edited by clicking on them. In the picture, the label of the segment corresponding to the voice bar of /b/ is highlighted and is ready to be edited. All the panels are time-synchronized and can be zoomed in or out. Any part of the signal can be played on a speaker.

Pseudocode for labeling a file

The steps involved in labeling a speech file at two levels (word and acoustic-phonetic) using Praat are enumerated in a tutorial fashion below.

Invoke Praat: Double click on the Praat icon in Windows; type praat & at the command prompt in a Linux environment. Two windows will appear: Praat Objects and Praat Picture. Ignore or iconize the Praat Picture window.

Select wave file: Click on `Read from File' in the `Read' menu in the Praat Objects window. This will open a file browser window. Select the wave file to be labeled.

Create/Read label file: To create a new label file, click on `Create TextGrid' in the `New' menu. Select the time range as 0.0 to 0.5. Set `Tier names' as `word ac-phonetic'; leave `Point tiers' blank. Click OK. If this is a continuation of a previous labeling session related to the same file, read the label file using the `Read' menu.

Display: Highlight both the Sound and TextGrid lines; a new menu will appear on the right side of the Praat Objects window. Click on Edit; a new window will appear with the waveform, the spectrogram and two panels for two levels of editing (see Figure 1 for reference).
The word label panel will be highlighted (yellow background) to indicate that the current labeling will be at the word level.

Play: One can play the signal in the window by pressing `Shift-tab'. To play a segment, mark the segment in the waveform/segment window using the mouse and press `tab'.

Segment and Label: To label at the acoustic-phonetic level, click on that panel. To place a boundary, click on the waveform or segment window. Type in the label of the segment (for example, sil); then press `Enter'; a boundary will be placed. Repeat the process to mark other boundaries. To segment and label at the word level, click on the word label panel. Drag a boundary to correct its placement. Click on a segment and edit its label, if necessary.

Save: Press `Alt-S' to save the label file in ASCII format.

Quit: The program will exit when you press `Alt-Q' in the Praat Objects window.

A desirable feature in Praat

Sometimes, the sequence of labels is known in advance and is available in machine-readable form. For example, if speakers read pre-determined text, the word sequence is known and the corresponding phone sequence can be generated automatically. If Praat could read such a label sequence, the user would not have to type in the labels; (s)he would only have to mark the boundaries of the segments. This facility was provided in the xwaves+ software from Entropic Laboratory (unfortunately, xwaves+ is no longer available).

CONCLUSION

Well-designed, flexible software tools can make the task of annotating large amounts of speech data less cumbersome. The features of a few popular sound editing programs as well as speech analysis tools have been examined to assess their suitability as tools for labeling and segmentation of spoken Indian languages. Praat, a public domain software tool designed for acoustic-phonetic analysis, appears to be the most suitable for this purpose.

REFERENCES

[1] Speech Communication, 33, numbers 1-2, 2001.
[2] http://www.ldc.upenn.edu/annotation
[3] http://www.festvox.org
[4] http://www.fon.hum.uva.nl/praat