Specifications on the C-ORAL-ROM Corpus

Massimo Moneglia (University of Florence)

C-ORAL-ROM is available in the ELDA catalogue:
http://catalog.elda.org:8080/product_info.php?cPath=37_46&products_id=757

© University of Florence. All rights reserved. No part of these specifications may be reproduced in any form without the permission of the copyright holder.

Distributed with the C-ORAL-ROM corpus by ELRA/ELDA
55-57 rue Brillat Savarin
75013 Paris
France

Contents

I Introduction
II General Information
  1. Contact person
  2. Distribution media
  3. Content
  4. Format of speech and label file
  5. Layout of the disk-file system
  6. Hardware, Software and Recording Platforms
    6.1 Hardware
    6.2 Recording Platforms and Software
  7. Number of recordings and general corpus data
  8. Acoustic quality
III C-ORAL-ROM corpus
  1. Corpus Design
    1.2 Sampling Parameters
    1.3 Sampling strategy and Corpus design
    1.4 Comparability
  2. Filename conventions
  3. Annotation information
    3.1 Meta-Data
      3.1.2 Rules for the Participant field
      3.1.3 Rules for the Class field
      3.1.4 Rules for the Situation field
      3.1.5 Rules for marking Acoustic quality
      3.1.6 Quality assurance on metadata format
      3.1.7 Statistics from C-ORAL-ROM metadata
        3.1.7.1 Number of speakers
        3.1.7.2 Distribution of speakers per geographical origin
        3.1.7.3 Completeness of speaker features in the metadata records
        3.1.7.4 Completeness of session metadata
    3.2 Transcription and Dialogue representation
      3.2.1 Basic Concepts for dialogue representation
      3.2.2 Turn representation
      3.2.3 Utterance representation
      3.2.4 Word representation
      3.2.5 Transcription
      3.2.6 Overlapping
      3.2.7 Cross-over dialogue convention
        3.2.7.1 Overlapping and cross-over dialogue
        3.2.7.2 Intersection of turns
        3.2.7.3 Interruption and cross-over dialogue
      3.2.8 Transcription conventions for segmental features
        3.2.8.1 Non-understandable words
        3.2.8.2 Paralinguistic elements
        3.2.8.3 Fragments
        3.2.8.4 Interjections
        3.2.8.5 Non-standard words
        3.2.8.6 Non-transcribed words
        3.2.8.7 Non-transcribed audio signal
      3.2.9 Transcription conventions for human-machine interaction
      3.2.10 Quality assurance on format and orthographic transcription
    3.3 Prosodic annotation scheme
      3.3.1 Principles
      3.3.2 Concepts
      3.3.3 Theoretical background
      3.3.4 Conventions for prosodic tagging in the transcripts: types of prosodic breaks
        3.3.4.1 Terminal breaks (utterance limit)
        3.3.4.2 Non-terminal breaks
      3.3.5 Fragmentation phenomena
        3.3.5.1 Interruptions
        3.3.5.2 Retracting and/or restart and/or false start(s)
        3.3.5.3 Retracting/interruption ambiguity
      3.3.6 Summary of prosodic break types
      3.3.7 Pauses
    3.4 Quality assurance on prosodic tagging
    3.5 Dependent lines
    3.6 Alignment
      3.6.1 Annotation procedure
      3.6.2 Prosodic tagging and the alignment unit
      3.6.3 Quality assurance on the alignment
      3.6.4 WinPitch Corpus
      3.6.5 DTD of WinPitch Corpus alignment files
    3.7 PoS tagging and lemmatization
      3.7.1 Minimal Tag Set requirements
      3.7.2 Tagging Format
      3.7.3 Frequency Lists Format
      3.7.4 Tag sets
        3.7.4.1 French tagset
        3.7.4.2 Italian tagset
        3.7.4.3 Portuguese tagset
        3.7.4.4 Spanish tagset
      3.7.5 Automatic PoS tagging: tool and evaluation
        3.7.5.1 Italian: tool and evaluation
        3.7.5.2 French: tool and evaluation
        3.7.5.3 Portuguese: tool and evaluation
        3.7.5.4 Spanish: tool and evaluation
  4. XML Format of the textual resource
    4.1 Macro for the translation of C-ORAL-ROM .txt files to .xml
      4.1.1 Checking the C-ORAL-ROM format
      4.1.2 Generating XML files
      4.2.1 Requirements and procedure
      4.2.2 Rectifying errors
    4.3 C-ORAL-ROM DTD
  5. Bibliographical references
APPENDIXES
  Appendix 1 – Typical examples of prosodic break types in Italian, French, Portuguese and Spanish
    Examples of strings with typical non-terminal breaks, generic terminal breaks and interrogative breaks
    Examples of strings with typical intentional suspension
    Examples of strings with typical interruption
    Examples of strings with typical retracting
    Examples of retracting/interruption ambiguity
  Appendix 2 – Tagsets used for PoS tagging in the four language collections: detailed tables and comparison table
    French tagset
    Italian tagset
    Portuguese tagset
    Spanish tagset
    Synopsis of tag sets
  Appendix 3 – Orthographic transcription conventions in the four language corpora
    Italian transcription conventions
    French transcription conventions
    Portuguese transcription conventions
    Spanish transcription conventions
  Appendix 4 – C-ORAL-ROM prosodic tagging evaluation report
    1 Introduction
    2 Experimental Setting
    3 Measures and Statistics
      3.1 Evaluation data
      3.2 First step: binary comparison file
      3.3 Second step: ternary comparison file
      3.4 Third step: measures
      3.5 Statistics
        3.5.1 Percentages
        3.5.2 Kappa coefficient
    4 Results
      4.1 General Data
      4.2 Binary Comparison
        4.2.1 Binary Comparison: Terminal Break Confirmation
        4.2.2 Binary Comparison: Terminal Missing
        4.2.3 Binary Comparison: Non Terminal Missing
        4.2.3' Binary Comparison: Non Terminal Deletion
        4.2.4 Binary Comparison: Added Terminal
        4.2.5 Binary Comparison: Activity Rate
        4.2.6 Binary Comparison: Misplacement Rate
      4.3 Ternary Comparison
        4.3.1 Ternary Comparison: Strong Disagreement on prosodic breaks
        4.3.2 Ternary Comparison: Partial Consensus
        4.3.3 Ternary Comparison: Total Agreement
        4.3.4 Ternary Comparison: Global Disagreement on prosodic breaks
        4.3.5 Ternary Comparison: Consensus in the Disagreement
      4.4 Kappa coefficients
      4.5 Summarizing Table
      4.6 Discussion of results
    5 Conclusions
I Introduction

The C-ORAL-ROM multilingual resource provides a comparable set of corpora of spontaneous spoken language for the main Romance languages, namely French, Italian, Portuguese and Spanish. The resource is the result of the C-ORAL-ROM project, undertaken by a European consortium co-ordinated by the University of Florence and funded within the Fifth EU Framework Programme (C-ORAL-ROM, IST 2000-26228).

C-ORAL-ROM consists of 772 spoken texts and 123:27:35 hours of speech. Four comparable recording collections of Italian, French, Portuguese and Spanish spontaneous speech sessions (roughly 300,000 words for each language) have been delivered respectively by the following providers:

- University of Florence (LABLITA, Laboratorio linguistico del Dipartimento di italianistica);
- Université de Provence (DELIC, Description Linguistique Informatisée sur Corpus);
- Centro de Linguística da Universidade de Lisboa (CLUL);
- Universidad Autónoma de Madrid (Departamento de Lingüística, Laboratorio de Lingüística Informática). [1]

The main C-ORAL-ROM objective is to allow HLT (Human Language Technology) based on spoken language interfaces to face challenging language resources which represent spontaneous speech in real environments. The resource aims to represent the variety of speech acts performed in everyday language and to enable the induction of prosodic and syntactic structures in the four Romance languages, from a quantitative and qualitative point of view. More specifically, the representation of significant variations found in spontaneous speech performances in natural environments allows the use of C-ORAL-ROM for comparable spoken language modelling of the main Romance languages, both for linguistic studies and for language technology purposes. The resource can also be used for testing speech recognition tools in critical contexts.

The recording conditions and the acoustic quality of the sessions collected in C-ORAL-ROM are variable. The speech files of the acoustic database are defined on a quality scale (recording, volume, voice overlapping and noise) and are comparable with respect to it. The quality scale extends from the highest level of clarity of the voice signal to low levels of acoustic quality. The quality is gauged spectrographically and is always annotated in the metadata of each session (see §7 in Chapter II). [2]

In order to ensure a significant representation of the spontaneous speech universe, the corpus design of the four resources foresees recording in natural environments in a variety of different contexts. The contextual variation is controlled on the basis of a strictly defined set of parameters whose significance has been recognized in the linguistic tradition. As a consequence of this sampling strategy, the four language resources are comparable insofar as they fit the same corpus design scheme (see §1 in Chapter III).

Each recorded session is stored in wav files (Windows PCM, 22,050 Hz, 16 bit) and is delivered in a multimedia corpus with the following main annotations:

a. session metadata;
b. the orthographic transcription, in CHAT format [3], enriched by the tagging of terminal and non-terminal prosodic breaks, in .txt files;
c. the text-to-speech synchronization, based on the alignment to the acoustic source of each transcribed utterance, in .xml files.

This resource is stored on DVDs and is accompanied by the WinPitch Corpus speech software (© Pitch France) [4]. WinPitch Corpus allows the direct and simultaneous exploitation of the acoustic and textual information by loading the .xml alignment files, compiled in accordance with the C-ORAL-ROM alignment DTD.

Metadata are defined following an explicit set of rules and contain essential information regarding the speakers, the recording situation, the acoustic quality, the source and the content of each session; they ensure a clear identification of the various speech types documented in the resource (see §3.1 in Chapter III).

The corpora are orthographically transcribed in standard textual format (CHAT format; MacWhinney 1994) with the representation of the main dialogue characters, that is, speakers' turns, the occurring non-linguistic and paralinguistic events, prosodic breaks, and the segmentation of the speech flow into discrete speech events (see §3.2 in Chapter III).

In C-ORAL-ROM's implementation, the textual string is divided into utterances following the annotation of perceptively relevant prosodic breaks, which are discriminated in the speech flow through perceptive judgments [5] (see §3.3 in Chapter III).

The annotated transcripts are aligned to the acoustic counterpart through WinPitch Corpus. Segments deriving from the alignment are defined on independent layers, with automatic generation of the corresponding database. This multimedia storage ensures a natural and meaningful text/sound correspondence that may be considered one of the main added values of the resource. Each utterance is aligned to its acoustic counterpart, generating the database of all the utterances in the resource (roughly 134,000 in the multilingual corpus) [6] (see §3.6 in Chapter III).

Besides text-to-speech and speech-to-text alignment, WinPitch Corpus allows an easy and efficient acoustic analysis of speech, including real-time fundamental frequency tracking, spectrographic display, and re-synthesis after editing of prosodic parameters.

[1] Other partners in the C-ORAL-ROM Consortium: Pitch France, for the speech software; the European Language Distribution Agency, France (ELDA), for distribution of the resource in the industrial market sector; the Instituto Cervantes, Spain (IC), for dissemination and exploitation for language acquisition purposes. The Istituto Trentino di Cultura (ITC-irst), Italy, tested the resource in present multilingual speech recognition technologies.
[2] The C-ORAL-ROM databases are anonymous. All speech segments that might offend the user for decency reasons have been erased and substituted with a beep in the audio signal. Speakers authorized each provider to use the recorded data to all ends foreseen in the C-ORAL-ROM project, including publication and language technology applications. The authorization models are available at http://lablita.dit.unifi.it/coralrom/authorization_model. The use of radio and TV emissions has been authorized by the broadcasting companies, which also provided the raw data for the database. Acknowledgements of all companies are given in the Copyright section of the DVDs. The authorization databases have been checked by ELDA (European Language Resources Distribution Agency).
The multimedia C-ORAL-ROM resource is integrated with additional label files in various formats, which ensure a multitask exploitation of the resource:

o TXT files with the resource metadata in CHAT format
o XML files with the resource metadata in IMDI format
o Textual resource without alignment information in TXT files
o Textual resource with automatic Part of Speech (PoS) and lemma tagging of each form
o Textual resource in XML format according to the C-ORAL-ROM DTD

The consortium ensures maximum accuracy in the transcripts, which have been compiled by PhDs and PhD students in linguistics. The original transcripts have been revised by at least two transcribers. The orthographic accuracy of the transcripts has been checked automatically through a word spelling check and through automatic PoS tagging. The reliability of the prosodic tagging has been evaluated by an industrial user in the speech technology sector, external to the consortium, with a detailed analysis of the consensus reached for each speech style represented in each language corpus. The level of consensus, in terms of the kappa statistics index, is always over 0.8 (see the evaluation report in Appendix 4).

The level of accuracy of the automatic PoS tagging of each language resource has been evaluated by the providers and is reported in §3.7.5. The percentage of exact recognition ranges from 90% to 97% according to the language resource.

The format accuracy of both metadata and transcripts has been double-checked automatically through a conversion of the original label plain text files into XML files. More specifically, the metadata format has been validated through two conversion scripts:

o conversion into XML files according to the C-ORAL-ROM DTD
o conversion into XML according to the IMDI format

The distributor will provide additional quality assurance; see the VALREP.DOC file.

[3] http://childes.psy.cmu.edu/manuals/CHAT.pdf
[4] Minimal configuration required: Pentium III, 1 GHz, 256 MB RAM, Sound Blaster or compatible sound card, running under Windows 2000 or XP only. http://www.winpitch.com
[5] The level of inter-annotator agreement on prosodic tag assignment has been evaluated by an external institution (LOQUENDO); see Appendix 4.
[6] In the French resource terminal breaks are annotated, but the alignment mainly follows pauses in the speech flow.

II General Information

1. Contact person

Prof. Emanuela Cresti
Co-ordinator of the C-ORAL-ROM project
Italian Department, University of Florence
Piazza Savonarola, 1
50132 Firenze, Italy
phone: +39 055 5032486
fax: +39 055 503247
e-mail: elicresti@unifi.it
web: http://lablita.dit.unifi.it

2. Distribution media

The C-ORAL-ROM resource is distributed on DVD-5 media.

3. Content

The C-ORAL-ROM multilingual resource of spontaneous speech for Italian, French, Portuguese and Spanish comprises three components: a) Multimedia corpus; b) Speech software; c) Appendix. C-ORAL-ROM is delivered on 9 DVDs, with the following content:

o DVDs 1 to 8 contain the Multimedia corpus: respectively the Italian collection, the French collection, the Portuguese collection and the Spanish collection (two DVDs each).
o DVD 9 contains a set of Appendixes to the multimedia corpus (Speech Software, Textual resources in various formats, and additional documentation) that are delivered to allow a more efficient multitask exploitation of the resource.

4. Format of speech and label file

For each spontaneous speech recording session, the following is delivered into folders of the multimedia corpus:
1. Speech files: uncompressed files (Windows PCM: 22,050 Hz; 16 bit [7]) with ".wav" extension
2. Transcripts in CHAT format (http://childes.psy.cmu.edu/manuals/CHAT.pdf), enriched by the annotation of terminal and non-terminal prosodic breaks and the alignment information, in plain text files with ".txt" extension
3. The text-to-speech alignment files: XML files in WinPitch Corpus format with ".xml" extension (human-machine interaction files were not aligned, so no XML files are present in the corresponding folders)
4. DTD of the WinPitch Corpus alignment format (coralrom.dtd)

For each session, the following files are also delivered in the Appendix to allow a multitask exploitation of the resource:

5. The transcription of each session in CHAT format in plain text files (without the alignment information) with ".txt" extension
6. The C-ORAL-ROM transcription of each session in XML files with ".xml" extension
7. DTD of the C-ORAL-ROM textual format (coralrom.dtd)
8. Metadata in CHAT format (plain text files with "_META.txt" extension)
9. Metadata in IMDI format (XML files with "_META.imdi" extension)
10. The C-ORAL-ROM transcription of each session with Part of Speech annotation and lemma annotation for each form, in plain text files with "_PoS.txt" extension [10]

In addition, the following files are delivered for each language resource:

11. The tag set adopted, in plain text files (tagset_french.txt, tagset_italian.txt, tagset_portuguese.txt, tagset_spanish.txt)
12. Frequency lists of lemmas and frequency lists of forms, in plain text files
13. Measurements of the language values recorded in each text, in the Excel files "measurements_language.xls"
14. Line diagrams presenting the trend observed with regard to the standard text variation parameters along the corpus structure's nodes, in the Excel file "multi-lingual_graphics.xls"
15. A set of Excel files containing statistics on the metadata: a) metadata_session.xls: statistics regarding the completeness of session metadata in the four collections; b) participants_records.xls: statistics regarding the completeness of the main speaker metadata in the four collections; c) french_speakers.xls, italian_speakers.xls, portuguese_speakers.xls, spanish_speakers.xls: lists of the speakers recorded in each corpus, in anonymous form, with their main metadata including geographical origin, sex and age
16. A set of multimedia samples referred to in the resource documentation

Standard character set used for transcription and annotation: ISO-8859-1.

[7] The resource comprises a sub-corpus of private telephone conversations and human-machine interactions obtained through a phone call service. The human-machine interactions are sampled in Windows PCM files at 8,000 Hz, 16 bit. Private telephone conversations have been sampled at 22,050 Hz or at 8,000 Hz.
[10] The human-machine interaction txt files of the French and Spanish collections have not been PoS-tagged.
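Since items 1-3 above are delivered per session in the multimedia folders, the completeness of a session folder can be checked mechanically. The following is a minimal sketch under the naming scheme just described (the helper is ours, not a distributed tool); per the note on item 3, human-machine sessions carry no .xml file:

    import os

    def missing_files(folder, session_id, human_machine=False):
        """Return the expected per-session files (wav/txt/xml) that are absent."""
        expected = [session_id + ".wav", session_id + ".txt"]
        if not human_machine:              # human-machine sessions are not aligned
            expected.append(session_id + ".xml")
        return [f for f in expected
                if not os.path.exists(os.path.join(folder, f))]

    # e.g. missing_files("INFORMAL/family_private/dialogues", "ifamdl01")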
5. Layout of the disk-file system

DVDs 1 to 8 of the Multimedia corpus have the following structure:

Italian collection: DVDs CORALROM1II; CORALROM2IF
French collection: DVDs CORALROM3FI; CORALROM4FF
Portuguese collection: DVDs CORALROM5PI; CORALROM6PF
Spanish collection: DVDs CORALROM7SI; CORALROM8SF

Each language collection has the same folder structure, which mirrors the C-ORAL-ROM corpus design (no language reference is indicated in the folder structure; this information can be retrieved from the DVD name):

/<Category>/<Context>/<Domain>

where:

Category: INFORMAL or FORMAL

Context:
- if category = INFORMAL: family_private; public
- if category = FORMAL: natural_context; media; telephone

Domain:
- if category = INFORMAL: monologues; dialogues; conversations
- if category = FORMAL:
  - if context = natural_context: business; conference; law; political_debate; political_speech; preaching; professional_explanation; teaching
  - if context = media: interviews; meteorology; news; reportages; scientific_press; sport; talk_show
  - if context = telephone: private_conversations; human-machine (not applicable to the Portuguese corpus)

The following is the directory structure of each language database in the C-ORAL-ROM multimedia resource:

INFORMAL/
INFORMAL/family_private
INFORMAL/family_private/monologues
INFORMAL/family_private/dialogues
INFORMAL/family_private/conversations
INFORMAL/public
INFORMAL/public/monologues
INFORMAL/public/dialogues
INFORMAL/public/conversations
FORMAL/
FORMAL/natural_context
FORMAL/natural_context/political_speech
FORMAL/natural_context/political_debate
FORMAL/natural_context/preaching
FORMAL/natural_context/teaching
FORMAL/natural_context/professional_explanation
FORMAL/natural_context/conference
FORMAL/natural_context/business
FORMAL/natural_context/law
FORMAL/media
FORMAL/media/news
FORMAL/media/meteorology
FORMAL/media/interviews
FORMAL/media/reportage
FORMAL/media/scientific_press
FORMAL/media/sport
FORMAL/media/political_debate
FORMAL/media/talk_show
FORMAL/telephone
FORMAL/telephone/private_conversations
FORMAL/telephone/human-machine

DVD 9 (CORALROM_AP) contains a set of Appendixes to the multimedia corpus (Speech Software, Textual resources in various formats, and additional documentation). The content of CORALROM_AP is structured into folders as follows:

\: Measurements; Metadata; Textual Corpus; WinPitchCorpus; Utilities
\Metadata\: CHAT_Metadata; IMDI_Metadata
\Metadata\CHAT_Metadata\: French; Italian; Portuguese; Spanish
\Metadata\IMDI_Metadata\: French; Italian; Portuguese; Spanish
\Textual Corpus\: Frequency Lists; PoS-Tagging; txt; xml
\Textual Corpus\Frequency Lists\: French; Italian; Portuguese; Spanish
\Textual Corpus\PoS-Tagging\: French; Italian; Portuguese; Spanish; tag_sets
\Textual Corpus\PoS-Tagging\Italian\: human-machine_interaction
\Textual Corpus\txt\: French_txt; Italian_txt; Portuguese_txt; Spanish_txt
\Textual Corpus\xml\: French_xml; Italian_xml; Portuguese_xml; Spanish_xml
\Utilities\: Prosodic_breaks; Specifications
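Given this fixed layout, a simple script can take an inventory of a language DVD. The following is a minimal sketch (the mount path in the usage comment is hypothetical):

    import os
    from collections import Counter

    def count_sessions(dvd_root):
        """Count .wav files per /<Category>/<Context>/<Domain> folder."""
        counts = Counter()
        for folder, _dirs, files in os.walk(dvd_root):
            n = sum(1 for f in files if f.lower().endswith(".wav"))
            if n:
                rel = os.path.relpath(folder, dvd_root).replace(os.sep, "/")
                counts[rel] += n
        return counts

    # for path, n in sorted(count_sessions("/media/CORALROM1II").items()):
    #     print(path, n)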
In addition to the previous structures, the following directories are used to store those files which are not part of the database:

\ (root): COPYRIGH.TXT; README.TXT; DISK.ID
\DOC: documentation files
\INDEX: index files

The root directory contains the files:
README.TXT: ASCII text file containing a short description of the database and its files, together with a copyright statement.
DISK.ID: 11-character string with the volume name.

The following files are in \DOC:
DESIGN.DOC: the main documentation file.
VALREP.DOC: the validation report created by the validation centre.

The following file is in \INDEX:
CONTENTS.LST: a plain text file containing the list of the files on the relative DVD. This list shows each folder, each file and their size in bytes. At the end of the file list of each folder, the number of files placed in it and the size of the folder in bytes are shown.

6. Hardware, Software and Recording Platforms

6.1 Hardware

Computers:
Processor: Intel Pentium III 500 MHz or higher
RAM: 256 MB or more
Sound Card: Sound Blaster compatible with S/PDIF input connection
Hard Disk: IDE-ATAPI 10 GB or higher
Video Card: SVGA compatible

6.2 Recording Platforms and Software

ITALIAN
Recording platforms:
• DAT recorder TASCAM© DA-P1
• Analogue recorder Sony© TCD5M
Microphones:
Unidirectional radio microphones Sennheiser© MKE 40 (dialogues and monologues marked A); omnidirectional microphones Sennheiser© MKE2 or equivalent.
Software:
O.S.: Microsoft© Windows© 2000
Sound Editor: Syntrillium© Cool Edit 2000
Text Editor: Microsoft© Word© 2000
Text-to-Speech Alignment Editor: WinPitch Corpus – Pitch France©
PoS tagging: Pi-system CNR©

FRENCH
Recording platforms:
• DAT recorder TASCAM© DA-P1
• Analogue recorder Sony© TCD5M
Microphones:
Unidirectional radio microphones Sennheiser© MKE 40 (dialogues and monologues marked A); unidirectional microphones Sennheiser© MKE2 or equivalent.
Software:
O.S.: Microsoft© Windows© 2000 & XP; Linux
General: OpenOffice; MS Office
Speech software: various tools developed at Université de Provence; Transcriber (developed by DGA, freeware); WinPitch Corpus – Pitch France©; Syntrillium© Cool Edit 2000; MES (developed at Université de Provence)
Programming languages: Perl, C++
Morphosyntactic software: Cordial (tagger); multiple tools developed internally at Université de Provence
Concordancing: Contextes (developed internally at Université de Provence)

PORTUGUESE
Recording platforms:
• DAT recorder TASCAM DA-P1
• TEAC W-580R Double Auto Reverse Cassette Deck audio tape recorder
Microphones: AKG UHF PT40
Software:
O.S.: Microsoft© Windows© 2000 and XP
Sound Editor: Syntrillium© Cool Edit 2000
Text Editor: Microsoft© Word© 2000
Text-to-Speech Alignment Editor: WinPitch Corpus – Pitch France©
PoS tagging: Brill tagger©

SPANISH
Recording platforms:
• DAT recorder TASCAM© DA-P1
Microphones: unidirectional radio microphones E-Voice© MC-150 (for all sessions)
Software:
O.S.: Microsoft© Windows© 2000
Sound Editor: Creative© Wave Studio© v 4.11
Text Editor: Microsoft© Word© 2000
Text-to-Speech Alignment Editor: WinPitch Corpus – Pitch France©
PoS tagging: Grampal©

7. Number of recordings and general corpus data

Speech files of the recorded sessions and the corresponding transcription files are in one-to-one correspondence. [11] The following is the general table of the main values recorded in the C-ORAL-ROM multimedia corpus.
Language     wav files   GB     Duration   Utterances   Words    Speakers   Male   Female
French       206         3.77   26:21:43   21010        295803   305        154    150
Italian      204         5.19   36:16:10   40402        310969   451        276    175
Portuguese   152         4.43   29:43:42   38855        317916   261        144    117
Spanish      210         4.56   31:06:00   35588        333482   410        247    163

[11] The one-to-one correspondence is partial for sessions with more than 8 participants, which are split into more than one wav file in the multimedia collection. Each such session is reported in a wav file recording the whole session, which is the one counted in the table above; it is also split into several wav files (not counted in the number of sessions), following the convention filename_1, filename_2, etc., each containing no more than 8 participants. The transcription of a split session is delivered as one file only in the textual resource, while it is delivered split into more than one file in the multimedia collection.

8. Acoustic quality

C-ORAL-ROM is oriented towards the collection of corpora in natural environments, despite the fact that this necessarily lowers the acoustic quality of the resource. Moreover, C-ORAL-ROM has exploited, in the frame of a new multilingual work, the rich contents of the archives set up by the providers during years of research on spoken language; therefore the acoustic quality and the recording conditions of the resource are variable. The following are the requirements for the acoustic format and for the recording apparatus:

Format: mono wav files (Windows PCM); sampling frequency: 22,050 Hz, 16 bit (see footnote [7] above).

Recording and storing process for old analogue recordings: wav files (22,050 Hz, 16 bit) derived directly from the original analogue tapes through a standard sound card (Sound Blaster Live or compatible) with a professional sound editor (Cool Edit 2000®).

Recording and storing process for new recordings:
a) dialogues: stereo DAT or minidisk recording (44,100 Hz) with unidirectional microphones, converted into mono .wav files (Windows PCM, 22,050 Hz, 16 bit) via the S/PDIF port of a standard sound card (Sound Blaster Live or compatible) with a professional sound editor;
b) conversations with more than two participants: mono DAT or minidisk recording with cardioid or omnidirectional microphones, converted into mono .wav files via the S/PDIF port of a standard sound card (Sound Blaster Live or compatible) with a professional sound editor.

The speech files of the acoustic database are defined on a quality scale (recording, volume, voice overlapping and noise). The quality scale extends from the highest level of clarity of the voice signal to low levels of acoustic quality:

1) Digital recordings with DAT or minidisk apparatus and unidirectional microphones, or analogue recordings of very high quality.
2) Digital recordings with poorer microphone response, or analogue recordings with: good microphone response; low background noise; low percentage of overlapped utterances; F0 computing possible in most of the file.
3) Low-quality analogue recordings with: poor microphone response; background noise; average percentage of overlapped utterances; F0 computing possible in many parts of the file.

The quality is gauged spectrographically. Sessions in which F0 analysis is not significant are excluded from sampling. The acoustic quality of each recording and the most relevant data on the recording conditions are always recorded in the metadata of each text.
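For illustration, the target format of the conversion chain just described (stereo 44,100 Hz recordings downmixed and resampled to mono Windows PCM, 22,050 Hz, 16 bit) can be reproduced as in the following sketch, which uses Python's standard-library wave and audioop modules (audioop is deprecated in recent Python versions). This is an assumption-laden illustration of the format, not the procedure actually used by the providers, who worked with sound editors such as Cool Edit 2000:

    import wave, audioop

    def to_coralrom_format(src_path, dst_path):
        """Downmix a 16-bit stereo 44,100 Hz wav to mono 22,050 Hz Windows PCM."""
        with wave.open(src_path, "rb") as src:
            assert src.getsampwidth() == 2            # 16-bit samples expected
            frames = src.readframes(src.getnframes())
            rate, channels = src.getframerate(), src.getnchannels()
        if channels == 2:                             # average the two channels
            frames = audioop.tomono(frames, 2, 0.5, 0.5)
        frames, _state = audioop.ratecv(frames, 2, 1, rate, 22050, None)
        with wave.open(dst_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(22050)
            dst.writeframes(frames)

    # to_coralrom_format("session_dat.wav", "ifamdl01.wav")  # illustrative names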
III C-ORAL-ROM corpus

1. Corpus Design

1.2 Sampling Parameters

Spontaneous speech events are those communication events where the programming of speech is simultaneous to its execution by the speaker; i.e. the speech event is non-scripted or only partially scripted. The C-ORAL-ROM resource offers a representation of the spontaneous speech universe in the four main Romance languages (Italian, French, Portuguese and Spanish) with regard to the following main parameters, which define speech variation:

A. Communication context
B. Language register
C. Speaker

A. The communication context of a speech event is defined by the following features, each specified by a closed vocabulary [13]:

Channel: the means by which the signal transmission is achieved.
- Face-to-face communication: speech event among participants in the same unity of space and time, with reciprocal direct multi-modal perception and interaction;
- Broadcasting: unidirectional speech emission to an undefined audience by devices that ensure, at least, the perception of voice;
- Telephone: bi-directional speech event by means of telephone.

Structure of the communication event: role and nature of the participants in the speech event.
- Monologue: speech event with only one intervenient performing a main communication task [14];
- Dialogue: speech event with two intervenients;
- Conversation: speech event with more than two intervenients;
- Human-machine interaction: speech event between a human being and an electronic device;
- Non-natural format: other, i.e. the format of broadcasting emissions (undefined in this resource).

Social context: organization level of the society to which the speech event belongs.
- Family/private: speech event within the family or a private social context;
- Public: speech event within a public social context.

Domain of use: semantic field defined by the content of the speech event.

B. Language register
- Informal: un-scripted low variety of language, used for everyday interactive purposes;
- Formal: partially scripted, task-oriented high variety of language.

C. Speaker: the main qualities of the speaker that may influence speech production.
- Sex: sex of the speaker
- Age: speaker's age
- Education: speaker's schooling degree
- Occupation: speaker's employment field
- Geographical origin: speaker's place of origin

[13] See IMDI Metadata Elements for Session Descriptions, Version 3.0.3, at http://www.mpi.nl/IMDI/
[14] The monologue context fits exactly the definition of "only one intervenient" only in the formal use of language. In this case the social rules governing the interaction among participants may ensure the execution of the communication task by a sole speaker. On the contrary, in the everyday informal use of language, although only one speaker performs the main communication task, other participants may interact in the communication event with a low informative contribution.

1.3 Sampling strategy and Corpus design

The corpus design of the C-ORAL-ROM resource is the result of the application of two criteria:
- a sampling strategy, which defines how to apply the set of relevant parameters for the representation of the universe through recording sessions;
- a definition of the size of the resource and of each session.

Given the variation parameters in §1.2, the sampling strategy adopted for the representation of speech variability in the four C-ORAL-ROM corpora is a function of the following principles:
• Definition of the size of the resource. Due to the cost of spoken resources, a content of around 1,200,000 words (300,000 words for each language collection) was fixed.
• Sampling of the universe with reference to context variation and to language register variation, leaving speaker variation random.
• Distinction between formal speech (50%) and informal speech (50%), thus ensuring a sufficient representation of dialogic informal speech (the resource with the highest added value).
• Selection of different criteria for sampling the formal and the informal parts of the corpus.
• Definition of the text weight in terms of information units (words).
• Definition of a text weight that ensures both the possible appreciation of macro-textual properties and a sufficient representation of the universe in each 300,000-word corpus.
• Representation of a variety of possible recording situations within the range of perception and intelligibility of the human ear.
• Recording as part of the metadata of: a) speaker characteristics (gender, age, geographical region, education and occupation); b) the acoustic quality of the text.

In a multilingual collection of limited size, strong diatopic limits for each language must be established. C-ORAL-ROM does not represent in a systematic way the diatopic phonetic variation due to the geographical origin of the speakers. [15]

Given the relatively small size of the resource, the corpus design strategy concentrates on those variation parameters that, in principle, are more relevant to the documentation of speech variation, and tries to maximize the significance of the sampling with regard to the probability of occurrence of different types of speech acts and syntactic constructions. To this end, the widest possible range of contexts of use and language tasks is represented, so that the speech acts and modes of language construction most typical of those contexts are represented.

The use of the formal/informal parameter in the corpus design scheme allows the restriction of the number of significant parameters as regards context variation. More specifically, while it can be assumed that in western societies the formal use of language is applied in a closed series of typical domains, the same does not hold for the informal use of language. The list of possible domains of use for informal language is by definition open, and no domain can in principle be considered more typical than others.

Under this assumption, the identification of the main domains of use of formal language maximizes the probability of representing the significant variations in this language variety, and is therefore the best strategy. On the contrary, if significant variations of informal spontaneous speech are to be considered, the same strategy would cause a reduction of their probability of occurrence.

[15] Corpora are mainly collected in continental Portugal, central Castilian Spain, southern France and western Tuscany, and are intended to represent a possible accepted standard rather than all the varieties of pronunciation, which would require collections of inter-linguistic corpora with a wide diatopic variability. This limitation is quite severe for Italian, where local varieties may strongly diverge from the standard (De Mauro et al., 1993). See the statistics on the geographical origin of speakers below, and the detailed Excel tables in the Utilities directory of DVD 9.
Therefore, the definition of a finite list of typical domains of use is the main criterion applied in documenting the formal uses of the four Romance languages, while variations in dialogue structure are not controlled (the social context of use is generally the public one, and the most frequent dialogue structure is the monologue). On the contrary, the variations in social context of use and in dialogue structure are the parameters systematically adopted for the documentation of the informal part, while the choice of the specific semantic domain of use is left random.

The strategy regarding text weight also varies in significance between the formal and the informal use of language. The formal use of language in general features long textual structures, while in informal use the length of syntactic constructions is limited. Therefore, in order to ensure the probability of occurrence of typical structures, the text length for the formal sampling must be significantly longer.

The above variation parameters and sampling strategy are projected in the corpus design matrix presented in the following paragraph.

1.4 Comparability

By definition of spontaneous speech, comparability cannot be obtained through the use of parallel corpora. Each resource of the C-ORAL-ROM corpus is comparable with the others insofar as it satisfies the conditions on the corpus design stated in the following matrix, which reflects the variation parameters defined in §1.2:

Section INFORMAL [-Partially scripted]* — number of words MANDATORY [16]: 150,000
  Context (MANDATORY):
    Family/Private context [-Public; -Partially scripted]*: 124,500 words
    Public context [+Public; -Partially scripted] or [-Public; +Partially scripted]*: 25,500 words
  Domain (MANDATORY):
    Monologues: 48,000 words
    Dialogues/Conversations: 102,000 words

Section FORMAL* [+Public; +Partially scripted]* — number of words MANDATORY [16]: 150,000
  Context and Domain (MANDATORY):
    Formal in natural context — political speech; political debate; preaching; teaching; professional explanation; conference; business; law: 65,000 words
    Media — news; sport; interviews; science; meteo (weather forecast); scientific press; reportage; talk_show [17]: 60,000 words
    Telephone — private conversation; human-machine interactions [18]: 25,000 words

* Additional features used in C-ORAL-ROM for the classification of sessions when the definitions of the parameters in §1.2 turn out not to be sufficient.
[16] No upper limit; 5% variation allowed. This limit is not to be considered strictly mandatory in the case of the Formal in Natural Context sub-field.
[17] Talk shows may belong to various typologies, e.g. political debate, thematic discussions, culture, etc.
[18] 10,000 words. Multilingual service for train information, accomplished in the C-ORAL-ROM project by ITC-irst. Field not present in the Portuguese corpus.

TEXT LENGTH REQUIREMENTS

In the informal section:
- Short texts: at least 64 texts of around 1,500 words each. Up to 20% of this part may be constituted by texts of different lengths. [19]
- Long texts: from 8 to 10 texts of around 4,500 words each.

In the formal section, the text length is defined according to the following rules:
- For Formal in natural context: 2 or 3 samples for each domain, of 3,000 words on average.
- For Media: at least one short sample for weather forecasting and news; at least 2 or 3 samples for each of the other domains, which must be represented with at least 6,000 words; samples of 1,500 or 3,000 words on average.
- For Telephone: text length not defined (by preference a 1,500-word upper limit, no lower limit). The human-machine interactions domain should contain 10,000 words; this domain is not present in the Portuguese corpus.

[19] In order to allow a better exploitation of the archives existing prior to C-ORAL-ROM, the use of smaller samples, and of samples that come from the splitting of longer sessions, has been tolerated up to 20% of the informal part.

The term "word" refers to all graphic elements in the text surrounded by two spaces and corresponding to the orthographic transcription of speech. All signs in the textual files corresponding to metadata, the dialogue representation format and tagging elements are not counted as words.
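Since the matrix targets are mandatory within a 5% variation and with no upper limit (note [16]), checking a collection against the design reduces to a few comparisons. The following is a minimal sketch of such a check, not an official validation tool; the field keys are illustrative, and the sample figures are taken from the Italian column of the tolerance table below:

    # Word-count targets per corpus-design field (from the matrix above).
    TARGETS = {"informal/family_private": 124500,
               "informal/public": 25500,
               "formal/natural_context": 65000,
               "formal/media": 60000,
               "formal/telephone": 25000}

    def check_design(actual, tolerance=0.05):
        """Report fields whose word count falls below target - 5% (no upper limit)."""
        problems = []
        for field, target in TARGETS.items():
            count = actual.get(field, 0)
            if count < target * (1 - tolerance):
                problems.append(f"{field}: {count} < {target * (1 - tolerance):.0f}")
        return problems

    italian = {"informal/family_private": 128676, "informal/public": 26291,
               "formal/natural_context": 68328, "formal/media": 61759,
               "formal/telephone": 25915}
    print(check_design(italian) or "all fields within tolerance")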
Tolerance

The requirements in the corpus design matrix concerning section, context, domain structure and text length are mandatory. The target number of words requested for each field in the matrix has been approximated in each language collection as in the following table:

Section    Context           Domain                  Target   Italian  French   Spanish  Port.
INFORMAL                                             150000   154967   152385   168868   165436
           Family-private                            124500   128676   124886   131056   132887
                             Monologue               42000    45212    47702    42082    45937
                             Dialogue-Conversation   82500    83464    77184    88974    86950
           Public                                    25500    26291    27499    37812    32549
                             Monologue               6000     6050     6960     6116     7696
                             Dialogue-Conversation   19500    20241    20539    31696    24853
FORMAL                                               150000   156002   143418   164614   152480
           Natural context   [see the above list]    65000    68328    57319    72268    66140
           Media             [see the above list]    60000    61759    57143    62739    62018
           Telephone         [see the above list]    25000    25915    28956    29607    24322
TOTAL                                                300000   310969   295803   333482   317916

The corpus design and the sampling criteria of C-ORAL-ROM ensure:
- the production of corpora which offer, in principle, a representation of the wide variety of syntactic and prosodic features of spontaneous speech;
- the production of comparable multilingual corpora as a basis for the induction of linguistic generalizations regarding spontaneous speech in the four Romance languages.

2. Filename conventions

Filenames of the C-ORAL-ROM resources bear information of three types, in this order:
a) the represented language;
b) the text type, that is, the field and sub-field to which each text belongs in the corpus structure;
c) the serial number identifying each text within its sub-field.

The following are the conventions adopted in each C-ORAL-ROM language collection:

1) Language code: "f" (French), "i" (Italian), "p" (Portuguese), "e" (Spanish).

2) Text type.
Informal section:
  context: "fam" (family_private), "pub" (public)
  domain: "mn" (monologues), "dl" (dialogues), "cv" (conversations)
Formal section:
  context: "nat" (natural_context)
    domain: "ps" (political_speech), "pd" (political_debate), "pr" (preaching), "te" (teaching), "pe" (professional_explanation), "bu" (business), "co" (conferences), "la" (law)
  context: "med" (media)
    domain: "nw" (news), "mt" (meteorology), "in" (interviews), "rp" or "rt" (reportages), "sc" (scientific_press), "sp" (sport), "ts" or "st" (talk_show)
  context: "tel" (telephone)
    sub-field: "pv" or "ef" (private_conversations), "mm" (human-machine)

3) Text identification number: a two-digit serial number identifying the text progressively within its sub-field; e.g.:
efamdl01 (Spanish, family_private, dialogues, 01)
efammn02 (Spanish, family_private, monologues, 02)

4) Split-session convention: sessions with more than 8 participants cannot be aligned by WinPitch Corpus, which in the present version is limited to 8 layers (one layer per speaker). For this reason, in the multimedia corpus, although the metadata refer to the whole session, those sessions are stored in more than one alignment, sound and transcription file, in accordance with the following convention: an underscore sign ("_") and a digit (starting from "1") are added at the end of the filename; e.g.:
pnatpd01_1.xml, pnatpd01_2.xml, ...
pnatpd01_1.wav, pnatpd01_2.wav, ...
pnatpd01_1.txt, pnatpd01_2.txt, ...

In the French corpus every minor speaker exceeding the layer number limit has been aligned on the same layer and under the same participant short name ("ZZZ"), so the problem does not arise.

The split-file convention above regards the Multimedia Corpus only and does not apply to files in the Textual Corpus folder, where split files are delivered as one file only. The wav file of the whole session is also delivered in the folders of the multimedia corpus.
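The conventions above are regular enough to be decoded mechanically. The following is a minimal sketch of such a decoder (illustrative only; the function name and output fields are ours, and the code tables are copied from the conventions above):

    import re

    LANGS = {"f": "French", "i": "Italian", "p": "Portuguese", "e": "Spanish"}
    CONTEXTS = {"fam": "family_private", "pub": "public",
                "nat": "natural_context", "med": "media", "tel": "telephone"}

    FILENAME = re.compile(r"^([fipe])(fam|pub|nat|med|tel)([a-z]{2})(\d{2})(?:_(\d+))?$")

    def decode(name):
        """Split a C-ORAL-ROM filename (without extension) into its fields."""
        m = FILENAME.match(name)
        if not m:
            raise ValueError(f"not a C-ORAL-ROM filename: {name!r}")
        lang, ctx, domain, serial, part = m.groups()
        return {"language": LANGS[lang], "context": CONTEXTS[ctx],
                "domain_code": domain, "serial": int(serial),
                "split_part": int(part) if part else None}

    print(decode("efamdl01"))    # Spanish, family_private, domain code "dl", text 01
    print(decode("pnatpd01_2"))  # second split file of a Portuguese political debate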
3. Annotation information

For each recorded session, C-ORAL-ROM provides the following set of annotations:

1. Metadata
2. Transcription and dialogue representation
3. Prosodic tagging of terminal and non-terminal breaks
4. Alignment
5. Part of speech (PoS) and lemma tagging of each transcribed form

3.1 Meta-Data

For each session, an ordered set of metadata is recorded in four modalities: 1) as header lines in CHAT format before the orthographic transcription of each session; 2) as independent files in CHAT format; 3) in IMDI format; 4) within the XML file according to the C-ORAL-ROM DTD.

The following are the rules for marking the metadata of C-ORAL-ROM in CHAT format, from which the annotations in the other formats derive:

• Each metadata type is introduced by "@" immediately followed by a label, followed by ":" and an empty space.
• Metadata are listed in a closed set of types regarding: the session; its size; the speakers; the acoustic quality; the source; the person who can provide information on the session.
• Metadata fields or sub-fields that cannot be filled for lack of information are filled with an 'x' (capital or small).

Label and description:

@Title: One or two words in the object language that help to recognize the text.
@File: Filename without extension (the names of the audio file and of the text file differ only in the extension).
@Participants: Three capital letters identifying each speaker, followed by the corresponding proper name (first name), plus a sub-field with an ordered set of information on the speaker.
@Date: Date of the recording: day/month/year, separated by slashes, e.g. 20/06/2001. Unknown fields are filled with '00', e.g. 00/00/2001 or 00/00/00.
@Place: Name of the city where the recording session takes place.
@Situation: Ordered set of information: genre and role of the participants in the situation, environment, main actions performed, recording conditions, according to the rules specified below; e.g. gossip between friends at home during dinner, not hidden, researcher participant.
@Topic: The main argument dealt with in the speech event (max 50 characters); e.g. traffic problems.
@Source: Name of the collection leading to a copyright holder; e.g. LABLITA_CORPUS, CORPAIX.
@Class: The set of fields on the text class, in accordance with the C-ORAL-ROM corpus structure (separated by commas).
@Length: Length of the transcribed audio file in minutes (') and seconds ("); e.g. 12' 15".
@Words: Number of words in the text file.
@Acoustic_quality: The acoustic quality of the recording, in accordance with specific criteria (A, B or C).
@Transcriber: Name of the person responsible for the text, who can provide further information.
@Revisors: Names of the revisors.
@Comments: Transcriber's comments on the text.

Each metadata type is filled with PCdata (closed or open vocabulary) ending with "enter", and can be specified in accordance with the rules.
Each metadata type is filled with PCdata (closed or open vocabulary) ending with "enter", and can be specified in accordance with the rules below.

3.1.2 Rules for the Participant field
Ordered set of sub-fields for each participant (in parentheses, separated by a comma and an empty space). All sub-fields are mandatory:

Sex: Sex of the speaker. Closed vocabulary: Man (Male); Woman (Female); X (unknown).
Age: Age of the speaker. Closed vocabulary, one conventional capital letter for each age range in brackets: A (18-25); B (26-40); C (41-50); D (over 60); X (unknown).
Education: The level of education according to the schooling degree. Closed vocabulary, one number according to the degree in brackets: 1 (primary school or illiteracy); 2 (high school); 3 (graduated or university students); X (unknown).
Profession: Usual profession of the speaker. Open vocabulary: name of the profession (e.g. professor; secretary; student) or X (unknown).
Role: Role in the recorded event (even if it is equal to the profession). Open vocabulary: name of the role (e.g. father; professor) or X (unknown).
Geographical origin/linguistic influence: Name of the region from which the speaker originates. Open vocabulary (e.g. Ile de France; Castile) or X (unknown).

(A validation sketch for these closed vocabularies is given after § 3.1.4 below.)

3.1.3 Rules for the Class field
informal
Type: family/private; public
Sub-type: monologue; dialogue; conversation
formal in natural context
Sub-type: political speech; political debate; preaching; teaching; professional explanation; conference; business; law
Sub-sub-type (optional): monologue; dialogue; conversation
formal (through media)
Type: media
Sub-type: news; sport; interviews; meteo; scientific press; reportage; talk_show
Type: telephone
Sub-type: private conversation; human-machine interactions

3.1.4 Rules for the Situation field
The Situation field is a set of information, reported in a discursive manner, that helps to identify the context in which the language event takes place. The following are the guidelines used in C-ORAL-ROM to define the set of possible relevant information for the situation field:*

Genre: Information that helps to define the genre of the linguistic event. Open vocabulary (e.g. gossip; chat; quarrel; discussion; narration; claim, etc.). The neutral case is "talk". The information in the Class field (dialogue, conversation, etc.) is not repeated.
Reciprocal role: The reciprocal role of the participants. Open vocabulary (e.g. friends, colleagues, relatives, citizens).
Ambience: The kind of surroundings where the recording took place. Open vocabulary (e.g. in a silent studio; on the street; at home; in a shop; at school; in an office, etc.).
Action: Main action performed during the speech event (if any). Open vocabulary (e.g. while ironing, during depilation).
Recording conditions: Status of the recording with respect to the "observer paradox" in spontaneous speech resources. Closed vocabulary, a choice of one alternative in each of the following two sets: 1) hidden vs. not hidden; 2) participant researcher vs. observer researcher vs. researcher not present.

* For media corpora the situation field is filled with the name of the program.
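The closed vocabularies of § 3.1.2 lend themselves to automatic checking. The following is a minimal validation sketch (Python); whether the sex value is written out, as in the vocabulary listed above, or abbreviated in the delivered files is an assumption that should be checked, and the lower-case 'x' allowed by the general rules is not handled here.

SEX = {"Man", "Woman", "X"}        # closed vocabulary of 3.1.2
AGE = {"A", "B", "C", "D", "X"}    # A (18-25); B (26-40); C (41-50); D (over 60)
EDUCATION = {"1", "2", "3", "X"}   # 1 primary/illiteracy; 2 high school; 3 university

def check_participant(fields):
    """Check one participant record, given as the ordered sub-fields of
    3.1.2: (sex, age, education, profession, role, origin). Returns the
    list of problems found; open-vocabulary fields are only checked for
    being non-empty (they may be 'X' for unknown)."""
    sex, age, education, profession, role, origin = fields
    problems = []
    if sex not in SEX:
        problems.append("sex %r not in closed vocabulary" % sex)
    if age not in AGE:
        problems.append("age %r not in closed vocabulary" % age)
    if education not in EDUCATION:
        problems.append("education %r not in closed vocabulary" % education)
    for name, value in (("profession", profession), ("role", role), ("origin", origin)):
        if not value:
            problems.append("%s is empty; expected a value or 'X'" % name)
    return problems

print(check_participant(("Woman", "B", "3", "professor", "mother", "Tuscany")))  # []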
3.1.5 Rules for marking Acoustic quality
Texts in the collections are MANDATORILY labeled with respect to the acoustic quality of the sound source:*

Label  Properties
A      Digital recordings with DAT or minidisk apparatus and unidirectional microphones, or analogue recordings of very high quality
B      Digital recordings with poorer microphone response, or analogue recordings with:
       • good microphone response;
       • low background noise;
       • low percentage of overlapped utterances;
       • F0 computing possible in most of the file
C      Low quality analogue recordings with:
       • mediocre microphone response;
       • background noise;
       • average percentage of overlapped utterances;
       • F0 computing possible in many parts of the file

* Sessions in which F0 analysis is not significant are labeled D and excluded from sampling.

3.1.6 Quality assurance on metadata format
The format correctness of metadata has been double-checked automatically through conversion of the original labeled .txt files into XML files. More specifically, the metadata format has been validated through two conversion scripts:
o conversion into XML files according to the C-ORAL-ROM DTD (see § 4 below);
o conversion into XML according to the IMDI format (see C-ORAL-ROM metadata in IMDI format at http://lablita.dit.unifi.it/coralrom/imdi/).

3.1.7 Statistics from C-ORAL-ROM metadata
A set of statistics on the C-ORAL-ROM metadata is reported in Excel files on DVD 9; the following are the main figures regarding speaker metadata and session metadata.

3.1.7.1 Number of speakers
Given that each database is substantially anonymous, the number of speakers is estimated on the metadata set. More specifically, a speaker is identified in each collection by the identity of the short name, together with the long name, sex, age and geographical origin, when these fields are filled with positive data.20 The number of speakers is given below, in total and separately for each main field in the corpus design.21

20 This restriction is intended not to overestimate the number of different speakers, given that some information about the same speaker might not be available to every transcriber.
21 The number of different speakers in human-machine interactions cannot be computed.

ITALIAN     Total speakers: 451 (Informal: 209; Formal in natural context: 57; Media: 173; Telephone: 27)
FRENCH      Total speakers: 305 (Informal: 164; Formal in natural context: 51; Media: 75; Telephone: 18)
PORTUGUESE  Total speakers: 261 (Informal: 106; Formal in natural context: 78; Media: 64; Telephone: 28)
SPANISH     Total speakers: 410 (Informal: 164; Formal in natural context: 51; Media: 184; Telephone: 20)

3.1.7.2 Distribution of speakers per geographical origin
The following is the distribution of speakers per geographical origin according to the metadata. The high number of unknown speakers is mainly due to the media collections, where this information is not available (see the distribution of absent speaker metadata below).

FRENCH                           speakers
Provence and South of France     103
Poitiers and West France         28
Paris                            26
Other regions                    17
Centre of France                 12
Other countries                  5
Other Francophone countries      2
Unknown                          112
Total                            305
ITALIAN                          speakers
Tuscany                          188
South and Isles                  38
Other countries                  20
Central Italy                    16
North (various regions)          14
Unknown                          175
Total                            451

PORTUGUESE                       speakers
Lisbon and Center Portugal       77
North Portugal                   19
South Portugal                   16
Açores and Madeira               10
Overseas                         8
Other regions                    7
Other countries                  5
Unknown                          119
Total                            261

SPANISH                          speakers
Madrid and Castile               188
South America                    19
Andalusia                        11
Extremadura                      11
Other regions                    11
Catalonia                        7
Other countries                  2
Unknown                          161
Total                            410

3.1.7.3 Completeness of speaker features in the metadata records
The following statistics regard the completeness of the information on the speakers reported in the metadata records of each session. Statistics refer only to the main features, i.e. sex, age, education and geographical origin, which may be significant for the exploitation of the resource.22 The percentage of records in which the information is complete is given for each language corpus and for each main field of the corpus design,23 together with the percentage of records in which each main feature is unknown.24

French                      TOTAL    INFORMAL  NAT. CONTEXT  MEDIA    TEL. PRIVATE
RECORDS                     456      243       64            91       58
SEX-AGE-ORIGIN-EDUCATION    57,89%   75,72%    48,44%        9,89%    68,97%
NO SEX                      0,00%    0,00%     0,00%         0,00%    0,00%
NO AGE                      24,56%   13,17%    37,50%        42,86%   29,31%
NO ORIGIN                   36,40%   16,87%    48,44%        84,62%   29,31%
NO EDUCATION                24,78%   10,29%    18,75%        63,74%   31,03%

Italian                     TOTAL    INFORMAL  NAT. CONTEXT  MEDIA    TEL. PRIVATE
RECORDS                     596      273       70            214      39
SEX-AGE-ORIGIN-EDUCATION    46,14%   75,09%    40,00%        5,14%    79,49%
NO SEX                      0,00%    0,00%     0,00%         0,00%    0,00%
NO AGE                      24,50%   12,09%    30,00%        42,99%   0,00%
NO ORIGIN                   37,58%   3,66%     41,43%        84,58%   10,26%
NO EDUCATION                36,24%   20,51%    20,00%        66,36%   10,26%

Portuguese                  TOTAL    INFORMAL  NAT. CONTEXT  MEDIA    TEL. PRIVATE
RECORDS                     437      188       103           109      37
SEX-AGE-ORIGIN-EDUCATION    54,92%   97,34%    16,50%        2,75%    100,00%
NO SEX                      0,00%    0,00%     0,00%         0,00%    0,00%
NO AGE                      40,96%   0,00%     75,73%        92,66%   0,00%
NO ORIGIN                   43,48%   2,66%     76,70%        97,25%   0,00%
NO EDUCATION                34,78%   0,00%     55,34%        87,16%   0,00%

Spanish                     TOTAL    INFORMAL  NAT. CONTEXT  MEDIA    TEL. PRIVATE
RECORDS                     553      204       58            269      22
SEX-AGE-ORIGIN-EDUCATION    51,36%   100,00%   65,52%        7,43%    100,00%
NO SEX                      0,00%    0,00%     0,00%         0,00%    0,00%
NO AGE                      36,71%   0,00%     5,17%         74,35%   0,00%
NO ORIGIN                   44,85%   0,00%     34,48%        84,76%   0,00%
NO EDUCATION                0,18%    0,00%     0,00%         0,37%    0,00%

22 Profession and role fields are considered additional information and are not the object of this statistical evaluation.
23 The number of records is higher than the number of speakers, given that the same speaker may appear in different sessions with more or less metadata information available.
24 This information is never available for speakers of human-machine interactions and for speakers of mixed turns, which therefore have not been counted.

The statistics show that only "sex" is always filled. However, the relevant information regarding the speakers is complete in all or most of the Informal and Telephone sub-corpora, which are the most significant contexts for this type of information. It must be considered that the information in question is usually not available for media sessions and only occasionally available in formal in natural context sessions.

3.1.7.4 Completeness of session metadata
All session metadata are filled with real information. The "Place" information is absent in only 5 sessions and in human-machine interactions, where the Place is necessarily unknown.
The statistics below record the number of sessions whose metadata are filled with empty information (x). The Title, Name, Date, Topic, Situation and Acoustic quality fields are never empty in any session; only the Place field is ever empty, as follows:

Sessions with empty Place     French   Italian   Portuguese   Spanish
INFORMAL                      1        0         0            0
NATURAL CONTEXT               0        0         1            0
MEDIA                         3        0         0            0
TEL. PRIVATE                  0        0         0            0
TEL. HUMAN-MACHINE            42       51        0            41
TOTAL                         46       51        1            41

3.2. Transcription and Dialogue representation
The C-ORAL-ROM dialogue representation is defined as an implementation of the CHAT architecture (MacWhinney, 1994) (http://childes.psy.cmu.edu/manuals/CHAT.pdf) and has the following structure:
1. Text lines: orthographic transcription of the speech information, divided:
a) vertically, in dialogic turns (introduced by a speaker label);
b) horizontally, by prosodic parsing and utterance limits, representing terminal and non-terminal prosodic breaks of the speech continuum.
2. Dependent tiers: contextual information.

3.2.1 Basic concepts for dialogue representation

Concept         Definition
Dialogic Turn   Continuous set of speech events by only one speaker's voice. The dialogic turn changes if, and only if, a speech event by another speaker occurs.
Session         Set of dialogic turns corresponding to one meta-data set.
Utterance       The minimal speech event by a single speaker such that it can be pragmatically interpreted as a speech act.25
Word            A speech event perceived as a phonetic unit, such that it conveys a meaning.

25 See the operative definition of Speech Act in § 3.4.3 and references therein.

3.2.2. Turn representation
A dialogic turn by one speaker is expressed by "*" immediately followed by three capital letters identifying the speaker in the metadata, then followed by ":" and one space before the transcription of the speech event. Each dialogic turn ends with an "enter".

Convention26            Description
^\*[A-ZÑ]{3}:\s{1}      Dialogic turn of a given speaker

26 Expressed with regular expression notation.

3.2.3 Utterance representation
Each utterance is represented by a series of transcribed words ending with the termination symbol "//" or other symbols having a termination value (see below). E.g.:
*MOR: I'm going home // I'm tired //
*PIL: bye bye //
A dialogic turn can also be filled with non-linguistic or paralinguistic material, according to the transcription conventions (see below):27
*MAX: are you sure ?
*PIL: hhh

27 In this case the dialogic turn is filled by a communication event instead of a speech event. The relation between the two concepts is left undefined in C-ORAL-ROM; this resource marks the main communication events in a dialogue, but is not specifically devoted to the study of such events, which must be considered in a multimodal framework.

3.2.4. Word representation
Each word is transcribed as a continuous sequence of characters between two empty spaces, in accordance with the orthographic conventions of each language.

(A sketch of a turn reader based on the convention in § 3.2.2 is given below.)
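Since the turn convention is given as a regular expression, a reader for the text lines of a session can be sketched directly from it (Python; the silent skipping of dependent "%" lines and "@" headers is an assumption for the example):

import re

# '^\*[A-ZÑ]{3}:\s' is the turn convention given in 3.2.2; dependent tiers
# start with '%' and metadata headers with '@', so they simply do not match.
TURN = re.compile(r"^\*([A-ZÑ]{3}):\s(.*)$")

def read_turns(lines):
    """Yield (speaker, text) pairs from the text lines of a session."""
    for line in lines:
        m = TURN.match(line)
        if m:
            yield m.group(1), m.group(2)

sample = [
    "*MOR: I'm going home // I'm tired //",
    "*PIL: bye bye //",
    "%com: an example dependent line (ignored here)",
]
for speaker, text in read_turns(sample):
    print(speaker, "->", text)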
3.2.5 Transcription
1. The transcription of a dialogic turn expresses, horizontally, the sequence of speech events (utterances) that occur in each dialogic turn; i.e., in principle, no "enter" can occur within an utterance. This may not be the case in overlapping and other cross-over dialogue phenomena (see below).
2. The transcription follows the standard orthography of every language in the C-ORAL-ROM resource and is integrated with special signs devoted to handling spoken language phenomena. No phonetic transcription is scheduled. In the absence of any previous orthographic tradition, non-standard regional expressions are normalized. Orthographic choices made in each language collection are reported in Appendix 3.

Concept   Definition
Text      Each recorded session is an ordered collection of transcribed dialogic turns referring to a given metadata set.

3.2.6 Overlapping
The speech of one speaker in spontaneous dialogues is frequently overlapped by the speech of another speaker, who may insert his dialogic turn in the first speaker's turn. The overlapping therefore determines a relation of temporal equivalence between two or more speech portions in different dialogic turns. In C-ORAL-ROM, overlapping is represented through the conventions reported below. The overlapped text in both dialogic turns is placed between brackets < >. Moreover, the sign [<] may optionally appear at the beginning of the second turn, immediately before the overlapped text, to mark that an overlapping relation holds between the first two pieces of text between brackets, e.g.:
*ABA: Maria mi ha detto che [/] <che non viene> //
      [Maria told me that / <that she's not coming>]
*BAB: [<] <che non viene> //
In C-ORAL-ROM, overlapping is marked only when at least two words in two different turns are concerned; this means that the overlapping of syllables is left unmarked or reported to word boundaries. When, due to the simultaneous occurrence of more than one dialogic turn, it is impossible to attribute the speech to a speaker, a fixed variable is used to mark a mixed turn, e.g.:
*XYZ: chi è va bene //28
      [who is it all right]

28 In these cases the transcription may not be reliable for perceptual reasons.

OVERLAPPING
Symbol   Description
< >      Brackets which mark the beginning and the end of the overlapped text of a given speaker
[<]      This symbol specifies the overlapping relation between two bracketed textual strings belonging to two speakers
*XYZ:    Turn of overlapped speech by non-identified speakers

(A sketch for extracting overlapped spans is given below.)
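For illustration, the overlap marks can be recovered mechanically from a turn's text. The following is a minimal sketch (Python), assuming overlap spans never nest; the function name is illustrative:

import re

OVERLAP = re.compile(r"<([^<>]+)>")   # text between overlap brackets

def overlapped_spans(turn_text):
    """Return the overlapped stretches of a turn, marked between '<' and '>'.

    A leading '[<]' (optional in the format) signals that the first span
    overlaps the bracketed span of the previous turn.
    """
    links_previous = turn_text.lstrip().startswith("[<]")
    return links_previous, [m.group(1).strip() for m in OVERLAP.finditer(turn_text)]

print(overlapped_spans("Maria mi ha detto che [/] <che non viene> //"))
# (False, ['che non viene'])
print(overlapped_spans("[<] <che non viene> //"))
# (True, ['che non viene'])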
3.2.7 Cross-over dialogue convention
In spontaneous spoken language, the intersection of dialogic turns by different speakers occurs frequently; this means that a dialogic turn may arise before the end of the turn that immediately precedes it. In those cases, the representation of dialogue as a vertically ordered collection of dialogic turns can be maintained only with difficulty. Because the C-ORAL-ROM format forces the sequence of turns to be represented in temporal order, a cross-over dialogue convention has been adopted.

Cross-over dialogue convention
In the C-ORAL-ROM transcripts, a slash placed at the beginning of a turn, i.e. immediately after the turn label and before the transcribed text, is a convention expressing that the turn in question is only virtual, while the linguistic information of the turn belongs to the preceding turn of the same speaker.29

29 Each slash immediately following the speaker mark is not counted as a prosodic break.

Configuration of symbols: ": /"
Operational definition: relation that converts the turn in which it appears into a linear sequence which includes the preceding turn of the same speaker.

Three major cases of cross-over dialogue have been detected in C-ORAL-ROM: 1) overlapping; 2) interruption by the listener; 3) complete intersection of turns.

3.2.7.1 Overlapping and cross-over dialogue
Because overlapping is a relation between texts belonging to different turns, it affects the dialogue representation, which expresses the time dimension both in vertical and horizontal relations. In principle, the dialogue representation system requires the linguistic information which follows vertically in a subsequent turn to be also necessarily subsequent in time to the text reported in the previous turn. However, this cannot be the case in most overlapped sequences, where a dialogic turn continues despite the insertion of another turn that partially overlaps it, e.g.:
*ABA: Maria mi ha detto che [/] <che non viene> più / al concerto // perché non si sente //
*BAB: [<] <viene> //
The above representation of the cross-over dialogue phenomenon, which is possible in the traditional CHAT format, has been abandoned in the C-ORAL-ROM format, for two reasons:
a) practical reasons concerning the mapping of the textual format onto the text-to-speech alignment format (where only one vertical temporal order is assumed);
b) the above convention does not generalize to the other cases of cross-over dialogue phenomena (see below).
In this system, when the overlapped text in the upper turn continues after the overlapping, the cross-over dialogue convention has been applied. It must be noted that the convention has been applied in C-ORAL-ROM in two alternative ways:
a) by transcribing the overlapped turn until the first terminal break:
*ABA: Maria mi ha detto che [/] <che non viene> più / al concerto //
*BAB: [<] <viene> //
*ABA: / perché non si sente //
b) by interrupting the transcription of the overlapped turn when the overlapping ends:
*ABA: Maria mi ha detto che [/] <che non viene>
*BAB: [<] <viene> //
*ABA: / più / al concerto // perché non si sente //
Only the former alternative compels the system to assume the generalization that each formal turn ends with a terminal break, and allows the alignment of each utterance; the other ensures a frame which is more consistent with the representation of cross-over dialogue (intersection) in the paragraph below.

3.2.7.2 Intersection of turns
Some cases of complete intersection of dialogic turns may occur, without interruption or overlapping; e.g. in the following example, the listener, without really interrupting the speaker, starts a brief dialogic turn while the speaker goes on with his prosodic program:30
*MAX: in linea di principio / voi dovreste /
*ELA: ah //
*MAX: / seguire le regole di trascrizione //

30 In this case the utterance-based alignment cannot be maintained.

3.2.7.3 Interruption and cross-over dialogue
Even if overlapping is absent, or reduced to a few milliseconds, the speaker may interrupt his utterance in connection with the intervention of the listener.
For example, in the following situation, a speaker inserts himself in a dialogic turn, interrupting it, but the other speaker goes on with his turn despite the interruption:
*MAX: in linea di principio / voi dovreste +
*ELA: ah //
*MAX: / seguire le regole di trascrizione //

3.2.8 Transcription conventions for segmental features

3.2.8.1 Non-understandable words
All words that are not properly understood are reported (and counted as word occurrences in a frequency list) as: xxx

3.2.8.2 Paralinguistic elements
All paralinguistic elements (laughing, crying, etc.) are not counted as word occurrences in a frequency list and are indicated as: hhh
It is possible to detail what the element is in a dependent tier (see below).

3.2.8.3 Fragments
All incomplete words and/or phonetic fragments are immediately preceded by &, as in the following incomplete utterance:31
*MAX: mio &cug
      [my &cous]
or in the following retracting:
*MAX: mio &cug [/] mio fratello non nuota //
      [my &cous [/] my brother doesn't swim]
or in the following lengthening of the programming time:32
*MOR: credo che si chiami / &eh / Giovanni //
      [I think they call him / &eh / Giovanni]

31 Incomplete words are never subject to rebuilding, except, of course, for what regards systematic phonetic phenomena (elision, breaking off of the last syllable, etc.). Those phenomena are mirrored (or not) in the transcription, following the orthography of each language and the particular traditions in editing oral texts. Such choices are detailed in the notes to the corpus edition.
32 Note that the lengthening of syllables, which is quite a common and perceptively relevant phenomenon in spoken language, is not marked in this system. However, following the philosophy of marking prosodic breaks, the system assumes the generalization that lengthening necessarily causes a prosodic break.

3.2.8.4 Interjections
Interjections are not fragments; they are phonetic elements with a dialogical function. Interjections are transcribed following the lexicographical tradition of each romance language. New interjections discovered in the corpus are transcribed tentatively and their presence is reported in a glossary added to the corpus edition.

3.2.8.5 Non-standard words
Non-standard words found in the corpus are transcribed tentatively and their presence is reported in a glossary added to the corpus edition.

3.2.8.6 Non-transcribed words
When a word must be cancelled for reasons concerning privacy or decency, it is substituted by a variable, "yyy", to be counted as a word:33
*MOR: il dottor yyy è un cretino // è proprio un bello yyy //
      [doctor yyy is an idiot // he's a right yyy //]

33 When a word is not transcribed in the text, it is substituted with a beep of similar length in the acoustic signal.

3.2.8.7 Non-transcribed audio signal
When, for whatever reason, part of the audio cannot be transcribed, a single variable "yyyy" is inserted in the transcripts, regardless of the length of the signal. Said variable may be subject to alignment, but will not be counted as a word.34

34 Music or advert fragments in media may not be transcribed, and therefore be substituted by the variable (which may be aligned or not). In the case of very long music or advert fragments in a media corpus, the fragment can be cut and the cut noted in a dependent line.

Symbol   Description
&        Mark for speech fragments
hhh      Paralinguistic or non-linguistic element
xxx      Non-understandable word
yyy      Non-transcribed word
yyyy     Non-transcribed audio signal

(A word-counting sketch based on these conventions is given below.)
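Taken together, the symbols above determine what is and is not counted as a word: "xxx" and "yyy" count, "hhh" and "yyyy" do not, and the signs of the dialogue representation are never words. A minimal counting sketch follows (Python; whitespace tokenization and the counting of "&" fragments as words are assumptions, since this section does not settle the latter point):

# Signs of the dialogue representation and prosodic tagging: not words.
NON_WORDS = {"//", "/", "?", "...", "…", "+", "#", "[/]", "[//]", "[///]",
             "[<]", "hhh", "yyyy"}

def count_words(turn_text):
    """Count word occurrences in a turn per the transcription conventions:
    'xxx' and 'yyy' count as words, 'hhh' and 'yyyy' do not. Overlap
    brackets '<' '>' are stripped from the tokens they surround.
    Assumption: fragments marked with '&' are counted."""
    count = 0
    for token in turn_text.split():
        token = token.strip("<>")
        if token and token not in NON_WORDS:
            count += 1
    return count

print(count_words("il dottor yyy è un cretino //"))          # 6
print(count_words("mio &cug [/] mio fratello non nuota //"))  # 6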
3.2.9 Transcription conventions for human-machine interaction
Speech recognition systems fail to recognize a fair number of words. In order to insert 10,000 words of human-machine interactions in each C-ORAL-ROM corpus, the transcripts of the callers' real speech are provided. The following are the basic requirements for the transcription of human-machine interaction in accordance with the format. An alignment between real speech and the recognition data is provided:
i. Main line
*MAC: what the machine said, with the prosodic annotation
Dependent line (immediately below the *MAC line)
%alt: the synthesized text, i.e. the input text file from ITC-IRST
ii. Main line
*WOM or *MAN (if the speaker is a woman or a man): the standard C-ORAL-ROM transcription of what the speaker really said
Dependent line (immediately below the *MAN or *WOM line)
%alt: what the speech recognition system recognized, that is, what is in the ITC-IRST text file
The following are general requirements:
• In the metadata for speaker's parameters, in the field dedicated to the machine's artificial voice, the value "Woman" or the value "x" is reported.
• In the @Comments headers, the name of the original file in the ITC-IRST collection is reported.
• An alignment of human-machine interactions is not scheduled.

3.2.10 Quality assurance on format and orthographic transcription
The format correctness of both metadata and transcripts has been checked automatically through conversion of the original labeled .txt files into C-ORAL-ROM .xml files, through the script detailed in § 4.1. The consortium ensures maximum accuracy in the transcripts, which have been compiled by PhDs and PhD students in linguistics. The original transcripts have been revised by at least two transcribers. The orthographic correctness of the transcripts has been checked automatically through word spell checking and through the automatic PoS tagging procedure of each corpus, which highlights all forms found in the corpora that are not consistent with the dictionary. The orthographic conventions adopted in each language corpus according to the language tradition, and the non-standard forms, have been registered in Appendix 3 of these specifications.

3.3. Prosodic annotation scheme

3.3.1 Principles
C-ORAL-ROM's prosodic tagging is informed by the following principles:
• The prosodic tagging specifies each perceptively relevant prosodic break in the speech continuum.
• All positions between two words are considered possible positions to be fitted with a prosodic tag. No within-word prosodic breaks are marked in C-ORAL-ROM.
• Prosodic breaks are distinguished in accordance with two main qualities: terminal vs. non-terminal.
• Each between-words position necessarily has one of the following values with respect to the prosodic tagging of the resource:
o no break
o terminal break
o non-terminal break
• Prosodic breaks are always tagged and reported according to perceptual judgments of the transcribers, within the process of corpus revision and transcription accuracy.
• Prosodic tagging is part of the transcription and is reported within the text lines.
• The criterion for the segmentation of the speech flow into utterances is prosodic: each prosodic break qualified as terminal defines an utterance limit in the speech flow.
3.3.2 Concepts

Concept                         Definition
Prosodic break                  Perceptively relevant prosodic variation in the speech continuum such that it causes the parsing of the continuum into discrete prosodic units.
Terminal prosodic break         Given a sequence of one or more prosodic units, a prosodic break is terminal if a competent speaker assigns to it the quality of concluding that sequence.
Non-terminal prosodic break     Given a sequence of one or more prosodic units, a prosodic break is non-terminal if a competent speaker assigns to it the quality of being non-conclusive.
Prosodic pattern (Utterance)    Each sequence of prosodic units (≥ 1) ending with a terminal prosodic break.35

35 Heuristic definition of utterance in C-ORAL-ROM.

3.3.3 Theoretical background
At the theoretical level, it has been noted that perception is highly sensitive to voluntary F0 variation ('t Hart et al., 1990). In accordance with this theoretical framework, the melodic pattern which scans the speech flow is an object of perception. Each tone unit of a prosodic pattern corresponds to a perceptually relevant pitch movement. A prosodic pattern may be simple (composed of a single tone unit) or complex (in which case it is made up of two or more tone units melodically linked together). From another point of view, according to the speech act theory tradition, every utterance in spoken language is the voluntary accomplishment of a speech act (Austin, 1962). The background theory of the transcription format (Cresti, 1994, 2000) links the two properties: voluntary F0 variations do not simply scan the utterance, but rather express the information necessary to the accomplishment of speech acts. For this reason, the selection of textual units corresponding to an utterance can be based on prosodic properties. More specifically, it is possible to identify an utterance each time prosody enables the perception of the completion of a speech act; i.e., intonation permits the pragmatic interpretation of the text (Illocutionary criterion, Cresti, 1994, 2000). In the transcription format, the identification of utterances in the sound continuum is linked to the detection of perceptively relevant F0 movements with a terminal value. It is assumed that there is no such thing as an utterance without a profile of terminal intonation (Karcevsky, 1931; Crystal, 1975). Non-terminal tone units correspond to the scanning of an utterance by means of a complex pattern. In other words, the systematic correlation between terminal breaks and utterance limits is the heuristic method for the segmentation of speech into utterances, that is, the segmentation of the linguistic information in the resource with respect to the specific unit of analysis of spontaneous speech (see Miller & Weinert, 1998; Quirk et alii, 1985; Biber et alii, 1999; Cresti, 2000).

3.3.4 Conventions for prosodic tagging in the transcripts: types of prosodic breaks
Discriminating between terminal and non-terminal breaks is mandatory in all the C-ORAL-ROM transcripts. However, the C-ORAL-ROM format allows the prosodic tagging to be displayed at two hierarchical levels, with greater or lesser attention to:
• the annotation of the types of terminal breaks;
• fragmentation phenomena in the speech performance.

3.3.4.1 Terminal breaks (utterance limit)
A signal is inserted in the transcription each time a prosodic break is perceived as terminal by a competent speaker.
Each terminal break indicates the prosodic completion of the utterance. At poor levels of transcription, interrogatives and intentionally suspended utterances are marked with the generic terminal break "//", with no supplementary specification. At richer levels of transcription, specifications are optionally added by distinguishing different types of terminal breaks, in accordance with the following closed list of values of the utterance in question:

Value                               Description                                                                              Symbol
all possible illocutionary values   Concluding prosodic break                                                                //
interrogatives                      Concluding prosodic break such that the utterance has an interrogative value             ?
intentional suspensions             Concluding prosodic break such that the utterance is left intentionally suspended
                                    by the speaker                                                                           …

3.3.4.2 Non-terminal breaks
The symbol "/" (single slash) is inserted in the transcription to mark the internal prosodic parsing of a textual string which ends with a terminal break; it is inserted in the position where a prosodic break that is not perceived as terminal is detected in the speech flow by a competent speaker.

Value          Description                      Symbol
Non-terminal   Non-conclusive prosodic break    /

3.3.5 Fragmentation phenomena
The annotation scheme embodies the generalization that a prosodic break always occurs when a fragmentation of the linguistic information arises in the speech performance; that is, in spontaneous speech, fragmentation always breaks the prosodic unit in which it arises. When a prosodic break occurs in connection with a fragmentation phenomenon, at the richer level of transcription the prosodic tagging is specified in accordance with the complete set of the following alternatives:

Symbol            Description                                                                              Type of break
+                 Concluding prosodic break such that the utterance is interrupted by the listener
                  or by the speaker himself                                                                Terminal
[/] (optional)    Non-conclusive prosodic break caused by a false start                                    Non-terminal
[//] (optional)   Non-conclusive prosodic break caused by a false start (retracting) such that the
                  linguistic material is only partially repeated                                           Non-terminal
[///] (optional)  Non-conclusive prosodic break caused by a false start (retracting) such that the
                  linguistic material is not repeated                                                      Non-terminal

3.3.5.1 Interruptions
The interruption (non-completion) of an utterance may be due to any reason: a change of the linguistic programming by the speaker, an interruption caused by the listener, or other events in the environment. Interruptions may be accompanied by word fragmentation (interruption before the end of the last word of the utterance) or, as is more frequently the case, may not feature any word fragmentation.

Interruption mark: +36
The interruption mark is counted as a kind of terminal break. The sign is inserted in the transcription in the position where the utterance is interrupted, because of an interruption made by the listener or because of a change in programming by the speaker (examples in Appendix 1).

36 No distinction connected to the possible causes of interruption is considered in this frame (e.g. the CHAT format marks when the interruption is caused by the listener). On the contrary, the format explicitly marks the distinction between interruption and intentional suspension, which frequently occurs at the end of utterances. Intentional suspension must be marked as a generic utterance limit "//", or specified as "…".

3.3.5.2 Retracting and/or restart and/or false start(s)
The retracting phenomenon (or false start) is the most frequent fragmentation phenomenon in spontaneous speech. The speaker hesitates while trying to find the best way to express himself and retracts his speech before choosing between two alternatives. This phenomenon is, generally, clearly distinguishable from the interruptions or changes in programming reported above, which do not feature the speaker's hesitation.
Contrary to interruptions, the retracting phenomenon is almost always accompanied by the repetition (complete or partial) of the linguistic material and clearly causes a loss of the informational value of the retracted material, which is abandoned by the speaker in favor of the chosen alternative. As in the case of interruption, in retracting phenomena the change of prosodic envelope is again necessary. In other words, the retracting between two elements cannot be accomplished in the same prosodic envelope. Therefore, retracting is always accompanied by a prosodic break, marked with the symbol "[/]".37

37 Retracting with complete or partial repetition can both be expressed by this symbol; therefore, in principle, all traditional CHAT [//] should be simplified to [/] in this system, although the use of both traditional CHAT symbols is tolerated in the C-ORAL-ROM format.

Retracting breaks are considered a type of non-terminal break and are highlighted only at richer levels of transcription. The symbol is inserted in the transcription after each set of fragments, in the position where a restart begins. At poor levels of transcription, retracting phenomena are not treated as a special kind of prosodic break caused by fragmentation, and only the generic non-terminal break sign is used after each set of fragments in the position where a restart begins. Examples of retracting are available in Appendix 1.

3.3.5.3 Retracting/interruption ambiguity
In some cases it is hard to decide whether a fragmentation phenomenon fits the definition of "restart" or of "interruption". This can be the case when an alternative to the locution in question is realized, but no repetition is involved. In this case a supplementary sign, "[///]", can optionally be used, marking the fact that a probable retracting phenomenon occurs with neither partial nor complete repetition of linguistic material. The ambiguous mark is counted as a non-terminal break.

3.3.6 Summary of prosodic break types

Symbol            Description                                                                              Type of break
//                Conclusive prosodic break                                                                Terminal
? (optional)      Conclusive prosodic break such that the utterance has an interrogative value             Terminal
… (optional)      Conclusive prosodic break such that the utterance is left intentionally suspended
                  by the speaker                                                                           Terminal
+                 Conclusive prosodic break such that the utterance is interrupted by the listener
                  or by the speaker himself                                                                Terminal
/                 Non-conclusive prosodic break                                                            Non-terminal
[/] (optional)    Non-conclusive prosodic break caused by a false start                                    Non-terminal
[//] (optional)   Non-conclusive prosodic break caused by a false start (retracting) such that the
                  linguistic material is only partially repeated                                           Non-terminal
[///] (optional)  Non-conclusive prosodic break caused by a false start (retracting) such that the
                  linguistic material is not repeated                                                      Non-terminal

3.3.7 Pauses
Pauses in the speech flow are indicated with "#" and are reported in the transcription only if clearly perceived as a significant interruption of speech fluency. Other pauses can be reported optionally. No distinction is made with respect to the length of the pause. The "#" symbol is not a sign of prosodic parsing and never substitutes the marks dedicated to prosodic breaks.

Symbol   Description                Definition
#        Pause in the speech flow   A perceptively relevant silence in the speech continuum, or in any case a silence longer than 250 ms.

(A sketch of utterance segmentation based on the summary in § 3.3.6 is given below.)
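Given the summary above, utterance segmentation follows mechanically from the terminal break symbols. A minimal sketch (Python); whitespace tokenization is an assumption, presupposing that break signs are space-separated, as in the examples of this section:

TERMINAL = {"//", "?", "...", "…", "+"}   # terminal break symbols (3.3.6)

def utterances(turn_text):
    """Split the text of a dialogic turn into utterances, i.e. stretches
    ending with a terminal prosodic break."""
    current, result = [], []
    for token in turn_text.split():
        current.append(token)
        if token in TERMINAL:
            result.append(" ".join(current))
            current = []
    if current:                 # trailing material without a terminal break
        result.append(" ".join(current))
    return result

print(utterances("I'm going home // I'm tired //"))
# ["I'm going home //", "I'm tired //"]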
3.4 Quality assurance on prosodic tagging
In C-ORAL-ROM each position between two words is considered a possible position for a prosodic break. The prosodic tagging is based only on perceptual judgments and does not require any specific linguistic knowledge, although the notion of speech act is always familiar to the transcribers who annotated the C-ORAL-ROM corpus. The annotation of terminal and non-terminal breaks has been accomplished by expert transcribers (PhDs and PhD students) with the following procedure:
1) tagging of prosodic breaks simultaneously with the transcription, by a first labeler;
2) revision of the tagging by a different labeler, in connection with the revision of the transcripts;
3) revision of the tagging, in connection with the alignment, by a third labeler: after the definition of each alignment unit, the labeler always challenges the presence of a terminal break.
This process ensures control on the inter-annotator relevance of tags and maximum accuracy in the detection of terminal breaks. The accuracy with respect to non-terminal breaks is by definition lower. The level of inter-annotator agreement on prosodic tag assignment has been evaluated by an external institution (LOQUENDO) on a statistically significant sampling of the C-ORAL-ROM corpus. The evaluation report is attached here in Appendix 4.

3.5. Dependent lines
Information of three types, regarding the text reported in a dialogic turn, is optionally given in non-textual lines following a dialogic turn (dependent lines in the CHAT tradition):
a) alternatives proposed for the transcription of the text;
b) comments regarding the pragmatic context and the visual modality of communication;
c) other comments.
Sign     Type of information
%act:    Actions of a participant while speaking
%sit:    Events or states of affairs occurring in the speech situation
%add:    The participant to whom the speech is addressed
%par:    Gestures or paralinguistic aspects of the speaker
%exp:    Explanations necessary for the understanding of the turn, or of signs in the text (hhh)
%amb:    Description of the setting in a media emission
%sce:    Description of the scene in a media emission
%alt:    Alternative transcription
%com:    Transcribers' comments

The link between the information in the dependent line and a word, or series of words, in the dialogic turn can be specified through a serial number, which indicates the position in the dialogic turn of the word referred to.38 In the following example, the alternative reported refers to the third word of the utterance:
*MAX: voglio mangiare // pasta //
%alt: (3) basta

38 In long monologues, where the serial number of a word is high, the ": /" convention can be used for the insertion of a dependent line immediately following the commented item.

3.6. Alignment
Alignment is the tagging of each textual string of a transcribed session with two tags corresponding to temporal information in the speech file:
a) start of the alignment unit: the temporal value corresponding to the start of the transcribed information in the speech file;
b) end of the alignment unit: the temporal value of the speech file corresponding to the end of the transcribed information.
In C-ORAL-ROM, the alignment information is stored in an .xml file placed in the same directory as the text file and the audio file.

3.6.1 Annotation procedure
The alignment of C-ORAL-ROM texts is performed after textual transcription and prosodic tagging. The alignment of transcribed texts is achieved by an expert operator with the assistance of the WinPitch Corpus alignment tool. The alignment tagging task consists in the insertion of a tag ($) in the text after each terminal break annotated in the transcript, while the audio is played at a reduced speed. After loading the text and the sound files to be aligned, the operator merely listens to the slow-rate speech playback (between 1 and 7 times real time) and clicks on the text segments as they are perceived, in accordance with the general choice adopted in the project to define a significant alignment unit (each string ending with a terminal break; see below). Automatic dispatching of speaker turns on alignment layers is provided. The editing of segment edges is achieved through user-friendly mouse commands, together with many other features such as the automatic scrolling of text and the dynamic adjustment of playback speed. As output of this process, the system assigns two temporal values to each alignment unit:
a) end of the alignment unit: the temporal value of the sound file at the instant in which the tag is inserted;
b) start of the alignment unit: the temporal value which marks the end of the previous segment.
(A sketch of this start/end bookkeeping is given after § 3.6.2.)

3.6.2. Prosodic tagging and the alignment unit
The alignment of C-ORAL-ROM relies on two choices:
a) specification of the alignment unit at utterance level (as previously defined);
b) rough equivalence between terminal breaks and utterance limits.
Each text is aligned with respect to the perceptively relevant terminal breaks annotated in the original transcripts. When the alignment conforms to the previous requirements, the alignment file corresponds to the acoustic database of all utterances of each speaker in the recorded session, labeled with the transcribed utterance. The French corpus has been aligned through pauses, each text string being surrounded by two pauses of more than 200 ms (automatically detected).
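The output step of the procedure in § 3.6.1 amounts to simple bookkeeping: each inserted tag closes one alignment unit at the tag's timestamp and opens the next one. A minimal sketch (Python; the function name and the use of seconds are illustrative):

def alignment_units(tag_times, session_start=0.0):
    """Turn the sequence of tag-insertion times (one per terminal break
    clicked by the operator) into (start, end) pairs: the end of each unit
    is the instant the tag was inserted, and its start is the end of the
    previous unit (or the start of the session for the first one)."""
    units, start = [], session_start
    for t in tag_times:
        units.append((start, t))
        start = t
    return units

print(alignment_units([2.4, 5.1, 9.8]))
# [(0.0, 2.4), (2.4, 5.1), (5.1, 9.8)]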
3.6.3 Quality assurance on the alignment
The expert operator in charge of the alignment (a PhD or PhD student) always considers whether the accomplished alignment unit truly corresponds, in his perception, to a speech segment ending with a terminal break. The operator may add or delete the terminal breaks annotated in the original transcripts in accordance with his personal perceptual perspective on the speech signal, thus improving the quality of the annotation. The correspondence of an aligned segment to perceptually relevant breaks is revised immediately after the tag insertion: the same operator revises the perceptual relevance of the aligned segments of the text and adjusts the edges if necessary. The alignment of overlapped strings is achieved with lower accuracy.

3.6.4 WinPitch Corpus
WinPitch Corpus is an innovative software program for the computer-aided alignment of large corpora. It is built around a general-purpose speech analyzer program, allowing real-time display of spectrographic and prosodic data. It provides an easy and precise method for the selection of alignment units, ranging from syllables to whole utterances up to dialogic turns, in a hierarchical aligned data-storing system. The method is based on the ability to visually link a moving target with the perception of the corresponding speech sound, played back at a rate reduced by at least 30%. Listening to slower speech, an operator is able to highlight with a mouse click the segments of text corresponding to the speech sound perceived, and to generate bidirectional speech-text pointers defining the alignment. This method has the advantage, over emerging automatic processes, of being effective even for poor quality speech recordings, or in case of speakers' voice overlaps. Existing text transcriptions can be quickly aligned, and text entering and editing is also possible on the fly. Speech playback speed variability is implemented by a streaming modified PSOLA-type synthesizer, which performs, in real time, the fundamental frequency tracking, period marking (for voiced segments) and additive synthesis required for good quality playback. Segments deriving from the alignment can be defined on eight independent layers, with automatic generation of the corresponding database, which can be saved directly in both XML and Excel formats. The alignment database allows direct access to the related speech segment, with automatic display of the spectrogram, waveform, fundamental frequency (F0) and intensity curves. Bi-directional access between text and sound segments is also possible. Besides text-to-speech alignment, WinPitch Corpus has numerous features which allow an easy and efficient acoustic analysis of speech, such as real-time fundamental frequency tracking, spectrographic display, re-synthesis after editing of prosodic parameters, etc. The F0 tracking is based on the popular and robust spectral comb method. The streaming process allows continuous operation on sound files of any length (within total computer memory limits), ensuring a very efficient use of the program.

3.6.5 DTD of WinPitch Corpus alignment files

<!-- DTD : WinPitch Corpus -->
<!-- Version : 1 -->
<!ELEMENT Alignment (TimeStamp, WinPitch, Trans, Layer1, Layer2, Layer3, Layer4, Layer5, Layer6, Layer7, Layer8, UNIT*)>
<!ELEMENT TimeStamp (#PCDATA)>
<!ELEMENT WinPitch (#PCDATA)>
<!ELEMENT Trans (#PCDATA)>
<!ELEMENT Layer1 (#PCDATA)>
<!ELEMENT Layer2 (#PCDATA)>
<!ELEMENT Layer3 (#PCDATA)>
<!ELEMENT Layer4 (#PCDATA)>
<!ELEMENT Layer5 (#PCDATA)>
<!ELEMENT Layer6 (#PCDATA)>
<!ELEMENT Layer7 (#PCDATA)>
<!ELEMENT Layer8 (#PCDATA)>
<!ELEMENT UNIT (#PCDATA)>
<!ATTLIST TimeStamp Value CDATA #REQUIRED>
<!ATTLIST WinPitch Program CDATA #REQUIRED Version CDATA #REQUIRED>
<!ATTLIST Trans version CDATA #REQUIRED creationDate CDATA #REQUIRED audioFilename CDATA #REQUIRED textFilename CDATA #REQUIRED>
<!ATTLIST Layer1 Name CDATA #REQUIRED ID CDATA #REQUIRED Short CDATA #REQUIRED Color CDATA #REQUIRED>
<!-- Layer2 through Layer8 carry the same attribute list as Layer1 -->
<!ATTLIST UNIT speaker CDATA #REQUIRED startTime CDATA #REQUIRED endTime CDATA #REQUIRED Channel CDATA #REQUIRED Tag CDATA #REQUIRED>
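Following this DTD, each UNIT element carries the speaker, the start and end times, a channel and a tag. A minimal reader can be sketched with the Python standard library; it assumes numeric timestamp attributes and that the transcribed text is the element's PCDATA content (rather than the Tag attribute), both points to be checked against the delivered files:

import xml.etree.ElementTree as ET

def read_alignment(path):
    """Collect the UNIT elements of a WinPitch Corpus alignment file.

    Per the DTD above, UNIT has the required attributes speaker,
    startTime, endTime, Channel and Tag, and #PCDATA content.
    Assumptions: timestamps are plain numbers; the aligned text is
    the PCDATA content of UNIT.
    """
    units = []
    for u in ET.parse(path).getroot().iter("UNIT"):
        units.append((u.get("speaker"),
                      float(u.get("startTime")),
                      float(u.get("endTime")),
                      (u.text or "").strip()))
    return units

# e.g.: for spk, t0, t1, text in read_alignment("pnatpd01_1.xml"):
#           print(spk, t0, t1, text)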
3.7. PoS tagging and lemmatization
The C-ORAL-ROM corpus comprises, for each session, a .txt file in which a "Part of Speech" label and a "Lemma" label are assigned to each form of the transcribed text. The C-ORAL-ROM project was not aimed at solving the puzzling questions of automatic PoS tagging of speech. The project's main goal is to provide a set of empirical data on spontaneous speech tagging, so as to improve the state of the art for further analysis of spontaneous speech PoS tagging. Nevertheless, the PoS-tagged edition of the C-ORAL-ROM corpus also allows a better exploitation of the resource for linguistic research and language technology purposes. C-ORAL-ROM's four sub-corpora are automatically tagged with the language technologies already available for each language, and the project provides an evaluation of the level of accuracy reached by those tools on spontaneous speech. Tagging has been accomplished using different tag sets for each language, because of the differences among the languages and the different traditions in natural language processing systems.

Each team defined a tagging strategy for what concerns:
1) the tool;
2) the tag set;
3) the main choices: multi-word expressions, proper names, auxiliaries and compound tenses.
The four tag sets, reported below, despite some variations due to the language, the tool adopted and the tradition of each team, are all oriented towards the EAGLES standard and are for this reason highly comparable. The comparison tables are given in Appendix 2. For each language resource, details on the tool used and on the accuracy reached are reported below. In order to ensure a level of comparability within the whole corpus, a compulsory minimal threshold of information has been established for what regards:
1. minimal tag set requirements;
2. tagging format;
3. frequency lists format.

3.7.1. Minimal Tag Set requirements
1. The Part of Speech (PoS) tag is compulsory for each class of words and, if applicable, for locutions;
2. Specifications on Mood and Tense features are compulsory for Verbs, or, alternatively, the feature Finite/Non-finite [± person], necessary for the detection of verb-less utterances, must be specified;
3. Specification of the Coordinative/Subordinative feature is compulsory for Conjunctions;
4. Specification of the Common/Proper feature is compulsory for Nouns.
The common level of morpho-syntactic tagging involves a distinction within the set of non-properly linguistic elements that are peculiar to spontaneous speech texts, providing two special tags to this aim (the code of the tag depends on the specific tag set) for:
a) para-linguistic elements, e.g. mh, he, ehm (used for filled pauses etc.);
b) extra-linguistic elements, e.g. hhh (used for laughs, coughs etc.).

3.7.2. Tagging Format
The output of the morpho-syntactically tagged text represents both the speaker codes and the prosodic breaks, in order to allow context-bound grammatical studies (within utterances or tone units). The text is given horizontally, with respect to the legibility of dialogic turns. The tag for each lexical entry is composed of three elements divided by backslash separators (\), as follows:
- the first element is the word-form;
- the second element is the LEMMA (in capital letters);
- the third element is the code which includes the PoS category (in capital letters) and optionally the morpho-syntactic description (msd) of the word form.
The result is a pattern with the following structure:
wordform\LEMMA\POSmsd
e.g.
*CAR: come\COME\B andò\ANDARE\Vs3ir / a\A\E casa\CASA\S vostra\VOSTRO\POS ?
Regarding multiwords, there are two possibilities for their tagging, with respect to the different ways to identify locutions in each tag set, e.g.:
(a) a_priori\A_PRIORI\B
(b) a\A_PRIORI\B1 priori\A_PRIORI\B2

3.7.3. Frequency Lists Format
The output of the frequency lists is standardly encoded, to ensure the highest comparability between the four romance languages. The frequency lists are presented in two formats:
a) by lemmas, featuring 4 columns (rank, lemma, POS, frequency), e.g.:

rank   LEMMA    POS   frequency
=================================
1.     IL       R     4213
2.     ESSERE   V     3840
...    ...      ...   ...
32.    BELLO    A     352
33.    CASA     S     289
...    ...      ...   ...

b) by word-forms, featuring 5 columns (rank, form, lemma, POSmsd, frequency), e.g.:

rank   form    LEMMA    POSmsd   frequency
==============================================
1.     il      IL       R        2214
2.     è       ESSERE   Vs3ip    1587
...    ...     ...      ...      ...
39.    bello   BELLO    A        125
...    ...     ...      ...      ...
46.    bella   BELLO    A        109
...    ...     ...      ...      ...
150.   era     ESSERE   Vs3ii    56
...    ...     ...      ...      ...
589.   era     ERA      S        22

(A sketch of how such tagged text can be turned into a lemma frequency list is given below.)
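As an illustration of the tagging and frequency-list formats, the following sketch reads tokens of the form wordform\LEMMA\POSmsd and produces rows of format (a). One assumption is made: the bare PoS category is taken to be the initial run of capital letters of the POSmsd code (e.g. 'Vs3ir' gives 'V'), which holds for the example tags in this section but is not guaranteed by every language's tag set.

from collections import Counter
from itertools import takewhile

def parse_token(token):
    """Split 'wordform\\LEMMA\\POSmsd' (the pattern defined above)."""
    wordform, lemma, posmsd = token.split("\\")
    return wordform, lemma, posmsd

def lemma_frequency_list(tagged_text):
    """Build format (a) above: (rank, LEMMA, POS, frequency) rows."""
    counts = Counter()
    for token in tagged_text.split():
        if token.count("\\") != 2:
            continue                     # speaker labels, prosodic breaks, '?'
        _, lemma, posmsd = parse_token(token)
        pos = "".join(takewhile(str.isupper, posmsd))   # assumption, see above
        counts[(lemma, pos)] += 1
    return [(rank, lemma, pos, freq)
            for rank, ((lemma, pos), freq) in enumerate(counts.most_common(), 1)]

line = "*CAR: come\\COME\\B andò\\ANDARE\\Vs3ir / a\\A\\E casa\\CASA\\S vostra\\VOSTRO\\POS ?"
for row in lemma_frequency_list(line):
    print(row)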
3.7.4 Tag sets

3.7.4.1 French tagset
======================
Verbs
-----
VER:CON:PRE   Conditional, present
VER:IMP:PRE   Imperative, present
VER:IND:FUT   Indicative, future
VER:IND:IMP   Indicative, imperfect
VER:IND:PAS   Indicative, past
VER:IND:PRE   Indicative, present
VER:INF       Infinitive
VER:PAR:PAS   Participle, past
VER:PAR:PRE   Participle, present
VER:SUB:IMP   Subjunctive, imperfect
VER:SUB:PRE   Subjunctive, present

Nouns
-----
NOM:COM   Common
NOM:PRO   Proper

Adjectives
----------
ADJ:ORD   Ordinal
ADJ:QUA   Qualifying

Conjunctions
------------
CON:COO   Coordination
CON:SUB   Subordination

Determiners
-----------
DET:DEF   Definite
DET:DEM   Demonstrative
DET:IND   Indefinite
DET:INT   Interrogative
DET:POS   Possessive

Pronouns
--------
PRO:DEM   Demonstrative
PRO:IND   Indefinite
PRO:PER   Personal
PRO:POS   Possessive
PRO:RIN   Relative/interrogative

Numerals
--------
NUM

Adverbs
-------
ADV

Prepositions
------------
PRE

Interjections and discourse particles
-------------------------------------
INT

Unclassifiable
--------------
XXX:ETR   Foreign word
XXX:EUP   Euphonic particle (-t-, l')
XXX:TIT   Title

3.7.4.2 Italian tagset
======================
Verbs (main tags)
-----------------
V     Main verb
VW    Non-main verb
V_E   Verb + clitic

Verb features (finite forms)
----------------------------
number:  s singular; p plural
person:  1 first; 2 second; 3 third
mood:    i indicative; c subjunctive; d conditional; m imperative
tense:   p present; r past; i imperfect; f future

Verb features (non-finite forms)
--------------------------------
gender (participles): m masculine; f feminine; n common
number (participles): s singular; p plural
mood:    f infinite; p participle; g gerund
tense:   p present; r past

Nouns
-----
S     Common
SP    Proper

Adjectives
----------
A

Adverbs
-------
B

Prepositions
------------
E     Simple
E_R   Fused (+ article)

Conjunctions
------------
CC    Coordination
CS    Subordination

Articles
--------
R

Demonstratives
--------------
DIM

Indefinites
-----------
IND

Personals
---------
PER

Possessives
-----------
POS

Relatives/Interrogatives
------------------------
REL

Numerals
--------
N     Cardinal
NA    Ordinal

Interjections
-------------
I

Non-Standard Linguistic Elements
--------------------------------
(PoS)+K   Foreign word
(PoS)+Z   Invented word
ONO       Onomatopoeia
ACQ       Acquisition

Non-Linguistic Elements
-----------------------
PLG   Paralinguistic
XLG   Extralinguistic
X     Non-understandable word
3.7.4.3 Portuguese tagset
=========================
MAIN VERBS
----------
Vpi     Present Indicative
Vppi    Past Indicative
Vii     Imperfect Indicative
Vmpi    Pluperfect Indicative
Vfi     Future Indicative
Vc      Conditional
Vpc     Present Subjunctive
Vic     Imperfect Subjunctive
Vfc     Future Subjunctive
VB      Infinitive
VBf     Inflected Infinitive
VG      Gerundive
Vimp    Imperative
VPP     Past Participles (in compound tenses)
PPA     Adjectival Past Participles

AUXILIARY VERBS
---------------
VAUXpi   Present Indicative
VAUXppi  Past Indicative
VAUXii   Imperfect Indicative
VAUXmpi  Pluperfect Indicative
VAUXfi   Future Indicative
VAUXc    Conditional
VAUXpc   Present Subjunctive
VAUXic   Imperfect Subjunctive
VAUXfc   Future Subjunctive
VAUXB    Infinitive
VAUXBf   Inflected Infinitive
VAUXG    Gerundive
VAUXimp  Imperative

NOUN
----
Np   Proper Noun
Nc   Common Noun

ADJECTIVE
---------
ADJ

ADVERB
------
ADV

PREPOSITION
-----------
PREP

CONJUNCTION
-----------
CONJc   Coordinative
CONJs   Subordinative

ARTICLE
-------
ARTi   Indefinite
ARTd   Definite

DEMONSTRATIVE
-------------
DEMi   Invariable
DEMv   Variable

INDEFINITE
----------
INDi   Invariable
INDv   Variable

POSSESSIVE
----------
POS

RELATIVE/INTERROGATIVE/EXCLAMATIVE
----------------------------------
RELi   Invariable
RELv   Variable

PERSONAL PRONOUN
----------------
PES

CLITIC
------
CL

NUMERAL
-------
NUMc   Cardinal
NUMo   Ordinal

INTERJECTION
------------
INT

ADVERBIAL LOCUTION
------------------
LADV

PREPOSITIONAL LOCUTION
----------------------
LPREP

CONJUNCTIONAL LOCUTION
----------------------
LCONJ

PRONOMINAL LOCUTION
-------------------
LPRON

DISCURSIVE LOCUTION
-------------------
LD

EMPHATIC
--------
ENF

FOREIGN WORD
------------
ESTR

ACRONYM
-------
SIGL

EXTRA-LINGUISTIC
----------------
EL

PARA-LINGUISTIC
---------------
PL

WITHOUT CLASSIFICATION
----------------------
SC

WORD IMPOSSIBLE TO TRANSCRIBE
-----------------------------
Pimp

SEQUENCE IMPOSSIBLE TO TRANSCRIBE
---------------------------------
Simp

FRAGMENTED WORD OR FILLED PAUSE
-------------------------------
FRAG

DISCOURSE MARKER
----------------
MD

SUB-TAGS
--------
:    Ambiguous form
+    Contracted forms
-    Hyphenated forms (excepting compounds)
3.7.4.4 Spanish tagset
======================

VERBS
-----
V   Verb

Verb features (finite forms)
----------------------------
number:  s singular;  p plural
person:  1 first;  2 second;  3 third
mood:    ind indicative;  sub subjunctive;  cond conditional;  imp imperative
tense:   p present;  s past simple;  f future simple;  indef past simple (indefinido);  i imperfect

Verb features (non-finite forms)
--------------------------------
gender (participles):  M masculine;  F feminine
number (participles):  S singular;  P plural
mood:   inf infinitive;  ger gerund;  par participle
tense:  P past (participles)

AUXILIARY VERBS
---------------
AUX   auxiliary verb (with the same finite and non-finite features as main verbs)

NOUNS
-----
N   Noun

Features:
gender:  m masculine;  f feminine;  ig invariable in gender
number:  s singular;  p plural;  in invariable in number
other:   i invariable (in general);  P Proper

ADJECTIVES
----------
ADJ   Adjective

Features:
gender:  m masculine;  f feminine;  ig invariable (in gender)
number:  s singular;  p plural;  in invariable (in number)
other:   i invariable (in general)

ADVERBS
-------
ADV   Adverb

PREPOSITIONS
------------
PREP   Preposition

CONJUNCTIONS
------------
C   Conjunction

DETERMINERS
-----------
DET   Determiner

Features:
type:    d article;  dem demonstrative (adjective);  poss possessive (adjective)
gender:  m masculine;  f feminine;  ig invariable in gender
number:  s singular;  p plural

PRONOUNS
--------
P   Pronoun

Features:
PER      Personal (Pronoun)
person:  1 first;  2 second;  3 third
gender:  M masculine;  F feminine;  IG invariable in gender
number:  s singular;  p plural
R        relative pronoun

QUANTIFIERS
-----------
Q   Quantifier

INTERJECTIONS
-------------
INT   Interjection

DISCOURSE MARKERS
-----------------
MD   discourse marker

3.7.5 Automatic PoS tagging: tool and evaluation

3.7.5.1 Italian: tool and evaluation

The automatic procedure of lemmatization and morpho-syntactic annotation of the Italian C-ORAL-ROM corpus is based on the PiSystem set of tools, created and developed by Eugenio Picchi within the ILC (Pisa). PiSystem (http://www.ilc.cnr.it/pisystem/) is an integrated procedure for textual and lexical analysis which consists of the following main components:
1. the DBT text encoding and analysis modules;
2. a morpho-syntactic analyzer (PiMorpho);
3. a Part-of-Speech tagger and lemmatizer (PiTagger).

The DBT encoding provides a first parsing of the text (tokenization) and represents the pre-analysis level of the morphological analyzer. The encoded text is then passed to PiMorpho, which assigns all the possible alternatives of Morpho-Syntactic Description (MSD) to each lexical item. For PoS disambiguation, PiTagger uses two further input resources, an electronic dictionary and a training corpus. In detail, these resources consist of:
a. the DMI, a morphological dictionary of the Italian language developed within the ILC at CNR, Pisa; it collects 106,090 lemmas encoded with PoS specifications and inflectional tags (Zampolli and Ferrari 1979; Calzolari, Ceccotti and Roventini 1983);
b. a Training Corpus of 50,000 words, manually tagged;
c. a statistical database extracted from the Training Corpus (BDR).

These resources are built on a tag set for PoS and morpho-syntactic annotation which closely follows the EAGLES tag set. The disambiguation phase relies on statistical measurements (on trigrams) extracted from the Training Corpus and stored in the BDR. The main program of this procedure estimates the maximum-likelihood pattern among the possible alternatives given by the morphological component, with a transitional probabilistic method (Picchi 1994). In this environment, the level and the precision of the analysis depend on the information archived in the BDR. The Training Corpus (manually tagged with lemma, PoS and MSD) is the source of the statistics on word-form associations. Statistics must be defined at a specific level of linguistic information (lexical patterns or PoS sequences). The BDR uses a hybrid set of specifications called Disambiguation Tags, which defines the threshold of the analysis relevant for the statistical measurements, i.e.:
a. the PoS codes;
b. the lemmas ESSERE and AVERE;
c. the MSD tags on the non-finite moods of verbs.
This information is the same used by PiTagger in the disambiguation phase.
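PiSystem itself is not reproduced in these specifications. Purely as an illustration of the transitional probabilistic method described above, the following is a minimal sketch of maximum-likelihood disambiguation over tag trigrams; the tag inventory, the counts standing in for the BDR statistics and the smoothing are invented for the example:

    from math import log

    # Invented trigram counts over PoS tags, standing in for the statistics
    # stored in the BDR ("#" marks the utterance boundary).
    TRIGRAMS = {("R", "S", "V"): 120, ("R", "S", "B"): 15}
    START = "#"

    def trigram_logprob(t1, t2, t3):
        """Add-one smoothed log-probability of the tag trigram (t1, t2, t3)."""
        context = sum(c for (a, b, _), c in TRIGRAMS.items() if (a, b) == (t1, t2))
        return log((TRIGRAMS.get((t1, t2, t3), 0) + 1) / (context + 10))

    def disambiguate(alternatives):
        """Maximum-likelihood tag sequence among the alternatives proposed by
        the morphological component (one list of candidate tags per token)."""
        best = {(START, START): (0.0, [])}  # Viterbi state: (previous, current) tag
        for tags in alternatives:
            new_best = {}
            for (t1, t2), (score, path) in best.items():
                for t3 in tags:
                    s = score + trigram_logprob(t1, t2, t3)
                    if s > new_best.get((t2, t3), (float("-inf"), None))[0]:
                        new_best[(t2, t3)] = (s, path + [t3])
            best = new_best
        return max(best.values(), key=lambda sp: sp[0])[1]

    # An article-noun context followed by a token ambiguous between verb (V)
    # and noun (S), like "era" (ESSERE vs. ERA): the verb reading wins.
    print(disambiguate([["R"], ["S"], ["V", "S"]]))  # -> ['R', 'S', 'V']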
Tag Set

The PoS tag set used for the Italian C-ORAL-ROM is, for the greater part, in agreement with the EAGLES recommendations for the morpho-syntactic annotation of the Italian language. The C-ORAL-ROM tag set features some adjustments in two specific category spaces:
1. verbs, in order to distinguish main verbal instances from non-main ones (auxiliaries and copulas);
2. pronouns and determiners, in order to achieve a semantically oriented classification of these lexical objects (not depending on their functional value).

Main choices in the PoS tag set

a. Main and non-main verbs. The morpho-syntactic description of verbal forms is a crucial problem for automatic PoS tagging. At the current state of the art it is not easy to achieve a good level of automatic recognition of both auxiliaries and copulas. The Italian C-ORAL-ROM worked around this problem by encoding both auxiliaries (ESSERE and AVERE) and copulas (ESSERE) with the [non-main] feature. An automatic post-editing procedure then distinguishes the different uses of these verbs: MAIN ones (fully semantic, predicative verbs) and NON-MAIN ones (support verbs: auxiliaries and copulas). The tag dedicated to NON-MAIN verbs is a capital W following V, the main tag.

b. Pronouns and pronominal adjectives. With regard to the categorical space outlined by the traditional categories of "pronouns" and "pronominal adjectives", the Italian C-ORAL-ROM tag set departs from the EAGLES standards. Specifically, there is no distinction between pronominal and adjectival uses of the possessives, indefinites and demonstratives, which all merge into the same category. This choice follows the strategy used in the tag set for the Portuguese C-ORAL-ROM. The relation between the Italian categories in EAGLES and in C-ORAL-ROM can be summarized as follows: the EAGLES tag set is structured on a sub-categorization system within Pronouns and Determiners (macro-categories which represent the general PoS tags); the Italian C-ORAL-ROM tag set, on the contrary, following an extensive criterion, treats each category as a PoS of its own, merging the pronominal and the adjectival uses of the same lemmas.

Extended tag set for spoken language

The spoken language transcripts of the C-ORAL-ROM corpora contain elements of different types.
More specifically, besides the linguistic elements which belong to the dictionary, there is also a wide variety of non-standard linguistic forms in the corpora. The following cases have been distinguished:
a. foreign words: (PoS)+[K] (they\PERK);
b. new formations: (PoS)+[Z] (bruttozzi\AZ, torniante\SZ);
c. onomatopoeia: [ONO] (fffs\ONO, zun\ONO);
d. language acquisition forms: [ACQ] (aua\ACQ, cutta\ACQ).

Moreover, a wide series of non-linguistic phenomena may also be involved in the speech flow. Such phenomena are identified in the transcripts following the C-ORAL-ROM format and are encoded in the tagged resource as special elements:
a. para-linguistic elements: [PLG]
   - word fragments (&pa\PLG, &costrui\PLG);
   - phonetic support elements and pause fillers (&he\PLG, &mh\PLG);
b. extra-linguistic elements (laughs and coughs): [XLG] (hhh\XLG);
c. non-understandable words: [X] (xxx\X).

EVALUATION

The C-ORAL-ROM Italian tagged resource comprises 306,638 tokens. Since the non-standard and regional forms were inserted in a special pre-dictionary (1,985 word forms), the PiTagger system reached a 100% recall over the tokens:

total number of tokens:   306,638
tagged tokens:            306,638
total recall:             100%

The evaluation of the precision of the automatic PoS-tagging procedure is based on a random sample of 1/100 of the tokens, picked out of the whole C-ORAL-ROM Italian resource. Each token is extracted from a different (non-contiguous) utterance, also randomly selected. The evaluators had to express a judgment on the correctness of the tagging of the selected word in its utterance context. The size of the sample is sufficient from a statistical point of view, as it ensures a 95% confidence interval narrower than 1%.

The manual revision of the tagged samples evaluates the automatic procedure with respect to different degrees of accuracy: errors in sub-categorization, errors in the morpho-syntactic description of verbs, and errors in the main tag category and in lemma assignment. The statistical precision at these levels of annotation is shown in the following table:

Table 1. Precision of the automatic tagging procedure

Total Sampling                             3100
Non Decidable (39)                           31     1.00%
Total Evaluated                            3069
All Correct                                2726    88.82%
Correct PoS tag                            2773    90.36%
1) PoS tag errors                           296     9.64%
2) Lemma errors                               2     0.07%
3) Sub-categorization errors                 38     1.24%
4) Morpho-syntactic description errors        7     0.23%
Total Error                                 343    11.18%

The accuracy figures of the annotation system show that the more serious errors (PoS tag and lemma errors) concern around 10% of the tokens (40). The results of the evaluation (restricted to the PoS level) are shown in a confusion matrix, listing the errors in tag assignment with respect to each word class: in each line, the errors are recorded by category, while the columns report the correct PoS to which the token belongs (41). The last column records the number of cases in which the PoS is over-extended, while the bottom line represents cases of under-extension.

39 The 31 non-decidable cases (second row in the table) are words which are under-specified with regard to their PoS value in the given context (i.e. it is impossible to express a disambiguation judgment). These cases (1% of the total) have not been counted as part of the evaluation samples, e.g. che\CHE\Conj?Relative? mi\MI\PER piace\PIACERE\Vs3ip tanto\TANTO\B // [which/that I like so much]
40 Tested on a corpus of official documents of the EU Commission (500,000 tokens, reviewed by Enrica Calchini), PiTagger reached a 97% degree of accuracy. The same recognition rate was reached on the LABLITA literary sampling corpus (60,000 tokens).

41 For example, the number 30 (second line, third column) corresponds to 30 tokens wrongly tagged as nouns (S) which should instead have been tagged as adjectives (A).

Table 2. Confusion matrix
[The matrix, with one row per assigned PoS and one column per correct PoS over the categories V, S, A, B, R, E, C, DIM, IND, PER, POS, REL, N, NA and I, is not legibly reproducible here; its cells account for the 296 PoS tag errors of Table 1.]

The collected data show that most mistakes occur in the Nouns category (103 out of 296 total errors). Considering the frequency of the category in the evaluation corpus, the projected over-extension of this word class is around 15%. This probably depends on the statistical normalization applied by the PiTagger system, which, very roughly, assigns to Nouns the highest probability of occurrence (see Picchi 1994). Another interesting result is that the Adverb and Interjection categories are consistently under-extended with respect to their actual weight (over 10% under-extension for Adverbs and 8.5% for Interjections). Verbs show a lower incidence of errors (3.95% over-extension and 3.44% under-extension), with a roughly correct projection of the frequency of the category on the total.

From the confusion matrix above it is possible to obtain precision, recall and f-measure for each category; the following table details these measurements, which give an overall estimate of the automatic tagging procedure (42):

Table 3. Precision, recall and f-measure for each PoS

PoS     tp    fp    fn    precision   recall     f-measure
DIM     65     0     0    100.00%     100.00%    1.0000
E      253     7    11     97.31%      95.83%    0.9656
V      559    23    20     96.05%      96.55%    0.9630
POS*    10     1     0     90.91%     100.00%    0.9524
R      181    11    14     94.27%      92.82%    0.9354
I      195     6    23     97.01%      89.45%    0.9308
B      502    25    85     95.26%      85.52%    0.9013
PER    170    27    11     86.29%      93.92%    0.8995
S      440   103    19     81.03%      95.86%    0.8782
N       40     1    11     97.56%      78.43%    0.8696
C      200    32    39     86.21%      83.68%    0.8493
IND     50    21     2     70.42%      96.15%    0.8130
A       90    27    33     76.92%      73.17%    0.7500
REL*    24    12    23     66.67%      51.06%    0.5783
NA*      2     0     5    100.00%      28.57%    0.4444

42 The lines in the table are sorted by the f-measure value (last column), an overall standard measurement of the general accuracy of automatic procedures. The PoS tags featuring a higher f-measure value are the ones assigned with higher accuracy. For the PoS marked with an asterisk in the first column, the number of occurrences in the sampling corpus is too low to ensure an adequate evaluation.
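Table 3 follows directly from the per-category counts of true positives (tp), false positives (fp) and false negatives (fn). The evaluators' own tooling is not described; a minimal sketch of the computation:

    def prf(tp: int, fp: int, fn: int):
        """Precision, recall and f-measure from raw counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_measure = 2 * precision * recall / (precision + recall)
        return precision, recall, f_measure

    # Counts for the category E (prepositions) from Table 3.
    p, r, f = prf(tp=253, fp=7, fn=11)
    print(f"{p:.2%} {r:.2%} {f:.4f}")  # 97.31% 95.83% 0.9656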
3.7.5.2 French: tool and evaluation

Tagset

The most notable features of the tagset used for the French corpus are (in alphabetical order):
o The distinction between two categories of adjectives: qualifying and ordinal. Other words that are traditionally called adjectives are listed under determiners or numerals.
o The grouping of interjections and "discourse particles".
o The grouping of cardinals in a "numeral" category.
o The fusion of prepositions + determiners (e.g. du = de + le), which simply received the tag corresponding to the determiner (e.g. du/DET:DEF).
o The fact that relative and interrogative pronouns were not distinguished (qui, quoi, etc.), as this remains unachievable with current automatic processing technology.
o The traditional distinction between the conjunction que and the relative pronoun que, which has been maintained in spite of strong arguments found in modern linguistics in favor of a complementizer (conjunction) analysis of the "relative" que (both in relative clauses and in cleft sentences). This is because many possible future syntactic analyses could be based on the distinction between ordinary and movement clauses (43).
o A residual XXX category containing unclassifiable cases such as foreign words, titles or "euphonic particles" that have no precise linguistic function (-t-, l' before on).
o Multi-word expressions were detected in every grammatical category.

Tool

The French tagging strategy is a rather complex one. It is based on Cordial Analyzer, which at the moment is probably the best morpho-syntactic tagger for French. Cordial Analyzer was developed by the Synapse Development company (44) with considerable input from our team. It uses the technology and modules developed by Synapse, which have been incorporated into Microsoft Word's French spelling corrector. The characteristics of this tool are:
• a very large dictionary (about 142,000 lemmas and 900,000 forms);
• a combination of statistics and detailed linguistic rules;
• shallow parsing capabilities, which make it possible to take into account relations other than strictly local ones;
• a remarkable robustness with respect to errors and non-standard language.

This last feature (which comes from the sophisticated error-correction modules) is particularly important for processing spoken corpora, and explains to a fair degree the very high results obtained on the French C-ORAL-ROM tagging (see below). For example, the tagger is capable of detecting repetitions, such as le le euh le chien, and is not fooled by occurrences that are not "normal" bigrams in the language, such as le le. Such repetitions very often lead to tagging errors in most taggers (usually trained on written corpora).

43 Our tagging allows automatic retrieval of both structures. This decision could be considered somewhat contradictory with the conjunction analysis of the comparative que: it has indeed been argued that the comparative clause que has properties of movement. In that case as well, the decision is a practical one: comparative structures can easily be retrieved using the quantifier that is always correlated with this type of que clause.

44 http://www.synapse-fr.com

Cordial Analyzer has been supplemented by our own dictionary and by various pre- and post-processing modules developed by our team, which adapt it to spoken language and correct a number of residual errors. These modules also change some of Cordial's tagging decisions. The most noticeable example concerns "discourse particles" such as bon or quoi, which are not treated at all by Cordial, given its orientation towards written language. One of our post-processing modules changes the original tagging when appropriate, using linguistic rules based on the local context. Examples:

le chocolat est bon\ADJ => no change
et alors bon\ADV je lui ai dis => bon\INT
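The post-processing rules themselves are not listed in this document. Purely as an illustration of the kind of local-context retagging described above, here is a minimal sketch; the context test, the tag names and the token format are simplifying assumptions:

    import re

    # A tagged token is rendered as "form\TAG", as in the examples above.
    TOKEN = re.compile(r"(?P<form>\S+)\\(?P<tag>[A-Z:]+)")

    def retag_bon(tagged: str) -> str:
        """Retag 'bon' as an interjection unless it follows a copular form;
        this test is a crude stand-in for the real local-context rules."""
        tokens = TOKEN.findall(tagged)
        out = []
        for i, (form, tag) in enumerate(tokens):
            if form == "bon" and tag in {"ADJ", "ADV"}:
                prev = tokens[i - 1][0] if i > 0 else ""
                if prev not in {"est", "était", "sera"}:
                    tag = "INT"
            out.append(form + "\\" + tag)
        return " ".join(out)

    print(retag_bon(r"le\DET:DEF chocolat\NOM:COM est\VER:IND:PRE bon\ADJ"))
    # -> no change: bon stays ADJ after "est"
    print(retag_bon(r"et\CON:COO alors\ADV bon\ADV je\PRO:PER lui\PRO:PER ai\VER:IND:PRE dis\VER:PAR:PAS"))
    # -> ... bon\INT ...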
Since Cordial can flag spelling errors and/or unknown words, the tagging process enabled us to detect spelling errors that remained in the transcribed corpus. The errors were manually corrected before the final tagging.

After correcting the spelling errors in the transcription, only 311 tokens remained unknown to the tagger (200 different types). All these tokens were checked manually and added to an ad hoc dictionary used by a final tool that tagged them appropriately. Among these tokens, we found neologisms (C-plus-plussien), rare or specialized words (carbichounette), familiar abbreviations (ophtalmo), alphanumeric acronyms and codes (Z14), and foreign words (tribulum, brownies).

Evaluation

In the case of the unknown words described above, the system still attributed a tag like NOM:COM (which in many cases is right: the ambiguity was mostly with XXX:ETR). In that sense, system recall was 100%. However, since an ad hoc dictionary was created, it is probably fairer to exclude the 311 tokens concerned from the recall. Even this does not change the results much, since it applies to a corpus of about 300,000 tokens: recall measured this way is still about 99.9%.

The precision figure is more interesting. It was evaluated by drawing a 20-token sample from each of the 164 texts composing the corpus, i.e. a sub-corpus of 3,280 tokens, or about 1/100th of the entire corpus (by token we mean either a single word or a multiword unit). Elementary statistics show that this size is enough to ensure a 95% confidence interval no larger than 1%, and therefore permits a very precise evaluation, as will be shown below. The tagging was checked and corrected manually, and the errors were categorized according to several criteria:
• error on main category;
• main category correct, but error on subtype;
• error on lemma;
• error on multiword grouping.

The tagger's behavior was excellent: only 58 tokens presented an error of one type or the other (or occasionally two errors combined). This amounts to a 1.77% error rate, i.e. a precision of 98.23%. This is a very high figure by current standards, especially for spoken corpora. Table 6 lists the errors by type. The rightmost column shows a 95% confidence interval (computed using the binomial distribution). This can be used to evaluate the impact of possible variations in sampling, as well as the adequacy of the sample size. We can see that in all cases the confidence interval is smaller than 1%, which is more than enough for this type of evaluation, given that the disagreement among linguists on the correctness of tags is probably of the same order, if not greater.

Table 6. Distribution of error types

Type of error          nb of errors   % error   precision   95% CI
Cat                         41         1.25%     98.75%     98.3% - 99.1%
SubCat                       8         0.24%     99.76%     99.5% - 99.9%
Tag (Cat or SubCat)         49         1.49%     98.51%     98.0% - 98.9%
Lemma                        4         0.12%     99.88%     99.7% - 100.0%
Multiword                    8         0.24%     99.76%     99.5% - 99.9%
Any                         58         1.77%     98.23%     97.7% - 98.7%
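The exact method behind the reported intervals is not spelled out beyond "the binomial law"; as an illustrative sketch only, an exact (Clopper-Pearson) binomial interval for the overall error count, computed here with SciPy (the library choice is an assumption):

    from scipy.stats import beta

    def clopper_pearson(k: int, n: int, conf: float = 0.95):
        """Exact binomial confidence interval for k errors out of n tokens."""
        alpha = 1 - conf
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    # 58 erroneous tokens in the 3,280-token sample (the "Any" row above).
    lo, hi = clopper_pearson(58, 3280)
    print(f"error rate: {lo:.2%} - {hi:.2%}")          # about 1.3% - 2.3%
    print(f"precision:  {1 - hi:.2%} - {1 - lo:.2%}")  # close to 97.7% - 98.7%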
The main category was correctly allocated in 98.75% of cases. It is worth noting that the tagging of verbs was 100% correct on the sample. This confirms previous studies by our team, which reported over 99% correctness for this category (e.g. Valli & Véronis, 1999). At the other extreme, the worst categories were adverbs and conjunctions, but this is hardly unexpected in French. The matrix of confusion between categories is given in Table 7. One can see that most of the confusions occurred between (difficult) grammatical categories, for example:
• adverb vs. conjunction (si);
• adverb vs. preposition (avec);
• preposition vs. interjection/particle (voilà);
• pronouns vs. determiners (un, tout);
• relative pronoun vs. conjunction (que).

In a few cases, grammatical words were incorrectly tagged as a major category, mainly as a noun (e.g. pendant). Mistagging rarely occurred across the major categories (ADJ, NOM, VER). The only errors concerned the difficult distinction, in French, between adjectives and nouns, since many words can be both:

quelqu'un qu'on sent de passionné/NOM:COM => should be ADJ:QUA

Table 7. Matrix of confusion between main categories
[The matrix, over the categories ADV, CON, DET, INT, NUM, PRE, PRO, XXX, ADJ, NOM and VER, with 41 main-category errors in total, is not legibly reproducible here.]

In a few cases, the main category was right but the sub-category was wrong (Table 8). Half of these cases involved the determiner des, which can be either a preposition fused with a definite article, equivalent to de + les (therefore coded DET:DEF, see above), or an indefinite article (DET:IND). Respective examples:

il sort des grandes écoles
il mange des pommes

Three other errors involved a confusion between proper and common noun when a given form could be both, e.g. côtes/Côtes (as in Côtes de Beaune), salon/Salon (a city in Provence). As opposed to the case of the determiner, this error will be easy to rectify in future versions, since the capital letter, in spoken corpora, is a non-ambiguous clue marking proper names. The last case involved a confusion between present indicative and imperative (attends).

Table 8. Confusion matrix between subcategories
[8 errors in total: 4 involving des (DET:DEF vs. DET:IND), 3 involving proper vs. common noun, and 1 involving present indicative vs. imperative.]

Overall, an error in the tag occurred 49 times, either in the main category or in the subcategory, i.e. in 1.49% of cases. This corresponds to a precision of 98.51% on tags, a result consistent with previous evaluations conducted by our team: Valli & Véronis (1999) found a 97.9% precision in their experiment. The improvement is due to better dictionaries, better treatment of multi-word expressions and the handling of discourse particles (which were previously ignored). While most multiword expressions from the sample were processed correctly, a total of 8 cases were not. These involved either multiword units that were not recognized (e.g. d'après [quelqu'un], un petit peu, which inexplicably were not included in the dictionary), or words which should not have been pasted together (the latter being more frequent, with a total of 6 cases). For instance, discuter de quoi was tagged discuter/VER:INF + de_quoi/INT instead of being tagged as three separate words.

3.7.5.3 Portuguese: tool and evaluation

In order to re-use existing tools, the morphosyntactic annotation and the lemmatization of the corpus were performed as two different tasks.

The Morphosyntactic Annotation

The Portuguese team used Eric Brill's tagger (Brill 1993) (45), trained over a written Portuguese corpus of 250,000 words, morphosyntactically annotated and manually revised. The initial tagset for written-corpus morphosyntactic annotation covered the main POS categories (Noun, Verb, Adjective, etc.) and the secondary ones (tense, conjunction type, proper vs. common noun, variable vs. invariable pronouns, auxiliary vs. main verbs, etc.), but person, gender and number categories were not included.

Specific Tags for a Spoken Corpus

Due to phenomena characteristic of spoken language and to the specific transcription guidelines used in the C-ORAL-ROM project, it was necessary to adapt the tagset.
We implemented a post-tagger automatic process to account for the following cases:
(a) extra-linguistic elements; transcription: hhh; tag: EL;
(b) fragmented words or filled pauses; transcription: &(form); tag: FRAG;
(c) words and sequences impossible to transcribe; transcription: xxx, yyyy; tags: Pimp, Simp;
(d) paralinguistic elements, such as hum, hã, and onomatopoeias; tag: PL.

In the cases described in (a), (b) and (c), the adopted transcription conventions allowed for automatic tag identification and replacement through a post-tagger process. The same process was applied in the cases described in (d), since there is a predictable finite list of symbols representing paralinguistic elements. Onomatopoeias, however, needed manual revision.

Three other categories had to be added, but they did not allow for automatic post-tagging replacement, since they correspond to forms that also belong to classic categories:
(e) discourse markers, such as pá, portanto, pronto; tag: MD;
(f) discursive locutions, such as sei lá, estás a ver, quer dizer, quer-se dizer; tag: LD;
(g) non-classifiable forms, for words whose context does not allow an accurate classification; tag: SC.

For the cases (e) and (f), forms like pronto and não sei, for instance, are automatically tagged as pronto\ADJ and não\ADV sei\Vpi, and there is no automatic post-tagging procedure that can decide whether or not they constitute a discursive locution. These cases required manual revision (and frequent listening to the recorded sequence). The cases in (g) would be even more difficult (if not impossible) to tag or post-tag automatically. Because it also works on statistical rules, the tagger will always try to classify these words according to those rules (note that it was decided to tag all forms whenever possible, so as to avoid the SC tag, which rarely occurred).

Lemmatization of the Spoken Corpus

In order to accomplish this task, the Léxico Multifuncional Computorizado do Português Contemporâneo (henceforth LMCPC) (46) was used as the source for a lemmatization tool. The LMCPC is a frequency lexicon of 26,443 lemmas and 140,315 wordforms, with a minimum lemma frequency of 6, extracted from a 16,210,438-word corpus of contemporary Portuguese. Each lemma and its corresponding forms (including inflected forms and compounds) are followed by morphosyntactic and frequency information. The lemmas and wordforms are classified by main POS category, such as N (noun), V (verb), A (adjective), or others, namely F (foreign word), G (acronym/sigla), X (abbreviation).

45 http://www.cs.jhu.edu/~brill
46 The LMCPC (in English, Multifunctional Computational Lexicon of Contemporary Portuguese) is available via the internet at http://clul.ul.pt/english/sectores/projecto_lmcpc.html/

The lemmatization of the C-ORAL-ROM spoken corpus comprised two major tasks: the formatting of the LMCPC data and the construction of a tool to extract the lemma from the lexicon. Unfortunately, at the beginning of this process, we were not able to use the POS information present in the LMCPC to improve the lemma selection process. The lemmatization tool developed turned out to be very simple: it consists of a Perl script that extracts the lemma for each token of the corpus from the LMCPC data file. Each form of the corpus is searched for in the lexicon, and the corresponding lemma(s) is (are) found and placed next to the form.
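The Perl script is not reproduced in the document. The following sketch illustrates the same lookup logic in Python, under an invented two-column file format (the real LMCPC records are richer):

    def load_lexicon(path):
        """Read an assumed "form<TAB>lemma" file into a form -> lemmas map."""
        lexicon = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                form, lemma = line.rstrip("\n").split("\t")
                lexicon.setdefault(form, []).append(lemma)  # homographs: several lemmas
        return lexicon

    def lemmatize(tokens, lexicon):
        """Place the candidate lemma(s) next to each form."""
        out = []
        for form in tokens:
            lemmas = lexicon.get(form, ["-"])  # "-" marks a form without lemma
            out.append(form + "\\" + ",".join(lemmas))
        return out

    # e.g. lemmatize(["com", "este", "processo"], lexicon) could yield
    # ["com\\COM", "este\\ESTE", "processo\\PROCESSAR,PROCESSO"]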
In the case of multiword expressions, since the lemma is the entire set of elements, there was no correspondence between the desired result and the LMCPC data. Therefore, it was necessary to develop a tool to automatically compose the desired lemma format from a given list of locutions. The final format of the lemmatization of a locution is given below:

(1) o\O_QUAL\LPRON qual\O_QUAL\LPRON

Since it is possible for a wordform to be attributed several lemmas (the CORLEX corpus, for instance, has a proportion of homographic words of 34%), and owing to the problems concerning locutions (overlapping of words that can belong to different kinds of locutions (em cima\LADV vs. em cima de\LPREP), and the distinction between locutions and groupings of independent words), a manual lemma revision was strictly required, together with the revision of the tagging.

Specific lemmatization choices

It was decided that some categories would be left without lemma, namely Proper Nouns, Paralinguistic elements and Extra-linguistic elements. Lemmas are given in the masculine gender, as is common practice. However, some cases received both a masculine and a feminine lemma: Articles, Indefinites, Demonstratives, Possessives, Personal Pronouns, Clitics and Cardinal Numerals (os\O\ARTd, as\A\ARTd).

For some verbs, whenever their reflexive use implies a change in the semantic functions of their arguments, it was considered that the lemma includes the reflexive pronoun, which had to be added manually:

(2) A Ana lembrou\LEMBRAR\Vppi a mãe da sua consulta médica.
(3) A Ana lembrou-se\LEMBRAR_SE-SE\Vppi-CL de telefonar à mãe.
(4) a Ana não se\SE\CL lembrou\LEMBRAR_SE\Vppi de telefonar à mãe.

In Portuguese, whenever adverbs derived with the suffix -mente are coordinated, the first adverb loses the suffix, surfacing in its adjectival form. In these cases, despite the adjectival form, our option was to lemmatize and tag it as an adverb:

(5) pura\PURAMENTE\ADV e simplesmente\SIMPLESMENTE\ADV.

Effectiveness of the tagging and error rate

Considering the introduction of new categories, and despite the tagset size and the type of training corpus (written), the tagger achieved a success rate of 91.5%, excluding the tagging of MD and of any kind of locution (due to the problems explained above). Afterwards, the Portuguese team performed a manual revision of 231,540 words and attempted to retrain the tagger over a subset of this subcorpus (184,153 words). However, this training turned out to be ineffective: on an empirical appreciation, the errors tended to increase enormously (we did not establish a precise error rate for this task). Given the ineffectiveness of the training with the tagged spoken subcorpus, it was decided to annotate the remaining 87,052 words with the tool trained on the written corpus, the same one used before, while improving the post-tagger skills.

In the final calculation of the recognition rate of the tagger and the lemmatizer together, all kinds of locutions and discourse markers (MD) were considered. This lowered the previous recognition rate of 91.5% (where those tags were not considered) to 88%.

Given this unexpectedly and undesirably low success rate of the tagger, it became imperative to carefully observe the most typical errors made by the tool. Distinguishing tag errors from lemma errors, it was observed that errors regarding POS tagging occurred with a higher percentage (74.5% of the errors) than errors regarding lemmatization (64.8%); the two figures overlap, since about 40% of the errors concerned both tag and lemma.
Taking into account, firstly, the errors regarding the POS tag: as expected (since the tagger did not include this tag), a high percentage of the errors occurred in the annotation of discourse markers (18% of the total errors; 25% of tag errors; 2% of the corpus). The annotation of locutions also increased the error rate, since they represent 29% of the total errors (39% of tag errors; 4% of the corpus). It is important to underline once again that most of the locutions are discursive locutions, which are extremely frequent in oral discourse and particularly difficult to predict and, therefore, to tag automatically. Finally, a significant percentage of errors was also observed in the tagging of subcategories (tense and mood for verbs; common vs. proper for nouns; subordinative vs. coordinative for conjunctions): 14% of total errors; 10% of tag errors; 1% of the corpus. These three types of errors constitute 77.5% of the tagging errors.

Observing now the errors that occurred in the lemmatization process, we can conclude that the majority of the errors concern the lemmatization of locutions (38% of total errors; 58% of lemmatization errors; 5% of the corpus). A relevant percentage of errors regarding the gender of the lemma was also observed (11% of total errors; 16% of lemmatization errors; 1% of the corpus). These two types of errors constitute 74.5% of the lemmatization errors.

Taking all these facts into account, and bearing in mind that the errors regarding the tagging and lemmatization of discourse markers and locutions were the ones which constituted a particular problem for the tool (note that in the first evaluation, where these elements were not considered, we achieved a success rate of 91.5%), excluding these errors from the error rate would give a success rate of 96.8% for the lemmatizer and of 96.7% for the tagger.

It is still worth mentioning that, as is normal whenever there is human intervention, and since it is not possible to perform manual revisions indefinitely, some errors may have been introduced at this level: not only mistyping errors but also classification ones.

The remaining 87,052-word subcorpus was not exhaustively revised, but some manual revision was performed on:
– all the multiword expressions (locutions);
– all the past participles, since we distinguish the forms of compound tenses from the other uses of participles (which can be ambiguous with an adjectival form);
– all the que forms, since que can have several uses and a description of its use was foreseen;
– the most common discourse marker forms, since the tagger could not identify them as MD: assim, enfim, pá, pronto, bem, bom, então, exacto, percebe, percebes, pois, sabe, sabes, claro, olha, portanto, sim, digamos, olhe;
– non-lemmatized words, since, as described below, it was highly probable that a word which was not lemmatized (\-\) had received a wrong tag;
– all the clitic forms: forms like o, a, os, as can be a clitic, a definite article or (for a) a preposition; the form se can be a clitic or a conjunction;
– the forms a(s) and o(s), for the reasons given above;
– the form como, which can be a verb form, an adverb, a conjunction or an interrogative;
– the form até, which can be an adverb or a preposition.
As far as lemmatization is concerned, for the remaining 87,052 words we were able to improve the lemmatizer so as to cross the POS information of the annotated corpus with the POS information of the LMCPC. This means that, for ambiguous forms, the tool was now able to select the lemma corresponding to a given tag.

For instance, for an example like (6), the earlier tool would provide two lemmas for the word processo, which is homonymous between a noun and the first person singular of the present indicative of the verb processar (7).

(6) com este processo
(7) com este processo\PROCESSAR,PROCESSO\Nc

However, once the improved tool correctly tags the form as a noun, it is also able to choose the right lemma, as we can see in (8):

(8) com este processo\PROCESSO\Nc

In the final manual revision, this allowed us to check, with a high level of accuracy, both lemma and POS tagging: if a form did not receive a lemma, it would necessarily have been mistagged, as, for example, the word evoluem in (9), which received the tag of an adjective instead of the verb tag:

(9) e\E\CONJc também\TAMBÉM\ADV aqui\AQUI\ADV / as\A\ARTd coisas\COISA\Nc evoluem\-\ADJ //

Finally, the wordforms of the lemmas SER and IR were verified too, since they have homographic forms and, although the tagger was able to propose a lemma for them, it could have been wrong.

(10) O João foi\IR\Vppi ao cinema.
(11) O João foi\SER\Vppi bombeiro.

Some options of the Portuguese team

It is worth mentioning some options concerning the non-assignment of lemmas or tags for certain phenomena of spoken language:

a) wordforms without lemma but with a POS tag:

Some non-existing words that result from lapsus linguae still preserve a clear POS function (usually they result from the crossing of several words, and are corrected by the speaker):

(12) ou o lugal\-\Nc / um lugar\LUGAR\Nc no palco da vida //$ (pfammn02)
(13) o &feni / o &feni / o femininismo\-\Nc / o feminismo\FEMINISMO\Nc (pfamdl22)
(14) tivesse pordido\-\VPP / &eh / podido\PODER\VPP / &eh / pronunciar-se (pnatpd03)

Even when the speaker does not correct himself, it is possible to assign a tag to the wordform:

(15) ainda que toda a carga negativa desabate\-\Vpc sobre vocês //$ (pnatpr01) (47)

Speakers may also produce wordforms that do not obey normative use and, therefore, are not registered in Portuguese dictionaries:

(16) alguma função catársica\-\ADJ (48) (pnatla02)

However, when speakers produce wordforms that were deliberately created by them, these non-registered words do receive a lemma:

(17) estendeu / este princípio abandonatório\ABANDONATÓRIO\ADJ / a algumas outras regras da reforma fiscal (pnatps01)

There are cases where the context does not provide enough information to lemmatize forms ambiguous between the verbs ser and ir. In these cases, the forms receive a tag (it is possible to decide the tense and mood of the verb) but not a lemma.

(18) o Napoleão de / &tai foi\-\Vppi / teve / tentou / o grande império dele (pfamcv09)
(19) os sócios / &eh / pronto / foram\-\Vppi / fizemos o quarteto // (pmedin03)

47 Note that the word desabate does not exist in Portuguese. It seems to result from a crossing of desabar with abater (semantically related verbs).
48 Note that the correct form is catártica.
b) wordforms with a lemma but without a POS tag (\SC):

It may also happen that ambiguous wordforms clearly have a lemma but do not receive a POS tag, because the context does not allow us to classify them: the wordform a always features the lemma A, whether it is an article, a preposition or a clitic, and the wordform que always has the lemma QUE, whether it is a conjunction or a relative element.

(20) e até ficou a\A\SC / com a / com a / com a ponta do sapato (pfammn02)

3.7.5.4 Spanish: tool and evaluation

For the morphological analysis we have used GRAMPAL (Moreno 1991; Moreno and Goñi 1995), which is based on a rich morpheme lexicon of over 40,000 lexical units, plus morphological rules. This system has been successfully used in language engineering applications such as ARIES (Goñi, González and Moreno 1997) and also in linguistic description (Moreno and Goñi 2002). Originally, GRAMPAL was developed for analysing written texts; the present tagging has been the most useful test of its ability to deal with a wide-coverage corpus of Spanish. We used this application to enhance GRAMPAL with new modules: a POS tagger and an unknown-word recognizer, both specifically developed for spoken Spanish.

GRAMPAL is theoretically based on feature unification grammars and was originally implemented in Prolog. The system is reversible: the same set of rules and the same lexicon are used both for analysis and for generation of inflected wordforms. It is designed to allow only grammatical forms. In other words, the most salient feature of this model is its linguistic rigour, which avoids both over-acceptance and over-generation. The analysis provides a full set of morphosyntactic information in terms of features: lemma, POS, gender, number, tense, mood, etc.

In order to make it suitable for tagging the C-ORAL-ROM corpus, a number of developments were introduced into GRAMPAL, reported in Moreno & Guirao (2003):
1. a new tokenization for the spoken corpus;
2. a set of rules for derivative morphology.

Tokenization in spoken corpora is slightly different from the same task in written corpora. Neither sentence nor paragraph boundaries make sense in spontaneous speech. Instead, dialogue turns and prosodic tags are used for identifying utterance boundaries. For disambiguation, specific features of spoken corpora directly affect the tagger: repetition and retracting produce agrammatical sequences; subsentential fragments are frequent; word order is more relaxed. Finally, there are no punctuation marks. All these characteristics force an adaptation of a POS tagger typically trained on written texts. Fortunately, proper name recognition is not a problem for C-ORAL-ROM, since only names are transcribed with a capital letter; analysing them is consequently a trivial task.

On the lexical side, we detected two specific features with respect to written corpora: a low presence of new terms (i.e. the lexicon used by speakers in spontaneous conversations is mostly common and basic) and a high frequency of derivative prefixes and suffixes that do not change the syntactic category, because most of them are appreciative morphemes. In order to handle the recognition of derivatives, GRAMPAL has been extended with derivation rules. The Prefix rule is: take any Prefix and any (inflected) word and form another word with the same features. This rule is effective for POS tagging, since in Spanish prefixes never change the syntactic category of the base. The rule assigns the category feature to the unknown word. 239 prefixes have been added to the GRAMPAL lexicon.
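As an illustration of the Prefix rule, the following sketch derives the features of an unknown prefixed word from its base; the lexicon entry and the prefix list are tiny invented samples, not GRAMPAL data:

    # An unknown word made of a known prefix plus a known inflected form
    # inherits all the features of the base (prefixes never change the
    # syntactic category in Spanish).
    LEXICON = {"contento": {"cat": "ADJ", "gender": "m", "number": "s"}}
    PREFIXES = ["super", "re", "anti"]  # a sample of the 239 prefixes

    def analyze(word):
        if word in LEXICON:
            return LEXICON[word]
        for prefix in PREFIXES:
            if word.startswith(prefix):
                base = analyze(word[len(prefix):])
                if base is not None:
                    return dict(base)  # same features as the base word
        return None

    print(analyze("supercontento"))  # {'cat': 'ADJ', 'gender': 'm', 'number': 's'}
    print(analyze("xyz"))            # None: left to the statistical tagger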
POS disambiguation has been solved using a rule-based model: specifically, an extension of a Constraint Grammar using features in context-sensitive PS rules. The output of the tagger is a feature structure written in XML. The formalism allows several types of context-sensitive rules. First in application are the lexical ones, those for a particular ambiguous word:

"word" -> <cat="X"> / _ <cat="Y">
"word" -> <cat="Z"> / <cat="W"> _

where a given ambiguous word is assigned an X category before a word with a Y category, or the Z category after a word with a W category. Any kind of feature may be taken into account, not only the category. For instance, we can face the problem of two ambiguous verbs belonging to different lemmas:

"word" -> <lemma="L"> / _ <cat="X">
"word" -> <lemma="M"> / _ <cat="Y">

In addition to features, strings and punctuation marks can be specified in the right-hand side of a context-sensitive rule:

"word" -> <cat="X"> / string _
"word" -> <cat="Y"> / # _

where string is any token, and # is the symbol for the start or end of an utterance. If no lexical rule exists for a given ambiguity, then more general, syntactic rules are applied:

<cat="N">,<cat="V"> -> <cat="N"> / <cat="V"> _

where, if a given word is analysed with two different tags, one as a noun (N) and the other as a verb (V), the N reading is chosen if the word appears after a word with a V category. In short, these syntactic rules apply when there is no specific rule for the case (either because it is a new ambiguity not covered by the grammar, or because the grammar writer did not find a proper way to describe the ambiguity). This method benefits from the fact that the most frequent ambiguities of a given language are well known after training. As a consequence, many context-sensitive lexical rules can be written by hand or extracted automatically from the data.
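Read procedurally, the formalism above amounts to trying the word-specific rules first and falling back on the general ones. A minimal sketch follows; the rule set, the feature structures and the simplification that both rule types are always scanned are assumptions made for the example:

    # A lexical rule: (word, category required of the next token, category to pick).
    LEXICAL_RULES = [("como", "N", "PREP")]
    # A syntactic rule: (ambiguity set, category required of the previous token, pick).
    SYNTACTIC_RULES = [({"N", "V"}, "V", "N")]

    def disambiguate(tokens):
        """tokens: list of (word, [feature-dict analyses]); keeps one analysis each.
        For simplicity the next token's first analysis stands in for its category."""
        chosen = []
        for i, (word, analyses) in enumerate(tokens):
            pick = analyses[0]
            nxt = tokens[i + 1][1][0]["cat"] if i + 1 < len(tokens) else "#"
            prev = chosen[-1]["cat"] if chosen else "#"
            for rule_word, next_cat, cat in LEXICAL_RULES:
                if word == rule_word and nxt == next_cat:
                    pick = next((a for a in analyses if a["cat"] == cat), pick)
            for ambig, prev_cat, cat in SYNTACTIC_RULES:
                if {a["cat"] for a in analyses} == ambig and prev == prev_cat:
                    pick = next((a for a in analyses if a["cat"] == cat), pick)
            chosen.append(pick)
        return chosen

    tokens = [("yo", [{"cat": "P", "lemma": "YO"}]),
              ("quiero", [{"cat": "V", "lemma": "QUERER"}]),
              ("vino", [{"cat": "N", "lemma": "VINO"}, {"cat": "V", "lemma": "VENIR"}])]
    print([a["lemma"] for a in disambiguate(tokens)])  # ['YO', 'QUERER', 'VINO']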
Figures on the tagger's performance are given in the Evaluation section below. Finally, those words which did not undergo disambiguation are treated with TnT (Brants 2000), which assigns a POS following a statistical model obtained from a 50,000-word training corpus. TnT has been shown to be the most precise statistical tagger (Bigert, Knutsson and Sjöbergh 2003).

Electronic vocabulary. The GRAMPAL lexicon is a collection of allomorphs for both stems and endings. New additions can be easily incorporated, since every possibility of Spanish inflection has been classified into a particular class. In an experiment reported by Moreno & Guirao (2003), 8% of the corpus consisted of words unknown to the system, namely:
1. foreign words: walkman, parking;
2. words missing from the lexicon, typically from the spoken language: caramba, hijoputa;
3. errors in transcription;
4. neologisms, mostly derivatives.

Rules for handling derivative morphology were shown above. For the remaining three classes of unknown words, a simple approach is adopted:
a. foreign words are included in a list, updated regularly;
b. any word in the corpus but not in the lexicon is added, expanding the base resource;
c. errors in the source texts are corrected, and then analysed by the tool.

As a summary, the tagging procedure, consisting of seven parts, is described below:
1. Unknown words detection: once the tokenizer has segmented the transcription into tokens, a quick look-up for unknown words is run. The new words detected are added to the lexicon.
2. Lexical pre-processing: the program splits portmanteau words ("al", "del" -> "a" "el", "de" "el") and verbs with clitics ("damelo" -> "da" "me" "lo").
3. Multi-word recognition: the text is scanned for multi-word candidates. A lexicon, compiled from printed dictionaries and corpora, is used for this task.
4. Single-word recognition: every single token is scanned for every possible analysis according to the morphological rules and the lexicon entries. Approximately 30% of the tokens are given more than one analysis, and some of them are assigned up to 5 different analyses.
5. Unknown-word recognition: the remaining tokens that are not considered new words pass through the derivative morphology rules. If some tokens still remain without any analysis (because they were neither included in the lexicon nor recognised by the derivative rules), they wait for the statistical processing, where the most probable tag, according to the surrounding context, is given.
6. Disambiguation phase 1: a feature-based Constraint Grammar solves some of the ambiguities.
7. Disambiguation phase 2: a statistical tagger (the TnT tagger) solves the remaining ambiguous analyses.

Guirao & Moreno-Sandoval (2004) describe a tool developed to help human annotators revise and correct the tagged corpus.

Evaluation

The total number of units (assuming that single words and multiword units each count as one unit) in the hand-annotated test corpus (49) is 44,144. The test corpus was developed using a combined procedure of automatic and human tagging:
1. A fragment of approximately 50,000 words (15% of the corpus) was selected, taken from the different sections and intended to be a representative sample of the whole. Each word in the 27 texts was tagged with all possible analyses.
2. Each file was revised by a linguist, who selected the correct tag for every case, discarding the wrong ones.
3. From the revised corpus, a set of disambiguation rules was written to handle the most frequent cases.
4. A new run of the tagger, augmented with the disambiguation grammar, provided an automatically tagged corpus with only one tag per unit.
5. The automatic and human tagged corpora were compared, and the differences were examined one by one, assuming that agreement on the same tag implied a correct analysis. In most cases the wrong tag had been assigned by the tagger, but in several cases the linguist had written an incorrect tag; such mistakes are probably due to loss of attention in a repetitive task. After assigning the proper tag in all the disagreements, a final version of the test corpus was delivered.

Both the disambiguation grammar and the statistical tagger have been trained against the test corpus. Finally, the rest of the corpus, over 250,000 words, was tagged as described above.

49 The files from the C-ORAL-ROM corpus annotated by hand are the following: efamcv03; efamcv06; efamcv07; efamdl03; efamdl04; efamdl05; efamdl10; efammn05; emedin01; emdmt01; emednw01; emedrp01_1; emedsc04; emedsp01; emedts01; emedts05; emedts07; emedts09; enatco03; enatpd01; enatte01; epubcv01; epubcv02; epubdl07; epubdl13; epubmn01; etelef01.

In order to evaluate the tagger's performance (including disambiguation), a new run of GRAMPAL against the test corpus was conducted. The mismatches between the GRAMPAL-tagged corpus and the test corpus, which works as a gold standard for the evaluation, were counted (the figures are shown in Table 4).
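A minimal sketch of this comparison, assuming parallel lists of gold-standard and automatic tags (with None standing for a unit the tagger left untagged):

    def evaluate(gold, auto):
        """Token-by-token comparison of automatic tags with the gold standard."""
        tagged = sum(1 for t in auto if t is not None)
        correct = sum(1 for g, a in zip(gold, auto) if a is not None and g == a)
        recall = tagged / len(gold)                       # tagged units / total units
        precision = correct / tagged if tagged else 0.0   # correct tags / tagged units
        return precision, recall

    # With the counts reported below (42,206 correct tags out of 44,144 units,
    # all of them tagged): precision = 42206/44144 = 95.61%; recall = 100%.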
The precision rate is calculated as the number of correct tags assigned by the tagger divided by the total number of tagged units in the test corpus. In other words, 42,206 tags out of 44,144 were assigned correctly. No evaluation of the precision has been performed on the rest of the corpus, but a rate similar to the one obtained on the test corpus (95.61%) can be assumed. With respect to the recall, understood as the ratio between the number of units tagged by the program and the total number of units, the figure for the whole corpus is 99.96%: only 117 tokens were not given a tag by the program.

Table 4. Test corpus and whole corpus evaluation

                Number of units   Number of tagged units   Recall    Precision
Test Corpus          44,144              44,144            100%       95.61%
Whole Corpus        313,504             313,387            99.96%   ≅ 95.61%

With respect to the evaluation of the POS tagging, it is important to stress that only a subset of around 50,000 transcribed words was revised by hand, resulting in over 44,000 tagged units. The rest of the tagged corpus has not been revised by human annotators. This fact has some consequences for the list of forms and lemmas. When two or more tags are available, the tagger always assigns the tag carrying the information shared between the candidates. For instance, many verb forms are ambiguous between the first and third person singular: (yo) cante / (él, ella) cante (I sing vs. he/she sings), whose tags are Vp1s and Vp3s respectively. When the context cannot solve the ambiguity (by means of the pronoun), the tag assigned is V, compatible with both. The human annotators, however, can normally resolve the ambiguity while revising the tagging; in that case, the appropriate full tag is provided. As a result, different tags for the same word can be found in the list of lemmas and forms.

4. XML Format of the textual resource

4.1. Macro for the translation of C-ORAL-ROM .txt files to XML

The C-ORAL-ROM Macro is a Perl program with two functions:
- first, validating the C-ORAL-ROM format, that is, checking that all the texts in the corpus have the same format and that no typing errors have been made;
- second, generating XML files for the texts in the corpus. (50)

4.1.1. Checking the C-ORAL-ROM format

The C-ORAL-ROM format consists of a set of textual conventions. These conventions introduce the information which surrounds the recording, that is to say, everything which is not transcribed but is relevant: information on the participants, on the situation, and on non-linguistic features that appear in the recording. In what has been called the "header" there are fifteen different fields which contain information about the participants, the date, the place, the situation, the kind of text, the topic, the transcribers or the length of the transcription. These fields all have the same format: an "@" followed by the name of the field and a ":". Some fields have a closed format, which means they can only take a fixed set of values, as occurs for example in the Participants field: this field must include three capital letters identifying the speaker, the name of the speaker and then, between brackets and in this order, the sex (man/woman), age (A, B, C or D), education (1, 2, 3), profession, role and origin of the speaker. The age section cannot include anything which is not a capital A, B, C or D: the C-ORAL-ROM Macro enables the checking of this in all texts.

50 This program has been written by the Computational Linguistics Laboratory at Universidad Autónoma de Madrid.
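A minimal sketch of this kind of closed-format check on the Participants field (the regular expression is an illustrative approximation of the convention, not the Macro's actual Perl code):

    import re

    # @Participants: XXX, Name, (sex, age, education, occupation, role, origin)
    PARTICIPANT = re.compile(
        r"^@Participants:\s+[A-Z]{3},\s+\S+,\s+"
        r"\((man|woman),\s*[ABCD],\s*[123],\s*[^,]+,\s*[^,]+,\s*[^)]+\)"
    )

    line = "@Participants: PAT, Patricia, (woman, B, 2, hairdresser, participant, Madrid)"
    if not PARTICIPANT.match(line):
        print("Error in the Participants field")  # e.g. an F in the Age section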
The same checks apply to the signs representing different features inside the transcription. For example, the sign for overlapping is [<] <overlapped text>. The C-ORAL-ROM Macro will check, for instance, that there is no space between the < and the overlapped text.

4.1.2. Generating XML files

Once all the texts have been checked against the C-ORAL-ROM format, the program generates an XML file for each text. A typical XML file of a C-ORAL-ROM transcription consists of an initial and an ending tag which enclose the whole text (called <transcription>) and two sub-sections: the header (marked by the tag <header>) and the transcribed text (marked by <text>). In the header, each field has its corresponding tag, replacing the name of the field following the @, and each piece of information has its own tag. Let's look at an example:

@Participants: PAT, Patricia, (woman, B, 2, hairdresser, participant, Madrid)

will become, in XML:

<Participants>
  <Speaker>
    <Name>PAT, Patricia</Name>
    <Sex Type="woman"/>
    <Age Type="B"/>
    <Education Type="2"/>
    <Occupation>hairdresser</Occupation>
    <Role>participant</Role>
    <Origin>Madrid</Origin>
  </Speaker>
</Participants>

As for the symbols included in the text, an XML tag replaces them, as in the example:

*ROS: con los [/] con el walkman de / Chechu // #

will become

<Turn>
  <Name>ROS</Name>
  <Says>con los <Tone_Unit Type="partial_restart"/> con el walkman de <Tone_Unit Type="standard"/> Chechu <Utterance Type="enunciation"/> <Pause/></Says>
</Turn>

where *ROS: means that the speaker whose name is ROS has started a new turn and says what follows; [/] signals a tone unit boundary with a partial retracting by the speaker; / signals a standard tone unit; // signals the end of an utterance; and # signals a pause. All this information is expressed in XML by the different tags.

4.2. Running the program

4.2.1. Requirements and procedure

The C-ORAL-ROM Macro requires a Perl distribution, which can work both in Unix and in Windows. To run the program, all the texts to be checked and converted have to be in the same directory as the program; then we type

perl xml-dtd_coralrom.pl <file_name>

where <file_name> is the text to process, e.g. efamdl01.txt (in its place we could put any text we want to check and convert). The program will then generate a file with the same name but with an .xml extension.

4.2.2. Rectifying errors

While running the program, the following lines will appear every time we have a well-formed text:

Preprocessing Document...
Processing Document Main Structure...
Processing Header Data Fields...
Processing Text Data Fields...
Turn Preprocessing Done...
Processing Turn Structure...
Processing InTurn Data Structure...
File correct: efamdl01.txt --> Output file: efamdl01.xml

If the program finds some kind of mistake in the text, it will show an error message with information on the place where the mistake is. For example, if we write "lenght" instead of "length", the program will tell us that there is a mistake in the Length field; or, if we write F in the age field, the program will tell us there is a mistake in the Age section, as only A, B, C and D are allowed. In the case of mistakes inside the transcribed text, the program will show the text surrounding the mistake. To rectify the mistake, we go to the transcribed text (the .txt file), correct it and run the program again to see whether it detects new mistakes or generates the XML file (the program will not generate an XML file until the text contains no mistakes).
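The Macro itself is distributed with the corpus; as a simplified illustration of the symbol-to-tag mapping of 4.1.2 (covering only the four symbols of the example above, not the full set of conventions):

    import re

    # Transcription symbol -> XML rendering, in a replacement order that keeps
    # "//" from being misread as a standard tone unit "/".
    SYMBOLS = [(" [/] ", ' <Tone_Unit Type="partial_restart"/> '),
               (" //", ' <Utterance Type="enunciation"/>'),
               (" / ", ' <Tone_Unit Type="standard"/> '),
               (" #", " <Pause/>")]

    def turn_to_xml(line):
        """Convert one '*XYZ: ...' turn line into its XML rendering."""
        m = re.match(r"\*([A-Z]{3}):\s*(.*)", line)
        name, says = m.group(1), m.group(2)
        for symbol, tag in SYMBOLS:
            says = says.replace(symbol, tag)
        return "<Turn><Name>" + name + "</Name><Says>" + says + "</Says></Turn>"

    print(turn_to_xml("*ROS: con los [/] con el walkman de / Chechu // #"))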
4.3. C-ORAL-ROM DTD

<!-- Version 3.0 DATE: 01 Sept 2004 -->
<!-- Declaration of the XML document type -->
<!-- Declaration of external DTD document. Address contains the associated XML file -->

<!ELEMENT Transcription (Header, Text)>

<!-- Start of Header relevant tags declaration -->
<!ELEMENT Header (Title, File, Participants, Date, Place, Situation, Topic, Source, Class+, Length, Words, Acoustic_quality, Transcribers, Revisors, Comments)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT File (#PCDATA)>
<!ELEMENT Participants (Speaker+)>
<!ELEMENT Speaker (ShortName, FullName, Sex, Age, Education, Occupation, Role, Origin)>
<!ELEMENT ShortName (#PCDATA)>
<!ELEMENT FullName (#PCDATA)>
<!ELEMENT Sex (#PCDATA)>
<!ELEMENT Age (#PCDATA)>
<!ELEMENT Education (#PCDATA)>
<!ELEMENT Occupation (#PCDATA)>
<!ELEMENT Role (#PCDATA)>
<!ELEMENT Origin (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Place (#PCDATA)>
<!ELEMENT Situation (#PCDATA)>
<!ELEMENT Topic (#PCDATA)>
<!ELEMENT Source (#PCDATA)>
<!ELEMENT Length (#PCDATA)>
<!ELEMENT Words (#PCDATA)>
<!ELEMENT Transcribers (Transcriber+)>
<!ELEMENT Transcriber (#PCDATA)>
<!ELEMENT Revisors (Revisor+)>
<!ELEMENT Revisor (#PCDATA)>
<!ELEMENT Acoustic_quality EMPTY>
<!-- Other, more meaningful names can be used for each SubClass -->
<!ELEMENT Class (#PCDATA)>
<!ELEMENT Comments (#PCDATA | Notes)*>

<!-- End of Header definitions. Start of Text tags definitions -->
<!-- Text is defined as one or more turns of DIFFERENT persons speaking -->
<!ELEMENT Text (Turn+)>
<!ELEMENT Turn (Name, Says, Notes*)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Says (#PCDATA | Tone_Unit | Continues | Utterance | Support | Non_Linguistic | Unintelligible | Fragment | Pause | Overlap | Backwards | Forwards | Interjection | Notes)*>

<!-- Tonal unit tags -->
<!ELEMENT Tone_Unit (#PCDATA)>      <!-- Limits each tonal unit -->
<!ELEMENT Continues (#PCDATA)>      <!-- If turn continues after overlapping -->
<!ELEMENT Utterance (#PCDATA)>      <!-- Limits each utterance -->

<!-- Interjection tags -->
<!ELEMENT Support (#PCDATA)>        <!-- Syllabic support possible -->
<!ELEMENT Non_Linguistic (#PCDATA)> <!-- Non-linguistic sounds -->

<!-- Reconstruction tags -->
<!ELEMENT Unintelligible (#PCDATA)> <!-- When it is unknown what has been said -->
<!ELEMENT Fragment (#PCDATA)>       <!-- Literally written -->

<!-- Pause tags -->
<!ELEMENT Pause (#PCDATA)>          <!-- Means a noticeable pause, not a suspension nor a restart -->
<!-- Overlapping tags -->
<!ELEMENT Overlap (#PCDATA | Tone_Unit | Continues | Utterance | Support | Non_Linguistic | Unintelligible | Fragment | Pause | Interjection | Notes)*>
<!-- Overlapping direction -->
<!ELEMENT Backwards (#PCDATA)>
<!ELEMENT Forwards (#PCDATA)>
<!-- Interjections -->
<!ELEMENT Interjection (#PCDATA)> <!-- Syllabic support possible -->
<!-- Comments -->
<!ELEMENT Notes (#PCDATA | Tone_Unit | Continues | Utterance | Support | Non_Linguistic | Unintelligible | Fragment | Pause | Interjection)*>
<!-- Attributes definitions -->
<!-- Notes -->
<!ATTLIST Notes Type CDATA #REQUIRED>
<!-- Tone Unit -->
<!ATTLIST Tone_Unit Type (standard | partial_restart | total_restart) #REQUIRED>
<!-- Utterance -->
<!ATTLIST Utterance Type (enunciation | interrogation | suspension | interruption) #REQUIRED>
<!ATTLIST Sex Type (man | woman | x) #REQUIRED>
<!ATTLIST Age Type (A | B | C | D | x | X) #REQUIRED>
<!ATTLIST Education Type (1 | 2 | 3 | x | X) #REQUIRED>
<!ATTLIST Class
Type1 (informal | formal) #REQUIRED
Type2 (family_private | public | formal_in_natural_context | media | telephone) #REQUIRED
Type3 (conversation | monologue | dialogue | political_speech | political_debate | preaching | teaching | professional_explanation | conference | business | law | news | sport | interviews | meteo | scientific_press | reportage | talk_show | private_conversation | human-machine_interactions) #REQUIRED
Type4 (political_debate | thematic_discussions | culture | science | man_interaction | machine_interaction | conversation | monologue | dialogue) #IMPLIED
Type5 (turism | health | meteo | traffic | train | restaurants) #IMPLIED>
<!ATTLIST Acoustic_quality Type (A | B | C) #REQUIRED>
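The generated XML files can be checked mechanically against this DTD. The following is a minimal sketch using the third-party Python library lxml; the file names coralrom.dtd and efamdl01.xml are only illustrative, and this validation step is not part of the distributed tools:

    from lxml import etree

    # Validate a generated C-ORAL-ROM XML file against the DTD of section 4.3.
    dtd = etree.DTD(open("coralrom.dtd", "rb"))
    tree = etree.parse("efamdl01.xml")

    if dtd.validate(tree):
        print("efamdl01.xml conforms to the DTD")
    else:
        # each violation is reported together with its line number
        for error in dtd.error_log.filter_from_errors():
            print(error)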
5. Bibliographical references

Austin, J.L. 1962. How to do things with words. Oxford: Oxford University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. 1999. The Longman grammar of spoken and written English. London: Longman.
Bigert, J., Knutsson, O. and Sjöbergh, J. 2003. "Automatic Evaluation of Robustness and Degradation in Tagging and Parsing". In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2003), September 2003, Borovets, Bulgaria, 51-57.
Brants, T. 2000. "TnT. A statistical part-of-speech tagger". In Proceedings of the 6th Applied NLP Conference (ANLP-2000), April 29-May 3, 2000, Seattle, WA.
Brill, E. 1994. "Some advances in transformation-based part-of-speech tagging". In Proceedings of the third International Workshop on Parsing Technologies, Tilburg, The Netherlands.
Calzolari, N., Ceccotti, M.L. and Roventini, A. 1983. "Documentazione sui tre nastri contenenti il DMI". Technical Report. Pisa: ILC-CNR.
CHAT http://childes.psy.cmu.edu/manuals/CHAT.pdf
Cresti, E. 1994. "Information and intonational patterning in Italian". In Accent, intonation, et modèles phonologiques, B. Ferguson, H. Gezundhajt and Ph. Martin (eds.), 99-140. Toronto: Editions Mélodie.
Cresti, E. 2000. Corpus di italiano parlato, voll. I-II, CD-ROM. Firenze: Accademia della Crusca.
Crystal, D. 1975. The English tone of voice. London: Edward Arnold.
De Mauro, T. et alii 1993. Lessico di frequenza dell'italiano parlato. Milano: ETAS libri.
Goñi, J.M., González, J.C. and Moreno, A. 1997. "ARIES: A lexical platform for engineering Spanish processing tools". Natural Language Engineering 3(4): 317-345.
Guirao, J.M. and Moreno-Sandoval, A. 2004. "A 'toolbox' for tagging the Spanish C-ORAL-ROM corpus". In Proceedings of the Workshop "Compiling and Processing Spoken Language Corpora", LREC-2004, Lisbon.
IMDI http://www.mpi.nl/IMDI/
Karcevsky, S. 1931. "Sur la phonologie de la phrase". In Travaux du Cercle linguistique de Prague IV, 188-228.
MacWhinney, B. 1994. The CHILDES project: tools for analyzing talk. Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Miller, J. and Weinert, R. 1998. Spontaneous Spoken Language. Oxford: Clarendon Press.
Moreno, A. 1991. Un modelo computacional basado en la unificación para el análisis y generación de la morfología del español. PhD thesis, Universidad Autónoma de Madrid.
Moreno, A. and Goñi, J.M. 1995. "A morphological model and processor for Spanish implemented in Prolog". In Proceedings of the Joint Conference on Declarative Programming (GULP-PRODE 95), Marina di Vietri, Italy, September 1995, M. Alpuente and M.I. Sessa (eds.), 321-331.
Moreno, A. and Goñi, J.M. 2002. "Spanish Inflectional Morphology in DATR". Journal of Logic, Language and Information 11: 79-105.
Moreno, A. and Guirao, J.M. 2003. "Tagging a spontaneous speech corpus of Spanish". In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2003), September 2003, Borovets, Bulgaria, 292-296.
Picchi, E. 1994. "Statistical Tools for Corpus Analysis: A Tagger and Lemmatizer of Italian". In Proceedings of EURALEX 1994, W. Martin et alii (eds.), 501-510. Amsterdam.
PiSystem http://www.ilc.cnr.it/pisystem/
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. 1985. A comprehensive grammar of the English language. London: Longman.
't Hart, J., Collier, R. and Cohen, A. 1990. A perceptual study of intonation. An experimental approach to speech melody. Cambridge: Cambridge University Press.
Valli, A. and Véronis, J. 1999. "Étiquetage grammatical de corpus oraux: problèmes et perspectives". Revue Française de Linguistique Appliquée IV(2): 113-133.
Zampolli, A. and Ferrari, G. 1979. "Il dizionario di macchina dell'italiano". In Linguaggi e formalizzazioni. Atti del convegno internazionale di studi della SLI, D. Gambarara and F. Lo Piparo (eds.), 683-707. Roma: Bulzoni.

APPENDIXES

APPENDIX 1 - TYPICAL EXAMPLES OF PROSODIC BREAK TYPES IN ITALIAN, FRENCH, PORTUGUESE AND SPANISH 51

Examples of strings with typical non-terminal breaks, generic terminal breaks and interrogative breaks (highlighted in the examples).

Italian
*ALE: [1] eh / io invece / mi sono preso una chitarra //$
*IDA: [2] ah / la chitarra //$
*ALE: [3] acustica //$ [4] bella //$
*IDA: [5] e l' obiettivo ?$
(ifamdl18_double slash2)

Portuguese
*JOS: esteve na baixa / recentemente ?$
*NOE: não //$ há muito tempo que não vou //$ acho que está muito gira //$ não é //$
*JOS: está a ficar muito bonita //$
*NOE: estive no Chiado / há pouco tempo //$
*JOS: hum <hum> //$
(pfamdl14_doubleslash3)

French
*EMA: [1] ça c' est clair // #$ [2] de plus en plus // #$
*JUL: [3] quels sont vos rapports avec les clients en général ?
#$ *EMA: [4] très bons // #$ [5] sur la base / très bons // #$ (fpubdl03_double&single slash) Spanish *MIG: [<] < [1]¡joder!>$ // [2] te quejarás de tu hijo / <eh> ?$ *EST: [<] [3]<nada> / nada //$ [4] muy bien todo //$ [5]<fenomenal> //$ *MIG: < [6]¡joder!>$ (efamdl39_double slash1) Examples of strings with typical intentional suspension Italian *MAR: [1]<allora è &propr > +$ *ROS: / [2] come un fuso //$ [3] icché vo' fare?$ [4] quando uno l' è ignorante ...$ *MAR: [5] eh / sie / <&ignoran> +$ (ifamdl07_suspension1) Portuguese *AMA: / [1] com aquela sensação de que eles vão ficar / naquela / &eh / fechados naquele / naquela cerca //$ *RAQ: [2] pois //$ *AMA: / [3] que é ...$ *RAQ: [4] eléctrica //$ [5] depois de apanharem <os choque todos> //$ *AMA: [6]<e que é horrível> //$ (Pfamcv10_suspension_2) French *SYL: [1] avant de quoi ? #$ *CHR: [2] ben de [/] de [/] de sortir ensemble //$ [3] enfin du temps / #$ [4] non // pas tant que ça // mais finalement ...$ *SYL: [5] j' ai commencé la formation le vingt-six novembre //$ [6] je m' en souviens // #$ (ffamdl03_ suspension1) Spanish 51 The Multimedia files corresponding to the examples are available in the directory Utilities of the DVD9 CORALROM_AP 80 *FER: [1] claro //$ *EVA: [2]¡ah! //$ [3] claro / digo si es tu primo / y no tienes / pueblo ...$ [4] claro //$ [5] hhh //$ *FER: [6] y tú qué hiciste / Jose ?$ (epubcv02_suspension 3) Examples of strings with typical interruption Italian *SIM: # [1]&he / però # sarai controllato //$ [2] e dovrai spiegare / &he / qualsiasi natura dei tuoi +$ *ROD: [3] esperimenti //$ *SIM: [4] dei tuoi esperimenti //$ (ifamcv07_listener interruption) *MAR: [1] ma no / perché lei / vuole una storia seria //$ [2] è perché forse / non c' è stata +$ [3] dico / tu / che perdi il capo / che +$ [4] perché / a lui / gli piacciono i soldi / Ida //$ [5] questo / io lo so //$ [6] e / anche lui / dice / questo è vero //$ *IDA: [7] lo ammette //$ (ifamdl20_interruption3) Portuguese *ACL: [1] também //$ *JOS: [2] olhe / como é que se chama aquela +$ *ACD: / [3] ou um albardão //$ (pnatpe01_listener interruption3) *LUC: [1]<claro> //$ *GRA: / [2] agora / para mim +$ [3] preciso ver que eu tenho setenta e um <anos> //$ *LUC: [4] pois //$ (pfamcv03_interruption) French *VAL: [1] la même chose que l' amour passionné / #$ *CHA: [2]<mais>$ *VAL: [3]<enfin> //$ *CHA: [4] si // c' est le même à la base // #$ [5] mais il y a comme une +$ *ALE: [6] il a mûri // #$ (ffamcv01_listener interruption 1) *SAN: [1] et donc enfin toi tu es de Caen / je crois #$ *EDO: [2] ah oui oui oui / de + ben mon frère aussi // #$ *SAN: [3] vous êtes$ [5]<de Caen> ?$ *EDO: [4]<on est nés>$ (ffamdl02_interruption1) Spanish *CRI: [1] me dijo / &eh / el Ministerio de Sanidad en Madrid / fue lo que +$ *LUI: [2] lo que os dijo //$ *CRI: [3] lo <que>+$ *ALB: [<] [4]<eso> era una / excursión organizada ?$ (efamcv15_listener interruption1) *CRI: [<] < [1]¡ah! 
/ claro> //$ [2] me llamaste desde el +$ [3] es que me [/] se fue a la sierra / y me llamó desde la sierra //$ (efamdl08_interruption3) 81 Examples of strings with typical retracting Italian *MAX: [1] una tristezza //$ [2] ma proprio una cosa / <&f> +$ *CLA: [3]<ah> //$ [4] io l' ho trovato uno dei posti più belli / che si [/] che [/] che [///] in Italia //$ *GAB: [5]<Matera> //$ (ifamcv17_simple retracting) French *PER: [2]<ouais après>$ *JEA: [1]<tout est> +$ *PER: [3] la [/] la petite [/] le petit tour / c' était à Casablanca //$ [4] tu sais / la petite visite guidée / qui se termine à la boutique des souvenirs //$ < [6]&euh>$ *STE: [5]<ah>$ (ffamcv10_simple retracting1) Spanish *PIE: / [1] que está perdiendo audiencia ya [/] audiencia ya / <el programa> //$ *PAZ: [<] [2]<no creas / tía> //$ [3] era [/] era el líder de audiencia / en [/] en Canarias / tía / y por ahí //$ *PIE: [4] sí //$ (efamdl33_simple retracting) Portuguese *ANT: [1] é evidente //$ *FER: / [2] o / o touro / realmente / estava a / estava a dar cabo do cavalo //$ [3] agora / <não é uma pessoa &iso> +$ (pfamdl07_simple retracting) Examples of retracting/interruption ambiguity Italian *GUI: [1] fanno una cooperativa / per poter lavorare / perché / un camion / quando parte / deve essere [///] abbia una buona assicurazione / prima di tutto //$ (ifammn22_retracting-interruption3) French *ALE: [1]<cinq ans toi> ?$ *CHA: [2]<moi deux ans> //$ [3] et une fois trois ans //$ [4] enfin là /$ *ALE: [5]<bon donc> /$ *CHA: [6]<trois ans>$ *ALE: [7] tout le monde a vécu [/] a eu l' expérience quoi // #$ [8] et$ *CHA: [9] yes // #$ *ALE: [10]<derrière> ?$ (ffamcv01_retracting-interruption_1) Spanish *NAN: < [1]¡joder! //$ [2] y qué pasa en el &tea> [/] y qué pasa en el teatro romano / que veía / cómo las [/] <los animales / se comían a los cristianos> //$ *BEC: [3]<hhh //$ [4] afortunadamente +$ [5] pero por eso la &evo> [/] la [///] hemos evolucionado //$ *NAN: [6] pero tú crees que la [/] que / a lo largo de la historia / <la gente evoluciona> ?$ *BEC: < [7]&es [/] pues espero que sí> //$ (efamdl18_retracting-interruption2) 82 APPENDIX 2 - TAGSETS USED FOR POS TAGGING IN THE FOUR LANGUAGE COLLECTIONS. DETAILED TABLES AND COMPARISON TABLE French tagset Table 1. French PoS tagset. 
POS Verbs Sub-type Conditional, present Imperative, present Indicative, future Indicative, imperfect Indicative, past Indicative, present Participle, past Participle, present Subjunctive, imperfect Subjunctive, present Common Proper Ordinal Qualifying Nouns Adjectives Adverbs Prepositions Conjunctions Coordination Subordination Definite Demonstrative Indefinite Interrogative Possessive Demonstrative Indefinite Personal Possessive Relative/interrogative Determiners Pronouns Numerals Interjections discourse particles Uncategorisable and Foreign word Euphonic particle Title TAG VER:CON:PRE VER:IMP:PRE VER:IND:FUT VER:IND:IMP VER:IND:PAS VER:IND:PRE VER:INF VER:PAR:PAS VER:PAR:PRE VER:SUB:IMP VER:SUB:PRE NOM:COM NOM:PRO ADJ:ORD ADJ:QUA ADV PRE CON:COO CON:SUB DET:DEF DET:DEM DET:IND DET:INT DET:POS PRO:DEM PRO:IND PRO:PER PRO:POS PRO:RIN NUM INT EXAMPLES aurait, serait, dirais attends, allez, écoutez sera aura fera seront pourra faudra était avait faisait fallait disait allait fut, vint est a ai sont va ont peut fait sais suis Infinitive fait dit été eu vu pris pu mis étant disant maintenant faisant ayant fût, vînt soit ait puisse fasse aille heure, temps, travail, langue France, Marseille, Freud, Roosevelt premier, deuxième, troisième petit, grand, vrai ne, pas, oui, alors, très, pratiquement de, à, pour, dans, sur et, ou, et, mais que, parce que, comme, quand, si le, la, les ce, cette, ces une, une, tout, quelques, plusieurs quel mon, ma, ton, ta ce, ça, celui, cela, ceci un, une, tout, rien, quelqu'un je, tu, il, elle, y, en, se mien qui, que, où, quoi, dont, laquelle deux, trois, mille, cent ben, bon, hein, mh, ah XXX:ETR XXX:EUP XXX:TIT check up, Eine Sache -t-, l' "Tapas_Café", "With_Full_Force", "Retour_des_Vampires", "Fables_de_La_Fontaine" 83 Italian Tag set Table 2.1 General structure of the Italian Tag set. ROOT classification Secondary classification Elements classified standard compositional (PoS tagset) non-compositional non-standard compositional (NS tagset) non-compositional PoS interjection (according to tradition, within PoS) foreign and new forms onomatopoeia, language acquisition forms fragm. words, phonetic supports, pause fillings coughs and laughs linguistic elements non-linguistic elements (NL tagset) para-linguistic extra-linguistic Table 2.2 Italian PoS tagset. POS Sub-Type TAG Definition Verb main V semantic reference to a class of events; main morphological features: mood, tense, (person, number, gender) non main VW grammatical verbs (auxiliaries and copula)* + enclitic V_E verb form which includes a clitic common S semantic reference to a class of morphological features: gender and number proper SP direct reference, absence of determiners Massimo\SP Adjective A semantic reference to a quality or property; morphological features: gender and number grande\A Adverb B modifier of the predicative adjectives); no inflection non\B simple E adds semantic and syntactic features to a NP; no inflection di\E fused E_R preposition fused with an article del\E_R coord. CC establishes syntactic relationships among phrasal constituents; no inflection e\CC subord. 
CS establishes syntactic relationships among sentences; no inflection perché\CS Article R adds the feature [±definite] to a NP; morphological features: gender and number il\R Demonstrative DEM see below* questo\DEM Indefinite IND see below* molti\IND Personal PER see below* io\PER Possessive POS see below* mio\POS Relative REL see below* cui\REL cardinal N number quattro\N ordinal NA numeral adjective quarto\NA I non compositional element; illocutive value ehi\INT Noun Preposition Conjunction Numeral Interjection 84 EXAMPLES mangio\Vs1ip era\VWs3ii vederlo\Vfp_E elements objects, (verbs, albero\S Table 2.3. Morphological information for Verbs: finite forms. Verb (nonmain) Number singular V Person s W plural Mood first 1 second 2 indicative subjuncti ve condition al imperativ e p third Tense 3 i (encl.) present p c past r d imperfe ct i m future f _E Table 2.4. Morphological information for Verbs: non-finite forms. Verb V (non-main) W Gender Number masculine* m feminine* f common* n singular* plural* * only for participles Table 2.5 Tags for Non-standard element Non-standard element Compositional Foreign forms New formations Non-compositional Acquisition Onomatopeic TAG EXAMPLES (PoS+)K (PoS+)Z they\PERK torniante\SZ ACQ ONO cutta\ACQ zun\ONO Table 2.6. Non-linguistic (NL) Tag set. Non-linguistic element Paralinguistic Extralinguistic Non-understandable words TAG PLG XLG X EXAMPLES &he|PLG hhh\XLG xxx\X 85 Mood s p Tense infinite f participle p gerund g present (encl.) p _E past r Portuguese tagset Table 3. Portuguese PoS tagset. POS VERB AUXILIARY VERB Sub-type Tag Present Indicative Past Indicative Imperfect Indicative Pluperfect Indicative Future Indicative Conditional Present Subjunctive Imperfect Subjunctive Future Subjunctive Infinitive Inflected Infinitive Gerundive Imperative Past Participles in Compound Tenses Adjectival Past Participles NOUN ADJECTIVE ADVERB PREPOSITION CONJUNCTION ARTICLE DEMONSTRATIVE Proper Noun Common Noun Coordinative Subordinative Indefinite Definite Invariable Variable INDEFINITE POSSESSIVE RELATIVE/ INTERROGATIVE/ EXCLAMATIVE PERSONAL PRONOUN CLITIC NUMERAL Examples V VAUX Invariable Variable Invariable Variable Cardinal Ordinal INTERJECTION ADVERBIAL LOCUTION PREPOSITIONAL LOCUTION CONJUNCTIONAL LOCUTION PRONOMINAL LOCUTION ENFATIC FOREIGN WORD ACRONYMOUS EXTRA-LINGUISTIC PARA-LINGUISTIC FRAGMENTED WORD OR FILLED PAUSE DISCOURSE MARKER DISCURSIVE LOCUTION 86 pi ppi ii mpi fi c pc ic fc B Bf G imp VPP PPA N p c ADJ ADV PREP CONJ c s ART i d DEM i sou, tenho, chamo fui, tive, quis, dormi estava, era, dizia, dormia fora, estivera, comera serei, terei, direi seria, teria, dormiria seja, tenha, chame, durma estivesse, dormisse estiver, dormir estar, ser, dormir estares, seres, dormires estando, chamando, sendo come, dorme, trabalha comido, entregado comido, entregue Lisboa, Amália casa, trabalho feliz, rico, giro só, felizmente, como a, de, com, para e, mas, porque, ou que, porque, quando, como uns, umas a, o, as, os isso, isto v IND i v POS REL essa, aquela, dito i v PES CL NUM c o INT LADV LPREP que, como, quando, quem cujo, quanto eu, tu, ele, nós se, a, o, me, lhe LCONJ só que LPRON ENF ESTR SIGL EL PL FRAG o que, o qual lá, cá, agora okay, mail PSD, ACAPO hhh hum, hã, nanana &dis, &eh MD LD bom, pronto, pá, digamos digamos assim, quer dizer alguém, nada, algo outro, todo meu, teu dois, três primeiro, segundo, terceiro ah, adeus, olá em cima em cima de WITHOUT CLASSIFICATION WORD IMPOSSIBLE TO TRANSCRIBE SEQUENCE IMPOSSIBLE TO TRANSCRIBE SC Sub-Tags 
Ambiguous form TAG : EXAMPLES um\ARTi:NUMc Contracted forms + da\PREP+ARTd Hyphenated forms (excepting compounds) - viu-se\Vppi-CL 87 Pimp xxx Simp yyyy Spanish tagset Table 4.1. Spanish PoS tagset. POS Verb Sub-type Auxiliary Noun TAG V EXAMPLES cantar AUX habrá cantado N mesa NP María Adjective ADJ azul Adverb ADV así, aquí, allí Prepositión PREP ante, bajo, con Conjunctión C y, pero, ni Proper noun Determiner Article ART Possessive POSS mi, tu, su DEM ese, este P PR yo, tú, lo_que que Q uno, dos, tres Interjection INTJ Discourse marker MD primer, segundo muchos, pocos madre mía, yuju oye, o sea, es decir Demonstrative Pronoun Quantifier él, Table 4.3. Semantic, morpho-syntactic, and syntactic features of each POS. POS Semantic features Morphological features Per Num Gen Denote an element or X object classification Mono-referential Fixed inflection X ADJ Denote qualities properties or X X ART Restrict/define the referent of a noun phrase Express relation of posession or ownership X X X X N N (PR) POSS DEM Q REL Syntactic features Examples Subject, direct object, complement Absence of determiners Noun complement, predicative complement Prenominal position, no syntactic function actitudes\ACTIT UD\NCfp Luisa\LUISA\Npi Tense Mood Prenominal (1st series) and postnominal (2nd series) positions Pre and postnominal positions X X Express location of the referent in space and time Express number of Will vary depending on the kind of entity they are quantifying individuals or objects Retrieve the referent of Inherit their morphological features from Different syntactic functions inside the the noun they modify the noun working as the referent clause inside the clause they introduce 88 extranjeros\EXT RANJERO\ADJ mp los\EL\DETdmp mío\MÍO\DETpo ss este\ESTE\DETd em una\UN\Q que\QUE\PR P Refer to a noun phrase X X Verb Express events X X INTJ MD Express a mental state Invariability Guide the inferences Invariability which take place in conversation Establish logical or Invariability discourse bounds C PREP Establish semantic Invariability relationships associated to spatial concepts ADV Set the meaning of the Invariability verb X X 89 Maximum expansion of the noun phrase Central element of the sentence, determine the different syntactic functions Not been assigned Not been assigned yo\YO\PPER1s Relate sentences or elements in a sentence Establish relationships between two elements Do not introduce a second term pero\PERO\C es\SER\Vindp3s ah\AH\INT es decir\ES DECIR\MD de\DE\PREP también\TAMBI ÉN\ADV Synopsis tag sets Table 5a. Synopsis of the PoS tag sets Tag-Set Projection Nouns verbs adjectives adverbs prepositions conjunctions interjections discourse markers emphatic French NOM VER ADJ:QUA ADV PRE CON INT Table 5b. Synopsis of the PoS tag sets Tag-Set Projection French DET:DEF articles (definite determiner) demonstrative determiners DET:DEM demonstrative pronouns PRO:DEM possessive determiners DET:POS possessive pronouns PRO:POS personal pronouns PRO:PER clitic rel-int-excl determiners DET:INT rel-int-excl pronouns PRO:RIN indefinite determiners DET:IND indefinite pronouns PRO:IND numbers (cardinals) NUM numerals (ordinals) ADJ:ORD Italian S V A B E C I Portuguese N V ADJ ADV PREP CONJ INT MD ENF Italian Portuguese Spanish DETd ART (definite (articles) determiner) R (articles) Spanish N V ADJ ADV PREP C INT MD DIM DEM DETdem POS POS DETposs PER PES CL PPER REL REL PR IND IND N NA NUMc NUMo Q (quantifiers) Table 6. 
Synopsis of the morpho-syntactic encodings for verbs
MOOD. French: indicative, subjunctive, conditional, imperative, infinitive, participle. Italian: the same, plus gerund. Portuguese: the same, plus gerundive and adjectival past participle. Spanish: the same, plus gerund.
TENSE. French and Italian: present, past, imperfect, future. Portuguese: present, past, imperfect, pluperfect, future. Spanish: present, past, imperfect, future.
PERSON. First, second, third (all four languages).
NUMBER. Singular, plural (all four languages).
GENDER (only for participles). French and Italian: masculine, feminine, common. Portuguese and Spanish: masculine, feminine.
VERB TYPE. French and Italian: main, non-main. Portuguese and Spanish: main, auxiliary.

Table 7a. Synopsis of the non-standard tag sets
French extralinguistic support & fillers paralinguistic fragments - Italian XLG PLG Portuguese EL PL FRAG Spanish <nl> <sup> - Portuguese ESTR Pimp Spanish <nc> -

Table 7b. Synopsis of the non-standard tag sets
foreign words new formations acquisition forms onomatopoeia meaningless forms euphonic particle non understandable words French XXX:ETR XXX:EUP - Italian (Pos) + K (Pos) + Z ACQ ONO X

APPENDIX 3 - ORTHOGRAPHIC TRANSCRIPTION CONVENTIONS IN THE FOUR LANGUAGE CORPORA

During the construction phase of the project, it was agreed that a number of transcription conventions would be common to all languages (speaker notation, hesitations, prosody annotation, etc.), but that each team, within its national orthographic tradition, would keep its own conventions for all other features regarding the orthographic transcription of spontaneous speech. The following are the main choices adopted by each team.

Italian transcription conventions

In the transcription of the texts included in the C-ORAL-ROM Italian corpus, the use of capital letters has been reserved for:
a. Christian names (Calvino; Paola),
b. Toponyms (Parigi, Prato),
c. Odonyms (via Burchiello, Porta Romana, Via Pian de' Giullari, Ponte alla Vittoria),
d. TV programme names (Fantastico),
e. Film titles (Piccolo grande uomo, Il secondo tragico Fantozzi),
f. Book titles (Memoriale, Vangelo, Divina Commedia), 52
g. Band names (Depeche Mode; Neganeura).

Abbreviations are written entirely in block capitals, without full stops between the letters: BTP, MPS. Capitals are used in accordance with their role in traditional grammar, and with the latitude that grammar allows them: e.g., in expressions which contain toponyms and odonyms preceded by a common name, only the proper name always requires a capital letter, whereas the common name can be written with either a capital or a small initial letter; as for titles of works, a capital is only compulsory for the first letter (see Serianni 1997:46).
Italian Glossary of non standard and regional expressions 52 For titles, citations and discourse, inverted commas were not used 92 a' = ai a i' = al 'a= la abbi= abbia 'abbè = vabbè ae' = avere aere = avere aerlo= averlo ahia = interjection ahm = onomatopoeia ahò = interjection aimmeno, aimmen = almeno allo'= allora ammeno= almeno analizza'= analizzare anda' = andare andaa = andava andao = andavo anderà = andrà andesti = andasti antro, antra = altro, altra 'apito = capito apri'= aprire arà = avrà ara' = avrai 'arda= guarda arebbe = avrebbe arria = arriva arriò = arrivò aspe'= aspetta (2nd singular person) ata= altra attro, attri = altro, altri ave' = avere aveo, avei, avea, aveano, avean = avevo, avevi, aveva, avevano, avevano avessin = avessero avra' = avrai avre' = avrei 'azzo = cazzo barre = bar beh= interjection bell' e = bello e = già bigherino = lace bòna, bòni, bòno = buona, buoni, buono bongiorno = buongiorno bra'= bravo briaca, briache = ubriaca, ubriache brioscia = brioche brum = onomatopoeia brururum = onomatopoeia bum= onomatopoeia burum burum = onomatopoeia bz = onomatopoeia capi' = capire capi' = capito ché = perché chello, chelli = quello, quelli cherce = querce chesto, chesta = questo, questa chi = qui chie ? = chi ? chiù = più (variant from Campania) ciaccioni= impiccioni 'cidenti = accidenti co' = con, coi còcere = cuocere compra' = comprare credeo = credevo da' = dai da i'= dal da'= dare dagnene = 'daglielo/a/i/e', 'dagliene', 'darglielo/a/e/i', 'dargliene' dalle, dalli = dagli (interjection) dao = davo de' = del, dei dèo, dèi, dèe, dèano, dèan = devo, devi, deve, devono, devono devan = devano, devono di' = del di' = di' (to say imperative) diceo, dicea, diceano = dicevo, diceva, dicevano dignene = 'diglielo/a/i/e', 'digliene', 'dirglielo/a/e/i', 'dirgliene' dimórto = di molto, di molto dio bò' = dio buono dividano = dividano, dividono do'= dove dòle = duole domall' altro = domani l'altro doveo, dovea, dovean = dovevo, doveva, dovevano doventò= diventò drento = dentro du' = due dugentomila = duecentomila dum = onomaotopea dumila = duemila e' = il, i e'= lui/loro (subject) and io 'e = le (article, southern variant) ecce' = eccetera embè = interjection esse' = essere fa' = fai, fare facci = faccia faccino = facciano faceo, facei, facea, facevano, facean = facevo, facevi, faceva, facevano, facevano 'fatti = see 'nfatti fizzaa = see infizzaa fo = faccio fòco = fuoco fòri = fuori frazio = puzza gioane = giovane 'giorno = see 'ngiorno gli, gl' = subject pronoun 3rd singular 93 'gna= bisogna gnam = onomatopoeia gnamo = andiamo governavi = governavi, governavate gozza' = gozzare = ingozzarsi gua' = guarda guarda' = guardare guardaan = guardavano ha' = hai ha' voglia = hai voglia = certamente he' = hai i' = il icché= che cosa, quello che (interrogative/relative pronoun) ieh = interjection ih= interjection indo' = dove indoe = dove infizzaa = infilzava intendano = intendono 'io = Dio 'isto, 'isti = visto, visti ito= andato laóra = lavora le'= lei learsi = levarsi leato, leata = levato, levata leggile = leggerle lu' = lui ma' = mai ma'= mamma macellaro = macellaio mandaa = mandava mane mano = a mano a mano mangia' = mangiare mangiari = pietanze mangitoie= mangiatoie mannaja= mannaggia (interjection) mésse = mise méssi = misi mettan = mettano, mettono metteo, mettea = mettevo, metteva mettignene = mettiglielo/a/i/e, mettigliene, mettergliene, metterglielo/a/i/e mi' = mia, mio mo' = adesso, ora mòre = muore mórto= molto mòve = muove mòvere = muovere muah = onomatopoeia 'n = in 'n', 
'na, = un' , una (variant from Campania) 'ndo'= dove 'ndove= dove ne'= nei nesci = gnorri 'nfatti = infatti 'ngiorno = buongiorno 'nsomma = insomma ni' = nel 'nnaggia = mannaggia nòvo, nòva = nuovo, nuova òmini= uomini òmo = uomo 'orte = volte ostia= interjection pa' = padre paga' = pagare pah = onomatopoeia parean = parevano passai= passati pe' = per pem = interjection perdie = perdio (interjection) 'petta = aspetta pi pi pi pi = onomatopoeia piacea = piaceva piantaa = piantava piglia' = pigliare 'pito = capito pò = puoi, può po' = poi poera, poeri, poerino, poerina, poerini, poeretti = povera, poveri, poverino, poverina, poverini, poveretti poho= poco pòi = puoi pòle = può polea= poteva poleva= poteva pomarola = pummarola portagnene = 'portaglielo/a/i/e', 'portagliene', 'portarglielo/a/e/i', 'portargliene'. portao = portavo possano = possano, possono poteo, potea = potevo, poteva potette = potè prendilo = prenderlo presempio = per esempio proemi = problemi proi = provi prr = onomatopoeia pulìo = pulivo quarcuna = qualcuna quande = quando quante = quanto que' = quei QUELL' ALTRI = QUEGLI ALTRI qui' = quel rendeo = rendevo rendano = rendano, rendono resta' = restare richiedegnene = 'richiediglielo/a/i/e', 'richiedigliene', 'richiederglielo/a/e/i', 'richiedergliene' rideo = ridevo rifa' = rifare rifò = rifaccio rinvortata = rinvoltata riprendila = riprenderla riscòte' = riscuotere riscòto = riscuoto ritonfo = mi tocca ritornare ritorna' = ritornare 'rivederci = arrivederci rivòle = rivuole rompé = ruppe rompiede= ruppe rompò = ruppe ròta, ròte = ruota, ruote sa' = sai sape'= sapere sapeo, sapean = sapevo, sapevano sare' = sarei scarza = scalza (scalzare verb) se' = sei (numeral and verb) sede' = sedere segnaa = segnava seguro = sicuro sembraa, sembraan = sembrava, sembravano sentìo = sentito 'sicologia = psicologia 'sicologo = psicologo sie = interjection or 'sì' (meaning "no") smette' = smettere so' = sono sòcera = suocera 'somma = see 'nsomma sòr = signore sòrdi = soldi sòrdo = soldo sorte = esce (from sortire) SPENDE' = SPENDERE 'spetta = aspetta 'st', 'sta, 'ste = quest', questa, queste sta' = stai sta'= stare sta'= stare staa = stata staa = stava 'ste robe che qui = queste cose qui stendea = stendeva 'sto, 'sti = questo, questi studia' = studiare su' = suo, sua su i'= sul su' = sui sua = sua, suoi t' = tu telefonera' = telefonerai tenea = teneva tenevi = tenevi, tenevate tennicamente = tecnicamente tin = onomatopoeia to' = tuo toccaa = toccava toh = interjection troa = trova 94 tu' = tuo-a-e , tuoi tu tu tu tu = onomatopoeia tum= onomatopoeia 'u = il uh= interjection 'un= non va' = vai vabbè = va bene vabbò= va bene (buono) ve' = vedi vedano, vedan = vedano, vedono vede' = vedere vedea = vedeva vèn = vien vengan = vengono veni' = venire venìa, venìan = veniva, venivano vie' = vieni (imperative) vo = vado vò'= vuoi vò'= vuole voglian = vogliano, vogliono vòl = vuole vòle = vuole voleo, volea = volevo, voleva vorre' = vorrei vorte = volte vota = volta vòto = vuoto vu' = voi vuum = onomatopoeia www = onomatopoeia za = onomatopoeia zzz = onomatopoeia High frequency interjections and discourse particles 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 EH MH AH VABBE’ OH MAH BEH CIAO BOH MAGARI BUONASERA BAH HE MAMMA_MIA OHI VIA GRAZIE UH HAN EEE’ MAMMA ODDIO ACCIDENTI BASTA MADONNA PER_CARITA’ UEH EHM IEH PERDIO AHI BU IH UHM 
ARRIVEDERCI BUONGIORNO CHI_SE_NE_FREGA HEI MACCHE' MANNAGGIA SH VAFFANCULO AHO' BUONANOTTE CHE_PALLE CRISTO DAGLI EHI MALE MENO_MALE OOH AHM AMEN CASPITA GRAZIE_AL_CIELO HI HUM IA IN_BOCCA_AL_LUPO MA_VA' MARAMAO MHM MOH OHE' OHIOHI PERDINCI PREGO TOH UFFA UHI UMH VABBUO’ 95 French transcription conventions 1. Acronyms were transcribed using the common practice in French texts: sometimes with dots (e.g. C.N.R.S), sometimes without (e.g. ATALA). 2. Some proper names were anonymized (and the corresponding fragment was replaced by a beep in the sound file) using the following convention: a. _P1, _P2 , etc. for names of persons; b. _S1, _S2, etc. for names of companies ("sociétés"); c. _T1, _T2, etc. for names of places; d. _C1, _C2, etc. for numbers ("chiffres"), such as telephone or credit card numbers. 3. Undecidable spellings were put within parentheses. E.g. il(s) voulai(en)t pas, j'en (n')ai pas voulu. 4. Titles (works, radio broadcasts, etc.) were enclosed within quotes. E.g. "Fables de La Fontaine". 5. Phonetic transcriptions (particular or deviant pronunciations) were given when needed (on the separate %pho tier) using the SAMPA alphabet. The SAMPA alphabet for French53. a. Consonants Plosives Fricatives Nasals Liquids Glides Symbol p b t d k g f v s z S Z m n J N l R w H j Example pont bon temps dans quand gant femme vent sans zone champ gens mont nom oignon camping long rond coin juin pierre Transcription po~ bo~ ta~ da~ ka~ ga~ fam va~ sa~ zon Sa~ Za~ mo~ no~ oJo~ ka~piN lo~ Ro~ kwe~ ZHe~ pjER b. Vowels Oral Symbol i e E a Example si ses seize patte Transcription si se sEz pat 53 http://www.phon.ucl.ac.uk/home/sampa/home.htm 96 Nasal Indeterminate Symbol A O o u y 2 9 @ e~ a~ o~ 9~ E/ A/ &/ O/ U~/ High frequency interjections and discourse particles 1 ben 2 bon 3 hein 4 mh 5 ah 6 quoi 7 hum 8 eh 9 oh 10 bé 11 pff 12 bah Example pâte comme gros doux du deux neuf justement vin vent bon brun = e or E = a or A = 2 or 9 = o or O = e~ or 9~ 13 14 15 16 17 18 19 20 21 22 23 Transcription pAt kOm gRo du dy d2 n9f Zyst@ma~ ve~ va~ bo~ bR9~ allô tiens pardon merci hé tth na là comment attention ciao 97 Portuguese transcription conventions The general orthographical norms As a general rule, the team transcribed the entire corpus according to the official orthography. a.- Free variation Variants of a word were transcribed whenever they were registered in the reference Portuguese dictionaries. Such is the case of phonetic phenomena like apharesis: ainda > inda; prothesis: mostrar > amostrar, metathesis: cacatua > catatua or alternations: louça / loiça. b. Proper names It is extremely difficult in spoken texts to accurately set up the differences between proper and common names. For this reason, the team’s tradition is to transcribe with initial capital letter only anthroponomys and toponyms. These names correspond, effectively, to the functions more consensually assigned to the proper names: designating and not signifying (Cf. SEGURA DA CRUZ, 1987: 344-353). Some examples are for toponyms: África, Bruxelas, Casablanca, China, Lisboa, Magrebe, Pequim and for anthroponyms: Boris Vian, Camões, Einstein, Madonna, Gabriel Garcia Marques, Oronte. c. Titles Titles are in quotes, such as books: “crónica dos bons malandros”, movies: “pixote”, pieces of music: “requiem” (de Verdi), newspapers: “expresso”, radio programmes: “feira franca” or television broadcasts: “big brother”. d. 
Foreign words. Foreign words were transcribed in the original orthography whenever they were pronounced close to the original pronunciation: cachet, check-in, feeling, free-lancer, emmerdeur, ski, stress, voyeurs, workshop. When foreign words were pronounced in a "Portuguese way", they were transcribed according to the entries of the Portuguese reference dictionaries, or according to the orthography adopted in those dictionaries for similar cases. This last case includes all the hybrid forms like pochettezinha, shortezinho, stressado, videozinho.
e. Wrong pronunciations. When the speaker mispronounced a word and immediately corrected it, the two spellings were maintained in the transcription of the text: lugal/lugar, sau/céu. If the speaker mispronounced a word and went on in his speech without any correction, the standard spelling of the word was kept in the transcription and a note regarding the wrong pronunciation was added in the header comments (for instance, in the text pfamdl07 the informant produces "banrdarilheiro" instead of bandarilheiro).
f. Paralinguistics and onomatopoeia. Paralinguistic forms and onomatopoeia not registered in the reference dictionaries were transcribed so as to represent the sound produced as closely as possible: pac-pac-pac, pffff, tanana, tatata, uuu.
g. Acronyms. Acronyms were transcribed in capital letters, without dots: TAP and not T.A.P. However, if the acronym already has an entry in the Portuguese dictionaries as a common name, it was transcribed in lower case, like sida or radar.
h. Shortened forms. Shortened forms were kept in the transcription: prof.
i. Numbers. Dates and numbers were always written out in full: mil novecentos e trinta e cinco; mil escudos; mil paus.
l. Letters. The names of the letters were always transcribed in full: pê for P, erre for R, xis for X.

Interjections, Discourse Markers, Emphatics
Position: 1 POIS, 2 PORTANTO, 3 SIM, 4 AH, 5 PRONTO, 6 ENTÃO, 7 PÁ, 8 CLARO, 9 LÁ, 10 AI, 11 OLHA, 12 ASSIM, 13 EXACTO, 14 DIGAMOS, 15 OLHE

Spanish transcription conventions

As far as the transcription of the sound files is concerned, the Spanish team has respected the rules set by the Real Academia Española.
1. Acronyms and symbols. Acronyms are shown in block capitals and without dots, as in IVA, ONG, NIF... We used small letters for the plural forms of these acronyms, as in ONGs. In the case of symbols related to science (chemistry and so on), we have followed the usual conventions. The letter x was used in certain contexts to refer to a non-specific quantity. No abbreviations were used in the transcription of the sound files.
2. New words. We included words which, although not included in the Real Academia Española dictionaries, have a high frequency of occurrence in spoken Spanish, such as porfa, finde or pafeto.
3. Numbers. Numbers were transcribed in letters, except for numbers which are part of proper nouns, as in La 2, and numbers included in mathematical formulas. Roman numerals were used when referring to popes or kings, as in Juan XXIII, as well as in names in which they are included, as N - III.
4. Capital letter. Capital letters were used when transcribing proper nouns referring to people (including nicknames), as Inma or el Bibi, as well as cities, countries, towns, regions, districts, squares, streets and so on, as in Segovia, Carabanchel or Madrid.
The same was applied to names of institutions, entities, organisations, political parties, etc., as in the case of Comunidad de Madrid, Ministerio de Hacienda, la Politécnica. Capital letters were also used when naming scientific disciplines, as well as entities which are considered absolute concepts and religious concepts, such as la Sociología, Internet, la Humanidad, tu Reino y el Universo, while in the case of Señor salvador the second word, being an adjective, will not be written in capital letters. Names of sports competitions were transcribed (all words) with capital letters, as in Copa del Rey or Champions League. In the case of books and song titles, as well as all kinds of works of art, even television programs, only the first word was written in capital letters, except in cases like Las Meninas (as is conventional). Both nouns and adjectives included in the names of newspapers, magazines and such were written with capital letters, as in El Mundo or El País. The names of stores and commercial brands as El Corte Inglés were transcribed following the registered name. 5. Italics. No italics were used in the transcription of the sound files. 6. Foreign words. Foreign words were written following the original. Words of foreign origin were written with the original orthography when pronounced in that language, whether they're included in the academic dictionaries or not. When adapted to Spanish, these words were transcribed following the rules set in these dictionaries. Orthography for non-standard words. These non-orthographic productions have been labelled in C-ORAL-ROM as %alt and are as follows: abrís además adictiva adonde además adelante armónica Amsterdam adonde abréis amás aditiva aonde aemás alante amónica Amsterdar aonde 100 arcadas ortodoncias audiencia azulejos básico a El be cero Durán i Lleida Josep Andoni Durán i Lleida bungaló bronconeumonía casa claro claro cassette Champions chiquitita chisme alcoba cuélgalo compacts consciente conscientes corazón cuadraditos cuñada decorado delegado demasiado desilusión desplazarte divorciaron dijéramos dijiste diagnosticado disminuirá luego durmió engaño enseguida entonces entonces entonces entonces entregar escuela especificidad estaros estructura estuvisteis extranjeros farolero frase friegaplatos friolera habéis habla hecho hubieses importante importa infracción instituto instrumento instrumento interfono entierro Jesús joder joder joder joder arcás artodoncias audencia alzulejo básimoco al berceo Durán Lleida Josep Andoni Durán Lleida bílgaro bronconeunomia ca cao caro casé Champion chiquetita chirme colba cólgalo compas conciente concientes corasón cuadraitos cuñá decorao delegao demasiao desilución desplatazarte devorciaron diciéramos diiste dinosticado disminurá dogo dormió encaño enseguía entoes entoces entones tonces entrebar escuella especifidad estaos estrutura estuvistesis extranyeros falorero flase fregaplatos friorera habís hambla hezo hubiecese importanto impota infranción istituto instumentro trumento intérfono intierrro Jesu jode joe joé joer 101 joder lado lados luego me ha maternidad mecanismos mediados bueno merdado mental meteorología armónica te ha muy nada nada laten buenos obtusas oftalmólogo a donde padrino para adelante para pasado patada espera pesado pesados pescado pues pues comprado pero positivo precipitarnos prénsiles proporcionaba pringado propia pues puede podamos puedes pues puñado diálogos relativamente resultado resultados se ha se había ser siguiendo sexo situado socialmente nos superponiendo 
sostenible está también están esté tendrán tienes tienen tienes tinglado todavía tubería oe lao laos logo ma marternidad mecaneismos mediaos meno mercao mertal metereología mónica ta mu na nara naten nos obstrusas ofmólogo onde paíno palante pa pasao patá pera pesau pesaos pescao po pos pompao poro posetivo precipietarnos prensiles preporcionaba pringao pripia pu pue puedamos pues pue puñao reálogos relamente resultao resultaos sa sabía saer seguiendo seso sintuado socialremente son superponiendos sustenible ta tamién tan te terán tie tien ties tinglao toavía tobería
entonces toda todas todo todos tobillo tutiplén estuvo o utilizó verdad vaya toces toa toas to tos torbillo tutiplé tuvo u utilició verdá yava

Interjections.
¡ah! ¡anda! ¡brum! ¡bueno! ¡chun! ¡Dios mío de mi alma! ¡ey! ¡hombre! ¡jobar! ¡jolines! ¡madre! ¡madre mía de mi vida y mi corazón! ¡mua! ¡ojo! ¡ostras! ¡oy! ¡pum! ¡ups! ¡ahí va! ¡bah! ¡buah! ¡cachis en la mar! ¡coño! ¡eh! ¡hala! ¡jo! ¡joder! ¡leche! ¡madre del amor hermoso! ¡madre mía! ¡oh! ¡ole! ¡ouh! ¡por Dios! ¡uh! ¡yeah!

APPENDIX 4 - C-ORAL-ROM PROSODIC TAGGING EVALUATION REPORT

1 Introduction

This is the final report on the evaluation of the C-ORAL-ROM prosodic tagging. Loquendo organized the evaluation on behalf of the C-ORAL-ROM Consortium. The evaluation for the four Project languages took place during Fall 2003. Its main goal was to assess the perceptual relevance of the coding scheme adopted in the Project for annotating prosodic breaks, as applied to the different languages.

According to the Consortium requirements (see the Technical Annex of the C-ORAL-ROM Project), two naïve evaluators for each of the four languages were asked to evaluate the prosodic annotation of a subset of the corpus in their language. Each evaluator, independently of the other, had to examine the original annotation and possibly correct it by deleting, inserting or substituting prosodic-break tags. The experimental setting is described in Chapter 2.

The evaluation data were then gathered and statistically analyzed in order to measure the degree of consensus expressed by the evaluators towards the original annotation. The adopted metrics are described in Chapter 3. The cases of disagreement, where one or both evaluators corrected the original annotation, were compared with the total number of word boundaries and, more perspicuously, with the number of positions which are reasonable candidates for a prosodic break. The agreement between the evaluators was also measured.

On the basis of the assumption that an annotation scheme is good if it can be applied with a high degree of accordance by two or more independent annotators, the Consortium decided to apply a replicability statistic to compare the three annotations obtained from C-ORAL-ROM and from the two evaluators. Following previous work on annotation coding for discourse and dialogue 54, the Kappa coefficient was calculated to show the replicability of the results. Such a statistic is useful to test not only whether the annotators agreed on the majority of the coding, but also to what extent that agreement is significantly different from chance. As will be explained in paragraph 3.5.2, the Kappa coefficient measures pair-wise agreement among a set of annotators, correcting for the expected chance agreement. When there is total agreement, Kappa is one. When the agreement happens by chance, Kappa is 0.

54 A. Isard and J. Carletta (1995), "Replicability of transaction and action coding in the Map Task corpus", in J. Moore et al. (eds.), Empirical Methods in Discourse Interpretation and Generation, Working Notes of the AAAI Spring Symposium Series, Stanford University, Stanford, CA, 60-66.
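In formula form, Kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed proportion of agreement between two annotations and P(E) is the proportion of agreement expected by chance, given the frequency of each class in the two annotations. The following is a minimal Python sketch over the three boundary classes used in this evaluation (O, N, T); it is only an illustration, not the statistical tooling actually used for the report:

    from collections import Counter

    def kappa(tags_a, tags_b):
        # Pair-wise Kappa for two annotations of the same word boundaries.
        n = len(tags_a)
        assert n == len(tags_b) and n > 0
        observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n      # P(A)
        freq_a, freq_b = Counter(tags_a), Counter(tags_b)
        chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)   # P(E)
        return (observed - chance) / (1 - chance)                       # assumes P(E) < 1

    # Original annotation C and one evaluator, over ten word boundaries:
    original  = list("OOTONOOTOO")
    evaluator = list("OOTONOONOO")
    print(round(kappa(original, evaluator), 3))   # 1 = total agreement, 0 = chance level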
In the literature there has been discussion about what makes a "good" level of agreement; it probably depends on several different variables, including the presence of an expert annotator. In this work the expert coder (i.e. the C-ORAL-ROM portions of the corpora) was compared to the naïve coders' choices.

2 Experimental Setting

The goal of the evaluation is to assess the reliability of the prosodic tagging of the C-ORAL-ROM speech corpora. Such tagging consists in the marking of prosodic breaks in the orthographic transcription of speech. Each word boundary in the corpus is a potential position for a break. Breaks can be non-terminal or terminal. Terminal breaks are considered the main cue for the detection of utterances, which are the reference unit of spontaneous speech. The annotation is based only on perceptual judgments. It is the result of a first labeling by a first annotator and two successive revisions by two different annotators. The evaluation performed by Loquendo aims to test the hypothesis that prosodic breaks, especially the terminal ones, have strong perceptual prominence and can be detected with a high level of inter-annotator agreement.

The C-ORAL-ROM Project provides four multimedia corpora of spontaneous speech for French, Italian, Portuguese and Spanish, each amounting to about 300,000 words (roughly 35 hours of speech). Given the size of the resource, the evaluation of the prosodic tagging was necessarily performed on a statistically significant portion of each corpus. According to the Consortium requirements, from each language corpus a subset was extracted, amounting to roughly 1/30 of its utterances (about 1,300 utterances and around one and a half hours of speech). The speech sections to be evaluated were automatically selected with a random procedure ensuring the same distribution of speech types as in the corpus, which is organized according to the following tree structure:

[Tree diagram: the language corpus splits at the top level into an informal and a formal part; the informal part splits into private and public sections, and the formal part into natural context, media and telephone sections, with dialogues and monologues at the leaves.]

The selection procedure (see Appendix A.3) 55, based on the software tool Win Pitch Corpus, extracts dialogues/monologues in the same proportion from each node of the tree and guarantees the semantic and contextual coherence of the speech sections to be evaluated, by choosing continuous series of utterances, where possible, and by also providing the utterances surrounding the selected sections. For each selected speech section, the procedure outputs an XML file ensuring text-audio alignment, and a text file where each tagged utterance is reported twice (validation copy).

55 All the appendixes to this evaluation report are not included in DESIGN.doc; they can be found on DVD9 in the directory "Utility/specifications".
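Purely for illustration, the proportional extraction performed by the selection procedure could be sketched as follows. The node names, utterance identifiers and data structures below are hypothetical; the actual selection is implemented on top of Win Pitch Corpus:

    import random

    # Hypothetical layout: each node of the corpus tree maps to its utterance ids.
    corpus = {
        "informal/private/dialogue": ["ifamdl_%03d" % i for i in range(300)],
        "informal/public/monologue": ["ipubmn_%03d" % i for i in range(120)],
        "formal/media/news":         ["imednw_%03d" % i for i in range(60)],
    }
    FRACTION = 1 / 30   # about 1/30 of the utterances of each node

    def select_sample(corpus, fraction, seed=0):
        rng = random.Random(seed)
        sample = {}
        for node, utterances in corpus.items():
            size = max(1, round(len(utterances) * fraction))
            # a continuous series of utterances keeps semantic and contextual coherence
            start = rng.randrange(len(utterances) - size + 1)
            sample[node] = utterances[start:start + size]
        return sample

    for node, chosen in sorted(select_sample(corpus, FRACTION).items()):
        print(node, len(chosen))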
The generation of the samples for the evaluation was accomplished at Loquendo's premises, with the assistance of an engineer from the University of Florence acting as consultant. For each language, two mother-tongue evaluators were chosen, with a medium cultural level and no specific expertise in phonetics and prosody (see Appendix A.1 and Appendix A.6). No constraint on regional origin was imposed on the choice (see Appendix A.2), except for Portuguese, for which both the European and the Brazilian varieties were covered.

The evaluators received a two-day training (see Appendix A.4), in which the trainers illustrated the goal of the evaluation, the C-ORAL-ROM corpus structure, the meaning and format of the prosodic tagging, and the evaluation criteria and procedure. The notion of terminal and non-terminal break was carefully explained by discussing specific examples extracted from the corpus. Written instructions were delivered to the evaluators (see Appendix A.4). At the end of the training, a test was performed in order to assess the acquired competence of the evaluators and to ensure consistency between them in the evaluation.

The evaluation was carried out from mid September to mid October 2003 for the Italian and Portuguese corpora, and from mid October to mid November for French and Spanish. Each evaluator worked independently of the others and spent around sixty hours to accomplish his task, in four-hour daily sessions. The task was performed on personal computers, with the help of the WPC tool, which allowed viewing the annotated text to be evaluated and listening to the corresponding aligned audio signal.

The evaluation file, as output by the sample generation procedure (see Appendix A.3), reported each annotated utterance twice. The second copy of the utterance was the validation copy, which the evaluator could modify when he did not agree with the original tagging. The evaluator listened to the selected utterance (no more than three times) and considered the possible existence of prosodic breaks at each word boundary position. If his perception did not match the original tagging, he could modify the validation copy by inserting, deleting or substituting break marks. In case the evaluator did not understand part of the utterance, or was not able to evaluate it, he could exclude that text portion by enclosing it between two asterisks. The detailed evaluation procedure is attached in Appendix A.5.

None of the eight evaluators reported any difficulties in the evaluation, and all of them could easily accomplish their task. All the evaluation files were carefully checked in order to detect possible mistakes (double spacing, missing blanks, incorrect tags, word deletions, etc.). Thereafter, the evaluation files were ready to be analyzed in order to measure the degree of consensus expressed by the evaluators with respect to the C-ORAL-ROM prosodic tagging.

3 Measures and Statistics

The measures and statistics adopted by Loquendo are based on the requirements expressed in the C-ORAL-ROM Technical Annex 56. These requirements are:

"The evaluation will focus on the following cases:
1) whether or not the selected utterance actually ends with a perceptually relevant prosodic break (// or at least /) (generic confirmation)
2) whether the evaluator actually perceives that break as terminal (specific confirmation)
3) whether a perceived terminal break turns out not to be marked as a terminal break (lack of a //) (terminal missing)
4) whether, to her/his perception, any non-terminal break marked within that utterance is on the contrary terminal (added terminal evaluation)
5) whether a perceived non-terminal break turns out not to be marked as a non-terminal break (non-terminal missing) [...]
The measurements will ensure the evaluation of consensus at least in terms of:
1) number of breaks registered
2) Kappa statistics
3) percentage of consensus on terminal and non-terminal breaks [...]
The institution under subcontracting will provide statistical analysis that clearly distinguishes the following cases, which will be explicitly reported in the description of the resource:
1) strong disagreement (disagreement by both evaluators)
- with respect to the terminal prosodic tags
- with respect to non-terminal tags
2) partial consensus (disagreement by only one evaluator)"

56 Technical Annex of the C-ORAL-ROM Project, Annex 1.

3.1 Evaluation data

The evaluation data are obtained from the evaluation files through a number of steps, described in the following paragraphs. Word boundaries W (candidate positions for prosodic breaks) are classified for the purpose of the evaluation into the following classes:

Tag  Semantics
O    no break
N    non-terminal break (/, [/], ///)
T    terminal break (//, ?, ..., +)

Each position in the evaluation file is classified with a tag expressing the agreement of the evaluator with the original annotation. In the following table, the tags are ordered according to their increasing degree of disagreement.

Tag  Semantics                                  Criticality
0O   agreement on non-break boundary            ok
0N   agreement on non-terminal break            ok
0T   agreement on terminal break                ok
1i   non-terminal insertion                     non-critical
1d   non-terminal deletion                      non-critical
2ns  non-terminal break substitution (N->T)     critical
2ts  terminal break substitution (T->N)         critical
3i   terminal insertion                         very critical
3d   terminal deletion                          very critical

Computations on the evaluation files are performed according to the corpus tree-structure described in Chapter 2, starting from each leaf of the tree, i.e. each dialogue/monologue file. Then, for each node, its measures are obtained by summing up the measures performed on the files (dialogues/monologues) subsumed by the node. Cumulative computations are also performed horizontally (e.g. by summing up the measures of all the "dialogue" nodes, or all the "monologue" nodes). In the following, we describe the measures performed on each leaf file.

3.2 First step: binary comparison file

Starting from an evaluation file, which reports both the original tagging in the C-ORAL-ROM corpus (C) and the evaluator choice (E), a first parser generates a comparison file (B) where each word boundary (candidate position for a break) is represented as a record with the following information:

b.1 utterance number 57
b.2 speaker name
b.3 position (sequential number) of the word boundary
b.4 character position in the original line
b.5 character position in the evaluator line
b.6 break found at the given position in the original text
b.7 class (O, T, N) of the original boundary
b.8 break found at the given position in the evaluator text
b.9 class (O, T, N) of the evaluator boundary
b.10 agreement tag (0O, 0N, 0T, 1i, 1d, 2ns, 2ts, 3i, 3d) between original and evaluator

Positions excluded from the evaluation do not contain the information from b.4 to b.10 and are marked explicitly with the following expressions:
"hhh" non-linguistic phenomena
"Not evaluated" text portions not understood by the evaluator

57 The notion of utterance should in principle correspond to a text portion delimited by terminal breaks. It turns out that in some cases a line in the evaluation file is not terminated with a break, and so it is improperly counted as an "utterance". This is not a problem for the evaluation, as the end-of-line position will simply be marked with an O and evaluated as usual.
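The classification rule can be made explicit with a small sketch that maps a pair of boundary classes, the original C and the evaluator E, to one of the nine agreement tags (illustrative Python, not the actual parser):

    # Agreement tag for one word boundary, from the class (O, N, T) assigned
    # by the original annotation (c) and by the evaluator (e).
    def agreement_tag(c, e):
        if c == e:
            return {"O": "0O", "N": "0N", "T": "0T"}[c]
        if c == "O":                              # the evaluator inserted a break
            return "1i" if e == "N" else "3i"
        if e == "O":                              # the evaluator deleted a break
            return "1d" if c == "N" else "3d"
        return "2ns" if c == "N" else "2ts"       # N->T or T->N substitution

    assert agreement_tag("O", "T") == "3i"    # terminal insertion
    assert agreement_tag("N", "O") == "1d"    # non-terminal deletion
    assert agreement_tag("N", "T") == "2ns"   # substitution of a non-terminal break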
Example of comparison file B:

UTT  SPEAK  POS  CHAR-R  CHAR-E  BREAK-R  VALUE-R  BREAK-E  VALUE-E  AGREE
1    ...    0*   12      12      //       T        //       T        0T
2    ...    0*   13      13      //       T        //       T        0T
3    ...    0    5       5                O                 O        0O
3    ...    1    19      19               O                 O        0O

3.3 Second step: ternary comparison file

Starting from the two comparison files B-E1 and B-E2 for the two evaluators E1 and E2, a new file (T) is generated, where each word boundary is represented as a record with the following information:

t.1 utterance number
t.2 speaker name
t.3 position (sequential number) of the word boundary
t.4 class (O, T, N) of the boundary according to the original C
t.5 class (O, T, N) of the boundary according to evaluator E1
t.6 class (O, T, N) of the boundary according to evaluator E2
t.7 agreement tag (0O, 0N, 0T, 1i, 1d, 2ns, 2ts, 3i, 3d) between the original C and E1
t.8 agreement tag (0O, 0N, 0T, 1i, 1d, 2ns, 2ts, 3i, 3d) between the original C and E2
t.9 agreement tag (0O, 0N, 0T, 1i, 1d, 2ns, 2ts, 3i, 3d) between E1 and E2

Positions excluded from the evaluation are explicitly marked as follows:
- '?' in columns t.4 to t.9: non-linguistic phenomena
- '*' in columns t.5, t.7, t.9: text portions not understood by evaluator E1
- '*' in columns t.6, t.8, t.9: text portions not understood by evaluator E2
- '*' in columns t.4 to t.9: text portions not understood by both evaluators

Example of ternary comparison file T:

UTT  SPK  POS  REF  EV1  EV2  R-E1  R-E2  E1-E2
1    PAT  0    O    O    O    0O    0O    0O
1    PAT  1    O    O    O    0O    0O    0O
1    PAT  2    O    O    O    0O    0O    0O
1    PAT  3*   T    T    T    0T    0T    0T
2    ROS  0*   ?    ?    ?    ?     ?     ?
3    MIG  0    N    N    N    0N    0N    0N
3    MIG  1    O    O    O    0O    0O    0O
3    MIG  2*   T    T    T    0T    0T    0T
4    GUI  0*   ?    ?    ?    ?     ?     ?

3.4 Third step: measures

Given a ternary comparison file T, a new file M is generated, reporting the measures obtained by counting the following events:
3.4 Third step: measures

Given a ternary comparison file T, a new file M is generated reporting the measures obtained by counting the following events:

General data
m.1   "utterances"
m.2   word boundaries
m.3   '?' word boundaries
m.4   T breaks by the original annotation C
m.5   N breaks by the original annotation C

Binary comparison E1-C (all measures, including mb1.2 and mb1.3, refer to positions evaluated by E1)
mb1.1   word boundaries evaluated by E1
mb1.2   T breaks by the original annotation C
mb1.3   N breaks by the original annotation C
mb1.4   T breaks by the evaluator (t.5 = T)
mb1.5   N breaks by the evaluator (t.5 = N)
mb1.6   agreement t.7 = 0O
mb1.7   agreement t.7 = 0N
mb1.8   agreement t.7 = 0T
mb1.9   disagreement t.7 = 1i
mb1.10  disagreement t.7 = 1d
mb1.11  disagreement t.7 = 2ns
mb1.12  disagreement t.7 = 2ts
mb1.13  disagreement t.7 = 3i
mb1.14  disagreement t.7 = 3d
mb1.15  disagreement t.7 = 3i at "utterance" end (end-of-line)
mb1.16  N break misplacements: occurrences of <1i, 1d> or <1d, 1i> on two consecutive word boundaries with identical speaker name
mb1.17  T break misplacements: occurrences of <3i, 3d> or <3d, 3i> on two consecutive word boundaries with identical speaker name

Binary comparison E2-C (all measures, including mb2.2 and mb2.3, refer to positions evaluated by E2)
mb2.1   word boundaries evaluated by E2
mb2.2   T breaks by the original annotation C
mb2.3   N breaks by the original annotation C
mb2.4   T breaks by the evaluator (t.6 = T)
mb2.5   N breaks by the evaluator (t.6 = N)
mb2.6   agreement t.8 = 0O
mb2.7   agreement t.8 = 0N
mb2.8   agreement t.8 = 0T
mb2.9   disagreement t.8 = 1i
mb2.10  disagreement t.8 = 1d
mb2.11  disagreement t.8 = 2ns
mb2.12  disagreement t.8 = 2ts
mb2.13  disagreement t.8 = 3i
mb2.14  disagreement t.8 = 3d
mb2.15  disagreement t.8 = 3i at "utterance" end (end-of-line)
mb2.16  N break misplacements: occurrences of <1i, 1d> or <1d, 1i> on two consecutive word boundaries with identical speaker name
mb2.17  T break misplacements: occurrences of <3i, 3d> or <3d, 3i> on two consecutive word boundaries with identical speaker name

Ternary comparison E1-E2-C (all measures, including mt.2 to mt.7, refer to positions evaluated by both E1 and E2)
mt.1   word boundaries evaluated by both E1 and E2
mt.2   T breaks by the original annotation C
mt.3   N breaks by the original annotation C
mt.4   T breaks by evaluator E1
mt.5   N breaks by evaluator E1
mt.6   T breaks by evaluator E2
mt.7   N breaks by evaluator E2
mt.8   total agreement (t.7 = t.8 = 0O) on O boundaries
mt.9   total agreement (t.7 = t.8 = 0T) on T breaks
mt.10  total agreement (t.7 = t.8 = 0N) on N breaks
mt.11  total disagreement (t.4 ≠ t.5 ≠ t.6)
mt.12  agreement between evaluators (t.5 = t.6, t.7 = t.8)
mt.13  agreement between evaluators on 1i (t.7 = t.8 = 1i)
mt.14  agreement between evaluators on 1d (t.7 = t.8 = 1d)
mt.15  agreement between evaluators on 2ns (t.7 = t.8 = 2ns)
mt.16  agreement between evaluators on 2ts (t.7 = t.8 = 2ts)
mt.17  agreement between evaluators on 3i (t.7 = t.8 = 3i)
mt.18  agreement between evaluators on 3d (t.7 = t.8 = 3d)
mt.19  partial agreement (t.7 = 0O or t.8 = 0O, t.7 ≠ t.8) on O boundaries
mt.20  partial agreement (t.7 = 0N or t.8 = 0N, t.7 ≠ t.8) on N breaks
mt.21  partial agreement (t.7 = 0T or t.8 = 0T, t.7 ≠ t.8) on T breaks
mt.22  partial agreement on 1i: (t.7 = 1i and t.8 = 0O) or (t.8 = 1i and t.7 = 0O)
mt.23  partial agreement on 1d: (t.7 = 1d and t.8 = 0N) or (t.8 = 1d and t.7 = 0N)
mt.24  partial agreement on 2ns: (t.7 = 2ns and t.8 = 0N) or (t.8 = 2ns and t.7 = 0N)
mt.25  partial agreement on 2ts: (t.7 = 2ts and t.8 = 0T) or (t.8 = 2ts and t.7 = 0T)
mt.26  partial agreement on 3i: (t.7 = 3i and t.8 = 0O) or (t.8 = 3i and t.7 = 0O)
mt.27  partial agreement on 3d: (t.7 = 3d and t.8 = 0T) or (t.8 = 3d and t.7 = 0T)
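The counting itself amounts to a single pass over the ternary records. The sketch below (illustrative key names, only a representative subset of the measures, and only the '?' exclusion handled) also shows how node measures are obtained by summing the per-file counters along the corpus tree:

```python
from collections import Counter

def file_measures(records: list[dict]) -> Counter:
    """Count a representative subset of the m.*, mb.* and mt.* events."""
    m = Counter()
    for r in records:
        if r["ref"] == "?":                    # m.3: excluded position
            m["m3_excluded"] += 1
            continue
        m["m2_W"] += 1                         # m.2: word boundaries
        if r["ref"] == "T": m["m4_T_C"] += 1   # m.4: T breaks in C
        if r["ref"] == "N": m["m5_N_C"] += 1   # m.5: N breaks in C
        m["mb1_" + r["c_e1"]] += 1             # mb1.6-mb1.14: tags vs E1
        m["mb2_" + r["c_e2"]] += 1             # mb2.6-mb2.14: tags vs E2
        if r["c_e1"] == r["c_e2"] == "0T":     # mt.9: total agreement on T
            m["mt9_0T"] += 1
    return m

# A node of the corpus tree sums the counters of the files it subsumes:
# node_measures = sum(map(file_measures, files_of_node), Counter())
```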
All the measures described above have been applied to each leaf node of the corpus structure, i.e. to each evaluated dialogue/monologue file.

3.5 Statistics

Starting from the above computations, a number of statistics can be obtained. Here we describe those that we have implemented, taking as a starting point the Annex 1 requirements reported above (3.1), which recommended the computation of Kappa statistics and of percentages of consensus on prosodic breaks. A further measure initially considered was the pair of Precision and Recall indexes. This measure was discarded because it applies to cases where a sequence of values (tags) is compared with a reference sequence, taken as the correct one. This was not the case for the C-ORAL-ROM evaluation, where neither the original annotation nor the evaluators' choices could be taken as a correct reference.

All the statistics described below were applied to the evaluation data. The results are presented in Chapter 4.

3.5.1 Percentages

The following percentages have been calculated (a computational sketch is given after the lists).

For each evaluator (binary comparison; in the formulas, mb.N stands for mb1.N or mb2.N, depending on the evaluator considered):
• Generic confirmation: percentage of T breaks evaluated as 0T or 2ts: 100*(mb.8+mb.12)/mb.2
• Specific confirmation: percentage of T breaks evaluated as 0T: 100*mb.8/mb.2
• Terminal missing: percentage of W's evaluated as 3i: 100*mb.13/mb.1
• Added terminal: percentage of N breaks evaluated as 2ns: 100*mb.11/mb.3
• Non-terminal missing: percentage of W's evaluated as 1i: 100*mb.9/mb.1
• Activity rate: percentage of W's actually modified (insertions, deletions, substitutions): 100*(mb.9+mb.10+mb.11+mb.12+mb.13+mb.14)/mb.1
• N break misplacement rate: percentage of N insertions and deletions that may be considered N misplacements: 100*2*mb.16/(mb.9+mb.10)
• T break misplacement rate: percentage of T insertions and deletions that may be considered T misplacements: 100*2*mb.17/(mb.13+mb.14)

Percentages of consensus on terminal and non-terminal breaks (ternary comparison):
• Strong disagreement with respect to terminal tags
  o percentage of T breaks evaluated as 3d: 100*mt.18/mt.2
  o percentage of T breaks evaluated as 2ts: 100*mt.16/mt.2
  o (percentage of W's evaluated as 2ns: 100*mt.15/mt.1)
  o (percentage of W's evaluated as 3i: 100*mt.17/mt.1)
• Strong disagreement with respect to non-terminal tags
  o percentage of N breaks evaluated as 1d: 100*mt.14/mt.3
  o percentage of N breaks evaluated as 2ns: 100*mt.15/mt.3
  o (percentage of W's evaluated as 2ts: 100*mt.16/mt.1)
  o (percentage of W's evaluated as 1i: 100*mt.13/mt.1)
• Partial consensus
  o percentage of partially agreed T breaks, 0T vs. 3d/2ts: 100*mt.21/mt.2
  o percentage of partially agreed T breaks, 0T vs. 3d: 100*mt.27/mt.2
  o percentage of partially agreed W's: 100*(mt.19+mt.20+mt.21)/mt.1
• Total agreement
  o percentage of T breaks evaluated as 0T: 100*mt.9/mt.2
  o percentage of N breaks evaluated as 0N: 100*mt.10/mt.3
  o percentage of O boundaries evaluated as 0O: 100*mt.8/(mt.1-mt.2-mt.3)
  o percentage of totally agreed W's: 100*(mt.8+mt.9+mt.10)/mt.1
• Global disagreement
  o percentage of W's disconfirmed by at least one evaluator: 100*(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18+mt.19+mt.20+mt.21)/mt.1
• Consensus in the disagreement
  o percentage of globally disagreed W's that are actually cases of strong disagreement: 100*(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18)/(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18+mt.19+mt.20+mt.21)
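As announced above, a sketch of the binary-comparison percentages, reading the counts from a dictionary keyed by the generic mb.N names (the key names are illustrative, not a format of the original tools):

```python
def pct(num: float, den: float) -> float:
    """Percentage with a guard against empty denominators."""
    return 100.0 * num / den if den else 0.0

def binary_percentages(mb: dict) -> dict:
    """Compute the per-evaluator percentages of 3.5.1 from mb.N counts."""
    mods = ("1i", "1d", "2ns", "2ts", "3i", "3d")
    return {
        "generic_confirmation":  pct(mb["0T"] + mb["2ts"], mb["T_C"]),  # /mb.2
        "specific_confirmation": pct(mb["0T"], mb["T_C"]),              # /mb.2
        "terminal_missing":      pct(mb["3i"], mb["W"]),                # /mb.1
        "added_terminal":        pct(mb["2ns"], mb["N_C"]),             # /mb.3
        "non_terminal_missing":  pct(mb["1i"], mb["W"]),                # /mb.1
        "activity_rate":         pct(sum(mb[t] for t in mods), mb["W"]),
    }
```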
Note that a general measure of agreement like the percentage of totally agreed W's, i.e. the ratio between the number of 0O, 0T and 0N agreement tags and the number of word boundaries W, (n0O+n0T+n0N)/nW, may sound too optimistic. It is useful to compare it with a baseline corresponding to the worst possible "realistic" result: if, for example, all N's and T's were deleted and a comparable number of N's and T's were inserted in different positions, we would still have a number of agreed positions n0O = nW - 2nN - 2nT. Our percentage of agreement should therefore be significantly higher than (nW - 2nN - 2nT)/nW in order to provide a positive evaluation of the original corpus annotation.

3.5.2 Kappa coefficient

The Kappa coefficient measures the agreement among annotators in their coding of data with a finite set of categories. Kappa is defined as:

k = (P(A) - P(E)) / (1 - P(E))

where:
- P(A) is the proportion of times that the annotators actually agree
- P(E) is the probability that the annotators agree by chance

Given the hypothesis that the annotators may have different category distributions, P(E) is given by:

P(E) = Σ_{i=1..N} Π_{j=1..M} Fr(A_j, c_i)

where Fr(A_j, c_i) is the relative frequency with which annotator A_j chooses category c_i (M = 3 annotators, N = 3 categories).

For our purposes we have considered 3 annotators, C-ORAL-ROM (C), evaluator 1 (E1) and evaluator 2 (E2), and we have defined two different Kappa coefficients:
o K1: 3 graders (C, E1, E2), 3 categories (O, T, N)
o K2: 3 graders (C, E1, E2), 2 categories (T, N)

For K1, we have considered as categories the three boundary tags: no break (O), non-terminal break (N) and terminal break (T). P(A) has been calculated as the ratio between the number of total agreements among the evaluators and the number of word boundaries evaluated by both E1 and E2.

For K2 (a more realistic coefficient), we have restricted the set of categories to T and N, and consequently the set of evaluated boundaries to those annotated with T or N by at least one annotator. In this case, P(A) has been calculated as the ratio between the number of total agreements on T or N (mt.9 + mt.10) and the sum of the N and T breaks resulting from C, plus the number of 1i and 3i tags (N or T insertions) by E1 and E2, minus the agreements between evaluators on 1i and 3i (mt.2+mt.3+mb1.9+mb1.13+mb2.9+mb2.13-mt.13-mt.17).
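For concreteness, a self-contained sketch of the K1 computation on toy data (three annotators, categories O/N/T). The function follows the two formulas above; it is not the original evaluation script:

```python
from collections import Counter
from math import prod

def kappa(*annotations: list[str]) -> float:
    """k = (P(A) - P(E)) / (1 - P(E)), annotator-specific distributions."""
    n = len(annotations[0])
    # P(A): proportion of positions where all annotators choose the same tag
    p_a = sum(len(set(tags)) == 1 for tags in zip(*annotations)) / n
    # P(E): sum over categories of the product of per-annotator frequencies
    freqs = [Counter(a) for a in annotations]
    cats = set().union(*annotations)
    p_e = sum(prod(f[c] / n for f in freqs) for c in cats)
    return (p_a - p_e) / (1 - p_e)

c  = list("OOTNOONTO")   # original annotation (toy data)
e1 = list("OOTNOONTO")   # evaluator 1, full agreement with C
e2 = list("OOTNONNTO")   # evaluator 2 disagrees on one position
print(round(kappa(c, e1, e2), 3))   # -> 0.867
```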
4 Results

The present chapter reports the results of the statistics performed on the evaluation data, according to the specifications described in Chapter 3. The results are given separately for each language evaluation sub-corpus and for its relevant subsets, i.e. for its main bipartition into Formal and Informal speech and for its two subsets of Dialogues and Monologues. Note that the Dialogues vs. Monologues distinction does not cover the entire sub-corpus, as the Media and Telephone subsets of the Formal speech section are not further split into dialogues and monologues (see the corpus tree structure described in Chapter 2). Paragraph 4.5 reports a summarizing table, where the main statistics on the different corpora can be compared. Paragraph 4.6 discusses the evaluation results in some detail.

4.1 General Data

The following tables report the total number of word boundaries in each evaluation sub-corpus, compared with the number of word boundaries actually evaluated by each evaluator. They also report the number of T-breaks and N-breaks assigned by the original C-ORAL-ROM annotation and by the two evaluators, respectively.

1- FRENCH CORPUS

Word Boundaries and Breaks   Orig. Annotation   Evaluator 1   Evaluator 2
Total W. Bound.              12893              12776         12831
Terminal Breaks              969                960           1064
  • Formal                   456                454           495
  • Informal                 513                506           569
  • Dialogues                570                563           601
  • Monologues               239                242           289
Non Terminal Br.             1462               1606          1355
  • Formal                   630                684           580
  • Informal                 832                922           775
  • Dialogues                587                658           550
  • Monologues               641                706           585
O Boundaries                 10462              10210         10412

2- ITALIAN CORPUS

Word Boundaries and Breaks   Orig. Annotation   Evaluator 1   Evaluator 2
Total W. Bound.              10925              10900         10892
Terminal Breaks              1372               1359          1346
  • Formal                   520                516           508
  • Informal                 852                843           838
  • Dialogues                876                864           859
  • Monologues               269                271           266
Non Terminal Br.             2403               2436          2605
  • Formal                   1171               1195          1253
  • Informal                 1232               1241          1352
  • Dialogues                1136               1136          1224
  • Monologues               813                832           903
O Boundaries                 7150               7105          6941

3- PORTUGUESE CORPUS

Word Boundaries and Breaks   Orig. Annotation   Evaluator 1   Evaluator 2
Total W. Bound.              12958              12933         12534
Terminal Breaks              1483               1484          1388
  • Formal                   668                674           637
  • Informal                 815                810           751
  • Dialogues                871                873           797
  • Monologues               309                305           302
Non Terminal Br.             2604               2647          2556
  • Formal                   1169               1184          1152
  • Informal                 1435               1463          1404
  • Dialogues                1185               1204          1148
  • Monologues               894                908           886
O Boundaries                 8871               8802          8590

4- SPANISH CORPUS

Word Boundaries and Breaks   Orig. Annotation   Evaluator 1   Evaluator 2
Total W. Bound.              11512              11474         11512
Terminal Breaks              1107               1074          1112
  • Formal                   398                384           402
  • Informal                 709                690           710
  • Dialogues                747                727           748
  • Monologues               188                185           189
Non Terminal Br.             2463               2561          2432
  • Formal                   1304               1368          1297
  • Informal                 1159               1193          1135
  • Dialogues                1084               1137          1075
  • Monologues               847                866           822
O Boundaries                 7942               7839          7968

4.2 Binary Comparison

The following tables report the statistics on the evaluation data by each single evaluator. For each evaluator, the data are obtained via a binary comparison of his annotation with the original C-ORAL-ROM annotation. All the figures are computed on the positions actually evaluated by the given evaluator, i.e. excluding those marked with asterisks.

4.2.1 Binary Comparison: Terminal Break Confirmation

The following percentages, relative to the main subsets of each language sub-corpus, show to what extent each evaluator confirmed the original Terminal Break tags. Given the number of original T-breaks (among the positions not excluded by the evaluator), the first value is the percentage (100*mb.8/mb.2) of specifically confirmed T-breaks, i.e. those evaluated as 0T, while the second one is the percentage (100*(mb.8+mb.12)/mb.2) of non-deleted T-breaks, i.e. those evaluated as 0T or 2ts.
1- FRENCH CORPUS

Specific Confirmation (0T)   Evaluator 1   Evaluator 2
  • Formal                   94,74%        100%
  • Informal                 96,25%        100%
  • Dialogues                96,43%        100%
  • Monologues               92,75%        100%

Generic Confirmation (0T or 2ts)   Evaluator 1   Evaluator 2
  • Formal                         100%          100%
  • Informal                       100%          100%
  • Dialogues                      100%          100%
  • Monologues                     100%          100%

2- ITALIAN CORPUS

Specific Confirmation (0T)   Evaluator 1   Evaluator 2
  • Formal                   98,95%        95,57%
  • Informal                 98,98%        97,36%
  • Dialogues                98,50%        97,74%
  • Monologues               100%          97,52%

Generic Confirmation (0T or 2ts)   Evaluator 1   Evaluator 2
  • Formal                         100%          100%
  • Informal                       99,89%        100%
  • Dialogues                      99,89%        100%
  • Monologues                     100%          100%

3- PORTUGUESE CORPUS

Specific Confirmation (0T)   Evaluator 1   Evaluator 2
  • Formal                   98,54%        99,82%
  • Informal                 98,13%        99,36%
  • Dialogues                98,73%        99,54%
  • Monologues               97,51%        99,68%

Generic Confirmation (0T or 2ts)   Evaluator 1   Evaluator 2
  • Formal                         100%          100%
  • Informal                       100%          100%
  • Dialogues                      100%          100%
  • Monologues                     100%          100%

4- SPANISH CORPUS

Specific Confirmation (0T)   Evaluator 1   Evaluator 2
  • Formal                   90,33%        98,94%
  • Informal                 95,29%        100%
  • Dialogues                95,85%        100%
  • Monologues               95,34%        100%

Generic Confirmation (0T or 2ts)   Evaluator 1   Evaluator 2
  • Formal                         97,87%        98,94%
  • Informal                       100%          100%
  • Dialogues                      100%          100%
  • Monologues                     100%          100%

4.2.2 Binary Comparison: Terminal Missing

For each evaluator and each corpus section, the following figures give the percentage of evaluated word boundaries where the evaluator inserted a Terminal Break (100*mb.13/mb.1), i.e. positions evaluated as 3i.

1- FRENCH CORPUS

Terminal Missing (3i)   Evaluator 1   Evaluator 2
  • Formal              0,01%         0,02%
  • Informal            0%            0,03%
  • Dialogues           0%            0,01%
  • Monologues          0,01%         0,05%

2- ITALIAN CORPUS

Terminal Missing (3i)   Evaluator 1   Evaluator 2
  • Formal              0%            0%
  • Informal            0%            0%
  • Dialogues           0%            0%
  • Monologues          0%            0%

3- PORTUGUESE CORPUS

Terminal Missing (3i)   Evaluator 1   Evaluator 2
  • Formal              0%            0%
  • Informal            0,05%         0,06%
  • Dialogues           0,05%         0,04%
  • Monologues          0%            0,04%

4- SPANISH CORPUS

Terminal Missing (3i)   Evaluator 1   Evaluator 2
  • Formal              0%            0,02%
  • Informal            0%            0,02%
  • Dialogues           0%            0,02%
  • Monologues          0%            0%

4.2.3 Binary Comparison: Non Terminal Missing

For each evaluator and each corpus section, the following figures give the percentage of evaluated word boundaries where the evaluator inserted a Non Terminal Break (100*mb.9/mb.1), i.e. positions evaluated as 1i.

1- FRENCH CORPUS

Non Terminal Missing (1i)   Evaluator 1   Evaluator 2
  • Formal                  1,50%         0,42%
  • Informal                1,25%         0,45%
  • Dialogues               1,60%         0,36%
  • Monologues              1,26%         0,62%

2- ITALIAN CORPUS

Non Terminal Missing (1i)   Evaluator 1   Evaluator 2
  • Formal                  1,10%         2,58%
  • Informal                0,83%         2,50%
  • Dialogues               0,52%         2,20%
  • Monologues              1,47%         3,34%

3- PORTUGUESE CORPUS

Non Terminal Missing (1i)   Evaluator 1   Evaluator 2
  • Formal                  0,73%         0,25%
  • Informal                0,44%         0,18%
  • Dialogues               0,54%         0,24%
  • Monologues              0,35%         0,05%

4- SPANISH CORPUS

Non Terminal Missing (1i)   Evaluator 1   Evaluator 2
  • Formal                  1,19%         0,96%
  • Informal                0,46%         0,39%
  • Dialogues               0,71%         0,56%
  • Monologues              1,16%         0,74%

4.2.4 Binary Comparison: Non Terminal Deletion

For each evaluator and each corpus section, the following figures give the percentage of Non Terminal Breaks deleted by the evaluator (100*mb.10/mb.3), i.e. positions evaluated as 1d.
1- FRENCH CORPUS

Non Terminal Deletion (1d)   Evaluator 1   Evaluator 2
  • Formal                   0,48%         6,84%
  • Informal                 0,72%         4,93%
  • Dialogues                0,34%         4,77%
  • Monologues               0,94%         7,18%

2- ITALIAN CORPUS

Non Terminal Deletion (1d)   Evaluator 1   Evaluator 2
  • Formal                   3,76%         4,87%
  • Informal                 3,57%         3,42%
  • Dialogues                3,88%         4,15%
  • Monologues               3,20%         2,71%

3- PORTUGUESE CORPUS

Non Terminal Deletion (1d)   Evaluator 1   Evaluator 2
  • Formal                   0,60%         0,17%
  • Informal                 0,20%         0,50%
  • Dialogues                0,50%         0,44%
  • Monologues               0,11%         0,45%

4- SPANISH CORPUS

Non Terminal Deletion (1d)   Evaluator 1   Evaluator 2
  • Formal                   0,84%         3,53%
  • Informal                 0,95%         3,80%
  • Dialogues                0,83%         3,61%
  • Monologues               0,94%         4,25%

4.2.5 Binary Comparison: Added Terminal

For each evaluator and each corpus section, the following figures give the percentage of (evaluated) original N-breaks that were substituted with a T-break, i.e. positions evaluated as 2ns (100*mb.11/mb.3).

1- FRENCH CORPUS

Added Terminal (2ns)   Evaluator 1   Evaluator 2
  • Formal             4,72%         4,85%
  • Informal           1,29%         5,20%
  • Dialogues          3,05%         4,15%
  • Monologues         3,11%         6,51%

2- ITALIAN CORPUS

Added Terminal (2ns)   Evaluator 1   Evaluator 2
  • Formal             0,22%         0,10%
  • Informal           0,35%         0,04%
  • Dialogues          0,05%         0,29%
  • Monologues         0,17%         0,18%

3- PORTUGUESE CORPUS

Added Terminal (2ns)   Evaluator 1   Evaluator 2
  • Formal             0,93%         0%
  • Informal           0,33%         0,41%
  • Dialogues          0,69%         0,40%
  • Monologues         0,17%         0%

4- SPANISH CORPUS

Added Terminal (2ns)   Evaluator 1   Evaluator 2
  • Formal             1,68%         0,15%
  • Informal           1,45%         0%
  • Dialogues          1,35%         0%
  • Monologues         0,48%         0,15%

4.2.6 Binary Comparison: Activity Rate

A measure of the evaluators' intervention rate is given by the following figures, showing the percentage of evaluated word boundaries that were actually modified by each evaluator, i.e. evaluated as 1i, 1d, 2ns, 2ts, 3i or 3d (100*(mb.9+mb.10+mb.11+mb.12+mb.13+mb.14)/mb.1).

1- FRENCH CORPUS

Activity Rate    Evaluator 1   Evaluator 2
  • Total        2,22%         1,66%
  • Formal       2,61%         1,45%
  • Informal     1,73%         1,64%
  • Dialogues    2,39%         1,22%
  • Monologues   2,16%         2,35%

2- ITALIAN CORPUS

Activity Rate    Evaluator 1   Evaluator 2
  • Total        2,07%         4,17%
  • Formal       2,32%         4,51%
  • Informal     1,82%         3,81%
  • Dialogues    1,89%         3,85%
  • Monologues   2,27%         4,51%

3- PORTUGUESE CORPUS

Activity Rate    Evaluator 1   Evaluator 2
  • Total        1,01%         0,41%
  • Formal       1,35%         0,29%
  • Informal     0,88%         0,56%
  • Dialogues    1,08%         0,49%
  • Monologues   0,76%         0,33%

4- SPANISH CORPUS

Activity Rate    Evaluator 1   Evaluator 2
  • Total        1,96%         1,42%
  • Formal       2,57%         1,78%
  • Informal     1,66%         1,02%
  • Dialogues    1,79%         1,15%
  • Monologues   1,82%         1,73%

4.2.7 Binary Comparison: Misplacement Rate

The consecutive occurrence of a break insertion and a break deletion (or vice versa) may suggest that the evaluator judged the original tag as merely misplaced. In this case his two actions should actually count as a single "break move". The following figures show, for each evaluator and each break type, the percentage of break insertions and deletions that correspond to break moves. The percentages for N breaks and T breaks are obtained respectively by the formulas 100*2*mb.16/(mb.9+mb.10) and 100*2*mb.17/(mb.13+mb.14).
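A sketch of the misplacement count on the sequence of agreement tags of one file (the input lists are illustrative; the real measures mb.16 and mb.17 are computed from the comparison files):

```python
def break_moves(tags: list[str], speakers: list[str],
                ins: str, dele: str) -> int:
    """Count <insertion, deletion> or <deletion, insertion> pairs on two
    consecutive word boundaries with an identical speaker name."""
    return sum(
        speakers[i] == speakers[i + 1]
        and {tags[i], tags[i + 1]} == {ins, dele}
        for i in range(len(tags) - 1)
    )

tags     = ["0O", "1i", "1d", "0N", "0T"]        # toy agreement tags
speakers = ["PAT", "PAT", "PAT", "ROS", "ROS"]   # toy speaker names
n_moves = break_moves(tags, speakers, "1i", "1d")   # mb.16 -> 1
t_moves = break_moves(tags, speakers, "3i", "3d")   # mb.17 -> 0
# N misplacement rate: 100 * 2 * n_moves / (n_1i + n_1d)
```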
1- FRENCH CORPUS

Misplacement Rate       Evaluator 1   Evaluator 2
Terminal Breaks         0%            0%
Non Terminal Breaks     0,38%         0,29%

2- ITALIAN CORPUS

Misplacement Rate       Evaluator 1   Evaluator 2
Terminal Breaks         0%            0%
Non Terminal Breaks     5,06%         1,97%

3- PORTUGUESE CORPUS

Misplacement Rate       Evaluator 1   Evaluator 2
Terminal Breaks         0%            0%
Non Terminal Breaks     0%            0%

4- SPANISH CORPUS

Misplacement Rate       Evaluator 1   Evaluator 2
Terminal Breaks         0%            0%
Non Terminal Breaks     2,38%         1,28%

4.3 Ternary Comparison

4.3.1 Ternary Comparison: Strong Disagreement on prosodic breaks

The following figures show the percentage of cases where both evaluators disagreed with the original annotation and agreed in their evaluation tag. All the figures are relative to word boundaries evaluated by both evaluators.

1- Percentage of the original T-breaks deleted by both evaluators, i.e. evaluated as 3d by both (100*mt.18/mt.2). In all four corpora this percentage is 0% in every section (Formal, Informal, Dialogues, Monologues).

2- Percentage of the original T-breaks substituted with an N-break by both evaluators, i.e. evaluated as 2ts by both (100*mt.16/mt.2):

Strong Disagreement on T-Breaks, T->N substitution (2ts)
                 French   Italian   Portuguese   Spanish
  • Formal       0%       0,56%     0%           0%
  • Informal     0%       0,25%     0,47%        0%
  • Dialogues    0%       0,64%     0,30%        0%
  • Monologues   0%       0%        0,32%        0%
3- Percentage of the original word boundaries where an N-break was substituted with a T-break by both evaluators, i.e. positions evaluated as 2ns by both (100*mt.15/mt.1):

Strong Disagreement, N->T substitution (2ns) on W
                 French   Italian   Portuguese   Spanish
  • Formal       0,18%    0%        0%           0,02%
  • Informal     0,03%    0%        0,01%        0%
  • Dialogues    0,08%    0%        0,01%        0%
  • Monologues   0,17%    0%        0%           0%

4- Percentage of the original word boundaries where both evaluators inserted a T-break, i.e. positions evaluated as 3i by both (100*mt.17/mt.1):

Strong Disagreement, terminal insertion (3i) on W
                 French   Italian   Portuguese   Spanish
  • Formal       0,01%    0%        0%           0%
  • Informal     0%       0%        0,04%        0%
  • Dialogues    0%       0%        0,04%        0%
  • Monologues   0,01%    0%        0%           0%

5- Percentage of the original N-breaks deleted by both evaluators, i.e. evaluated as 1d by both (100*mt.14/mt.3):

Strong Disagreement on N-Breaks, non-terminal deletion (1d)
                 French   Italian   Portuguese   Spanish
  • Formal       0,25%    3,09%     0%           0,24%
  • Informal     0,21%    1,11%     0%           0,33%
  • Dialogues    0,19%    2,34%     0%           0,28%
  • Monologues   0,42%    1,38%     0%           0,18%

6- Percentage of the original N-breaks substituted with a T-break by both evaluators, i.e. evaluated as 2ns by both (100*mt.15/mt.3):

Strong Disagreement on N-breaks, N->T substitution (2ns)
                 French   Italian   Portuguese   Spanish
  • Formal       2,39%    0%        0%           0,08%
  • Informal     0,27%    0%        0,05%        0%
  • Dialogues    1,28%    0%        0,05%        0%
  • Monologues   1,53%    0%        0%           0%
7- Percentage of the original word boundaries where a T-break was substituted with an N-break by both evaluators, i.e. positions evaluated as 2ts by both (100*mt.16/mt.1):

Strong Disagreement, T->N substitution (2ts) on W
                 French   Italian   Portuguese   Spanish
  • Formal       0%       0,10%     0%           0%
  • Informal     0%       0,03%     0,09%        0%
  • Dialogues    0%       0,12%     0,06%        0%
  • Monologues   0%       0%        0,04%        0%

8- Percentage of the original word boundaries where both evaluators inserted an N-break, i.e. positions evaluated as 1i by both (100*mt.13/mt.1):

Strong Disagreement on N-breaks, non-terminal insertion (1i)
                 French   Italian   Portuguese   Spanish
  • Formal       0,08%    0,74%     0,10%        0,38%
  • Informal     0,10%    0,66%     0,10%        0,09%
  • Dialogues    0,05%    0,36%     0,15%        0,19%
  • Monologues   0,23%    1,19%     0%           0,35%

4.3.2 Ternary Comparison: Partial Consensus

The following figures concern the cases of disagreement between the evaluators, where the original annotation was confirmed by one evaluator but modified by the other one. Here again, the set of relevant positions includes only those evaluated by both evaluators.

1- Percentage of the original T-breaks confirmed by one evaluator and modified (deleted or substituted) by the other (100*mt.21/mt.2):

Partial Consensus on T-breaks, 0T vs. 3d or 2ts
                 French   Italian   Portuguese   Spanish
  • Total        4,36%    2,42%     1,51%        5,16%
  • Formal       5,26%    2,89%     1,45%        7,54%
  • Informal     3,75%    2,96%     1,56%        4,71%
  • Dialogues    3,57%    2,28%     1,13%        4,15%
  • Monologues   7,25%    2,48%     2,20%        4,66%

2- Percentage of the original T-breaks confirmed by one evaluator and deleted by the other (100*mt.27/mt.2). In all four corpora this percentage is 0% in every section (Formal, Informal, Dialogues, Monologues).

3- Percentage of word boundaries confirmed by one evaluator and modified (1d, 1i, 2ns, 2ts, 3d, 3i) by the other (100*(mt.19+mt.20+mt.21)/mt.1):
Partial Consensus on word boundaries
                 French   Italian   Portuguese   Spanish
  • Total        3,18%    3,65%     0,93%        2,56%
  • Formal       3,46%    3,82%     1,44%        3,26%
  • Informal     3,01%    3,64%     0,83%        2,36%
  • Dialogues    3,28%    3,61%     0,97%        2,44%
  • Monologues   3,54%    3,68%     0,96%        2,75%

4.3.3 Ternary Comparison: Total Agreement

The following figures concern the cases where both evaluators confirmed the original annotation. Here again, the set of relevant positions includes only those evaluated by both evaluators.

1- Percentage of the original T-breaks confirmed by both evaluators, i.e. evaluated as 0T (100*mt.9/mt.2):

Total Agreement on T-breaks, 0T
                 French   Italian   Portuguese   Spanish
  • Total        95,63%   97,14%    98,12%       94,84%
  • Formal       94,74%   96,07%    98,55%       93,70%
  • Informal     96,25%   96,68%    97,97%       95,29%
  • Dialogues    96,43%   96,96%    98,58%       95,85%
  • Monologues   92,75%   97,52%    97,47%       95,34%

2- Percentage of the original N-breaks confirmed by both evaluators, i.e. evaluated as 0N (100*mt.10/mt.3):

Total Agreement on N-breaks, 0N
                 French   Italian   Portuguese   Spanish
  • Total        86,56%   93,15%    98,38%       94,62%
  • Formal       84,98%   90,50%    97,63%       94,69%
  • Informal     87,75%   93,48%    97,30%       91,78%
  • Dialogues    75,01%   92,93%    98,06%       94,71%
  • Monologues   84,61%   94,71%    98,95%       94,81%

3- Percentage of the original O-boundaries confirmed by both evaluators, i.e. evaluated as 0O (100*mt.8/(mt.1-mt.2-mt.3)):

Total Agreement on O Boundaries, 0O
                 French   Italian   Portuguese   Spanish
  • Total        97,54%   95,10%    99,22%       98,28%
  • Formal       97,18%   94,39%    98,78%       96,52%
  • Informal     97,89%   96,02%    99,25%       98,88%
  • Dialogues    97,13%   95,10%    99,10%       98,43%
  • Monologues   97,97%   94,92%    99,41%       97,81%

4- Percentage of the evaluated word boundaries where the original annotation was confirmed by both evaluators, i.e.
evaluated as 0T, 0N or 0O (100*(mt.8+mt.9+mt.10)/mt.1):

Total Agreement Rate
                 French   Italian   Portuguese   Spanish
  • Total        96,48%   95,21%    98,93%       97,17%
  • Formal       96,23%   94,67%    98,46%       95,19%
  • Informal     96,81%   95,38%    98,89%       97,49%
  • Dialogues    96,56%   95,33%    98,75%       97,31%
  • Monologues   95,97%   94,79%    98,96%       96,85%

4.3.4 Ternary Comparison: Global Disagreement on prosodic breaks

As a measure complementary to the Total Agreement Rate, the following Global Disagreement Rate is calculated as the percentage of evaluated word boundaries disconfirmed by at least one evaluator: 100*(mt.1-(mt.8+mt.9+mt.10))/mt.1, or equivalently 100*(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18+mt.19+mt.20+mt.21)/mt.1.

Global Disagreement Rate
                 French   Italian   Portuguese   Spanish
  • Total        3,52%    4,78%     1,07%        2,83%
  • Formal       3,77%    5,33%     1,54%        3,30%
  • Informal     3,17%    4,60%     1,08%        2,39%
  • Dialogues    3,44%    4,65%     1,23%        2,63%
  • Monologues   4,02%    5,21%     1,00%        2,66%

4.3.5 Ternary Comparison: Consensus in the Disagreement

The following figures show the percentage of globally disagreed boundaries (i.e. positions where at least one evaluator disagreed with the original annotation) that were actually cases of strong disagreement (i.e. cases where both evaluators disconfirmed the original annotation). The figures were obtained by the following formula: 100*(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18)/(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18+mt.19+mt.20+mt.21).

Consensus in the Disagreement
                 French   Italian   Portuguese   Spanish
  • Total        8,97%    23,50%    12,12%       9,26%
  • Formal       10,63%   27,27%    2,98%        10,87%
  • Informal     7,53%    19,92%    21,54%       7,14%
  • Dialogues    5,55%    18,75%    21,52%       9,74%
  • Monologues   13,82%   28,57%    3,22%        6,82%

4.4 Kappa coefficients

The following tables show the two Kappa coefficients, as defined in Paragraph 3.5.2. The two coefficients were calculated for each leaf of the corpus tree and then averaged over the different tree nodes.

1- FRENCH CORPUS

Kappa coefficients      Kappa General   Kappa Realistic
MEDIA node              0,953           0,853
NAT. CONTEXT node       0,923           0,760
TELEPHONE node          0,925           0,773
FORMAL node (total)     0,989           0,765
FAMILY PRIVATE node     0,933           0,775
PUBLIC node             0,924           0,765
INFORMAL node (total)   0,924           0,767
DIALOGUES               0,973           0,790
MONOLOGUES              0,907           0,675
TOTAL                   0,952           0,766
2- ITALIAN CORPUS

Kappa coefficients      Kappa General   Kappa Realistic
MEDIA node              0,917           0,768
NAT. CONTEXT node       0,927           0,786
TELEPHONE node          0,921           0,823
FORMAL node (total)     0,921           0,785
FAMILY PRIVATE node     0,934           0,828
PUBLIC node             0,938           0,823
INFORMAL node (total)   0,935           0,826
DIALOGUES               0,936           0,839
MONOLOGUES              0,922           0,779
TOTAL                   0,928           0,807

3- PORTUGUESE CORPUS

Kappa coefficients      Kappa General   Kappa Realistic
MEDIA node              0,969           0,890
NAT. CONTEXT node       0,979           0,893
TELEPHONE node          0,982           0,901
FORMAL node (total)     0,975           0,893
FAMILY PRIVATE node     0,985           0,950
PUBLIC node             0,978           0,931
INFORMAL node (total)   0,984           0,946
DIALOGUES               0,981           0,921
MONOLOGUES              0,985           0,944
TOTAL                   0,980           0,920

4- SPANISH CORPUS

Kappa coefficients      Kappa General   Kappa Realistic
MEDIA node              0,918           0,737
NAT. CONTEXT node       0,949           0,807
TELEPHONE node          0,945           0,865
FORMAL node (total)     0,930           0,772
FAMILY PRIVATE node     0,960           0,883
PUBLIC node             0,968           0,890
INFORMAL node (total)   0,962           0,885
DIALOGUES               0,959           0,880
MONOLOGUES              0,952           0,818
TOTAL                   0,946           0,827

4.5 Summarizing Table

Boundaries and Breaks                        French       Italian      Portuguese   Spanish
Total w. boundaries                          12893        10925        12958        11512
Evaluated w. boundaries (Evaluator 1)        12776        10900        12933        11474
Evaluated w. boundaries (Evaluator 2)        12831        10892        12534        11512
Not evaluated positions (Evaluator 1)        117 (0,9%)   25 (0,2%)    25 (0,2%)    38 (0,3%)
Not evaluated positions (Evaluator 2)        62 (0,5%)    33 (0,3%)    424 (3,2%)   0

Binary Comparisons                           French       Italian      Portuguese   Spanish
Specific Confirmation of T (Evaluator 1)     96,12%       98,8%        98,7%        94,1%
Specific Confirmation of T (Evaluator 2)     100%         97,12%       99,4%        99%
Generic Confirmation of T (Evaluator 1)      100%         99,9%        100%         99,8%
Generic Confirmation of T (Evaluator 2)      100%         100%         100%         99,7%
Terminal Missing (Evaluator 1)               0,01%        0%           0,03%        0%
Terminal Missing (Evaluator 2)               0,02%        0%           0,04%        0,02%
Non Terminal Missing (Evaluator 1)           1,59%        1,05%        0,62%        1%
Non Terminal Missing (Evaluator 2)           0,46%        2,75%        0,2%         0,7%
Added Terminal N->T (Evaluator 1)            2,95%        0,2%         0,5%         1,57%
Added Terminal N->T (Evaluator 2)            5,01%        0,16%        0,4%         0,12%
Activity Rate (Evaluator 1)                  2,22%        2,07%        1,01%        1,96%
Activity Rate (Evaluator 2)                  1,66%        4,17%        0,41%        1,42%
Misplacement Rate (Evaluator 1)              0,19%        5%           0%           1,19%
Misplacement Rate (Evaluator 2)              0,14%        0,98%        0%           0,65%

Ternary Comparisons                          French       Italian      Portuguese   Spanish
Strong Disagreement 3d on T                  0%           0%           0%           0%
Strong Disagreement 2ts on T                 0%           0,48%        0,35%        0%
Strong Disagreement 2ns on W                 0,1%         0%           0,01%        0,01%
Strong Disagreement 3i on W                  0,01%        0%           0,03%        0%
Strong Disagreement 1d on N                  0,3%         1,87%        0%           0,25%
Strong Disagreement 2ns on N                 1,04%        0%           0,03%        0,05%
Strong Disagreement 2ts on W                 0%           0,08%        0,06%        0%
Strong Disagreement 1i on N                  0,12%        0,75%        0,10%        0,25%
Partial Consensus on T (0T vs. 3d or 2ts)    4,36%        2,42%        1,51%        5,16%
Partial Consensus on T (0T vs. 3d)           0%           0%           0%           0%
Partial Consensus on W                       3,18%        3,65%        0,93%        2,56%
Total Agreement on T                         95,05%       97,14%       98,12%       94,84%
Total Agreement on N                         86,56%       93,15%       98,38%       94,62%
Total Agreement on O                         97,54%       95,1%        99,22%       98,28%
Global Disagreement                          3,52%        4,78%        1,07%        2,83%
Consensus in Disagreement                    8,97%        23,50%       12,12%       9,26%

K Index (General)                            0,952        0,928        0,980        0,946
K Index (Realistic)                          0,776        0,807        0,920        0,827

4.6 Discussion of results

The four language sub-corpora selected for the evaluation have a size ranging from 10925 (Italian) to 12958 (Portuguese) word boundaries. The number of T-breaks ranges from 969 (French) to 1483 (Portuguese), while N-breaks range from 1462 (French) to 2604 (Portuguese). The percentage of word boundaries that received a break tag in the C-ORAL-ROM annotation is the following in the four corpora:

                                           French   Italian   Portuguese   Spanish
Word boundaries marked with a
prosodic break                             19%      35%       32%          31%

The option of excluding text portions from the evaluation, in case of doubts or unintelligible speech, was seldom applied.
The following table gives the number and percentage of word boundaries excluded by each evaluator:

Not Evaluated Positions   French       Italian     Portuguese   Spanish
Evaluator 1               117 (0,9%)   25 (0,2%)   25 (0,2%)    38 (0,3%)
Evaluator 2               62 (0,5%)    33 (0,3%)   424 (3,2%)   0

Looking at the Binary Comparison statistics on the evaluation data, it is apparent that the evaluators confirmed virtually all Terminal Breaks in the C-ORAL-ROM annotation. The percentage of T-breaks that were not deleted by the evaluators is 100% (with the single exception of the Formal section of the Spanish corpus, where it is around 98%). This means that where the original annotator perceived a terminal break, the evaluators also perceived a break, at least a non-terminal one (Generic Confirmation). The percentages of Specific Confirmation, where the evaluators confirmed that the break was indeed a T-break, are mostly above 95%.

On the other hand, the evaluators seldom perceived T-breaks where the original annotator did not perceive any kind of break, as shown by the Terminal Missing percentages, which are close to 0%. In some cases they perceived a stronger break where the annotator had marked a non-terminal break. This holds particularly for the French corpus, where the percentages of Added Terminals, i.e. N-breaks substituted with T-breaks, range from 1,29% to 6,51% of the original N-breaks. For the other languages, the percentages are mostly below 1%.

The low percentages of break insertions and deletions could be further discounted by taking into account that in some cases a pair <deletion, insertion> may count as a single misplacement. The misplacement rate is indeed zero for T-breaks, but it can reach an average 5% for N-breaks (for an Italian evaluator, who reaches 9,9% in the Formal-Natural Context section).

The comparison of the Activity Rates of the different evaluators (percentage of evaluated word boundaries that were actually modified) shows the lowest values for Portuguese, ranging from 0,5% to 1%, and the highest for Italian, where the rate for Evaluator 2 reaches 4,5%. The French and Spanish evaluators show rates around 2%.

Coming to the Ternary Comparisons, which give a measure of the inter-annotator agreement and of the reliability of the C-ORAL-ROM prosodic tagging, we see that the original annotation was basically confirmed, especially for terminal breaks. The percentages of T-breaks specifically confirmed by both evaluators are above 94% for all languages. In general, these agreement percentages are slightly lower in Formal than in Informal speech, and in Monologues than in Dialogues. The agreement on N-breaks is expectedly lower, with greater differences among the corpora: the values range from 75,01% in the Dialogues of French to 98,95% in the Monologues of Portuguese.

                        French    Italian   Portuguese   Spanish
Total Agreement on T    95,05%    97,14%    98,12%       94,84%
Total Agreement on N    86,56%    93,15%    98,38%       94,62%
Total Agreement on O    97,54%    95,1%     99,22%       98,28%

Taking as reference the whole set of evaluated word boundaries, the most general measure of agreement is the Total Agreement Rate (percentage of boundaries agreed on by both evaluators), together with the complementary Global Disagreement Rate (percentage of boundaries disconfirmed by at least one evaluator). The highest consensus is expressed on the Portuguese corpus, but the values are very close for all languages, ranging from 95% to 98,9%.
                           French    Italian   Portuguese   Spanish
Global Disagreement Rate   3,52%     4,78%     1,07%        2,83%
Total Agreement Rate       96,48%    95,21%    98,93%       97,17%

As discussed in Chapter 3, the percentage of totally agreed word boundaries may sound too optimistic as a general measure of agreement, due to the disproportion between word boundaries and actual candidates for a break (around 30% of the total, as reported above). The following table compares the total agreement percentages with a "baseline" that may be considered the worst possible realistic result, obtained if all N's and T's were deleted and a comparable number of N's and T's were inserted in different positions.

                                   French    Italian   Portuguese   Spanish
Worst possible result (baseline)   62,17%    30,84%    37,28%       37,93%
Total Agreement Rate               96,48%    95,21%    98,93%       97,17%

The total agreement percentages are significantly higher than the baseline.
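A quick check of the baseline formula on the Italian counts of Paragraph 4.1 (the published baselines are computed on the positions evaluated by both evaluators, so the figure below only approximates the 30,84% of the table):

```python
# Italian original annotation: word boundaries, T-breaks, N-breaks (par. 4.1)
n_w, n_t, n_n = 10925, 1372, 2403
baseline = 100 * (n_w - 2 * n_t - 2 * n_n) / n_w
print(f"{baseline:.2f}%")   # -> 30.89%, close to the 30,84% reported above
```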
In a finer analysis of the non-totally-agreed positions measured by the Global Disagreement Rate, i.e. the positions where at least one evaluator expressed his dissent, we note that all disagreements involve non-terminal breaks. In fact, as already noticed in the binary comparisons and shown in Strong Disagreement points 1 and 4, the evaluators neither deleted T-breaks nor inserted them in empty positions.

As for the Strong Disagreement cases where both evaluators substituted an N-break with a T-break (par. 4.3.1, point 6), the percentages are close to 0% except in the case of French, where they are around 2%. More detailed data for French are reported in the following table.

Strong Disagreement Non Terminal – Extended Table 6 (French)
  • Media          0%
  • Nat. Context   3,56%
  • Telephone      3%
  • Fam. Private   0,3%
  • Public         0%
  • Formal         2,39%
  • Informal       0,27%
  • Dialogues      1,28%
  • Monologues     1,53%

For the Strong Disagreement cases where both evaluators deleted an N-break (par. 4.3.1, point 5), the percentages are below 1% except in the case of Italian, for which we report a more detailed table.

Strong Disagreement Non Terminal – Extended Table 5 (Italian)
  • Media          2,12%
  • Nat. Context   2,4%
  • Telephone      6,53%
  • Fam. Private   0%
  • Public         1,43%
  • Formal         3,09%
  • Informal       1,11%
  • Dialogues      2,34%
  • Monologues     1,38%

Strong disagreement is a very marginal phenomenon, as shown in the following table, which compares the Global Disagreement Rate (cases where at least one evaluator disconfirmed the original annotation) with the Partial Consensus Rate (cases where only one evaluator disconfirmed it) and obtains the Strong Disagreement Rate by difference. The percentages of globally disagreed positions that were actually strongly disagreed range from about 9% to 23,5%, as shown in the "Consensus in the Disagreement" table in Paragraph 4.3.5.

                           French    Italian   Portuguese   Spanish
Global Disagreement Rate   3,52%     4,78%     1,07%        2,83%
Partial Consensus Rate     3,18%     3,65%     0,93%        2,56%
Strong Disagreement Rate   0,34%     1,13%     0,14%        0,27%

Finally, the Kappa values are also quite positive. Kappa coefficients measure the reliability of the annotation scheme, that is, the probability of obtaining the same annotation by different evaluators on the same corpus. Kappa values range from 0 to 1, where Kappa = 1 shows the total reliability of the annotation scheme. Such an ideal result is quite unrealistic, due to the intrinsically subjective nature of corpus annotation, so that researchers generally consider any Kappa above 0,6 a positive result.

As discussed in Paragraph 3.5.2, we have calculated two Kappa coefficients: a general one, comparing the behavior of our three subjects (the original annotator and the two evaluators) with respect to the three categories of boundaries (T, N, O); and a more realistic coefficient, limiting the analysis to the two break tags T and N, in order to avoid the positive effect of the high agreement rate on no-break boundaries. Kappa coefficients were calculated on each evaluated dialogue/monologue, considered as a single experiment. The following table gives the average Kappa values on the four language sub-corpora.

                      French    Italian   Portuguese   Spanish
K Index (General)     0,952     0,928     0,980        0,946
K Index (Realistic)   0,776     0,807     0,920        0,827

Both coefficients are largely above the 0,6 threshold, confirming the reliability of the C-ORAL-ROM annotation scheme.

5 Conclusions

This document reports the data obtained by evaluating the C-ORAL-ROM prosodic tagging in a controlled experimental setting. As we observed in the introductory remarks, the main goal of the evaluation was to assess the general replicability of the coding scheme adopted in the annotation of the four corpora. The data reported here warrant the expectation of a good level of replicability for the coding scheme, and this implicitly supports the hypothesis that the annotation of utterances identified through their prosodic profiles captures a relevant perceptual fact. However, we refer the reader to the C-ORAL-ROM Advisors' document for the interpretation of the data presented in this document.