Specifications on the C-ORAL-ROM Corpus
Massimo Moneglia
(University of Florence)
C-ORAL-ROM is available in the ELDA catalogue
http://catalog.elda.org:8080/product_info.php?cPath=37_46&products_id=757
© University of Florence
All rights reserved. No part of these specifications may be reproduced in any form
without the permission of the copyright holder.
Distributed with the C-ORAL-ROM corpus by
ELRA/ELDA
55-57 rue Brillat Savarin
75013 Paris
France
I Introduction
II General Information
1. Contact person
2. Distribution media
3. Content
4. Format of speech and label file
5. Layout of the disk-file system
6. Hardware, Software and Recording Platforms
6.1 Hardware
6.2 Recording Platforms and Software
7. Number of recordings and general corpus data
8. Acoustic quality
III C-ORAL-ROM corpus
1. Corpus Design
1.2 Sampling Parameters
1.3 Sampling strategy and Corpus design
1.4 Comparability
2. Filename conventions
3. Annotation information
3.1 Meta-Data
3.1.2 Rules for the Participant field
3.1.3 Rules for the Class field
3.1.4 Rules for the Situation field
3.1.5 Rules for marking Acoustic quality
3.1.6 Quality assurance on metadata format
3.1.7 Statistics from C-ORAL-ROM metadata
3.1.7.1 Number of speakers
3.1.7.2 Distribution of speakers per geographical origin
3.1.7.3 Completeness of speaker features in the metadata records
3.1.7.4 Completeness of session metadata
3.2 Transcription and Dialogue representation
3.2.1 Basic Concepts for dialogue representation
3.2.2 Turn representation
3.2.3 Utterance representation
3.2.4 Word representation
3.2.5 Transcription
3.2.6 Overlapping
3.2.7 Cross-over dialogue convention
3.2.7.1 Overlapping and cross-over dialogue
3.2.7.2 Intersection of turns
3.2.7.3 Interruption and cross-over dialogue
3.2.8 Transcription conventions for segmental features
3.2.8.1 Non-understandable words
3.2.8.2 Paralinguistic elements
3.2.8.3 Fragments
3.2.8.4 Interjections
3.2.8.5 Non-standard words
3.2.8.6 Non-transcribed words
3.2.8.7 Non-transcribed audio signal
3.2.9 Transcription conventions for human-machine interaction
3.2.10 Quality assurance on format and orthographic transcription
3.3 Prosodic annotation scheme
3.3.1 Principles
3.3.2 Concepts
3.3.3 Theoretical background
3.3.4 Conventions for prosodic tagging in the transcripts: types of prosodic breaks
3.3.4.1 Terminal breaks (utterance limit)
3.3.4.2 Non-terminal breaks
3.3.5 Fragmentation phenomena
3.3.5.1 Interruptions
3.3.5.2 Retracting and/or restart and/or false start(s)
3.3.5.3 Retracting/interruption ambiguity
3.3.6 Summary of prosodic break types
3.3.7 Pauses
3.4 Quality assurance on prosodic tagging
3.5 Dependent lines
3.6 Alignment
3.6.1 Annotation procedure
3.6.2 Prosodic tagging and the alignment unit
3.6.3 Quality assurance on the alignment
3.6.4 WinPitch Corpus
3.6.5 DTD of WinPitch Corpus alignment files
3.7 PoS tagging and lemmatization
3.7.1 Minimal Tag Set requirements
3.7.2 Tagging Format
3.7.3 Frequency Lists Format
3.7.4 Tag sets
3.7.4.1 French tagset
3.7.4.2 Italian tagset
3.7.4.3 Portuguese tagset
3.7.4.4 Spanish tagset
3.7.5 Automatic PoS tagging: tool and evaluation
3.7.5.1 Italian: tool and evaluation
3.7.5.2 French: tool and evaluation
3.7.5.3 Portuguese: tool and evaluation
3.7.5.4 Spanish: tool and evaluation
4. XML Format of the textual resource
4.1 Macro for the translation of C-ORAL-ROM .txt files to .XML
4.1.1 Checking the C-ORAL-ROM format
4.1.2 Generating XML files
4.2.1 Requirements and procedure
4.2.2 Rectifying errors
4.3 C-ORAL-ROM DTD
5. Bibliographical references
APPENDIXES
APPENDIX 1 - TYPICAL EXAMPLES OF PROSODIC BREAK TYPES IN ITALIAN, FRENCH, PORTUGUESE AND SPANISH
Examples of strings with typical non-terminal breaks, generic terminal breaks and interrogative breaks
Examples of strings with typical intentional suspension
Examples of strings with typical interruption
Examples of strings with typical retracting
Examples of retracting/interruption ambiguity
APPENDIX 2 - TAGSETS USED FOR POS TAGGING IN THE FOUR LANGUAGE COLLECTIONS. DETAILED TABLES AND COMPARISON TABLE
French tagset
Italian tagset
Portuguese tagset
Spanish tagset
Synopsis of tag sets
APPENDIX 3 - ORTHOGRAPHIC TRANSCRIPTION CONVENTIONS IN THE FOUR LANGUAGE CORPORA
Italian transcription conventions
French transcription conventions
Portuguese transcription conventions
Spanish transcription conventions
APPENDIX 4 - C-ORAL-ROM PROSODIC TAGGING EVALUATION REPORT
1 Introduction
2 Experimental Setting
3 Measures and Statistics
3.1 Evaluation data
3.2 First step: binary comparison file
3.3 Second step: ternary comparison file
3.4 Third step: measures
3.5 Statistics
3.5.1 Percentages
3.5.2 Kappa coefficient
4 Results
4.1 General Data
4.2 Binary Comparison
4.2.1 Binary Comparison: Terminal Break Confirmation
4.2.2 Binary Comparison: Terminal Missing
4.2.3 Binary Comparison: Non Terminal Missing
4.2.3' Binary Comparison: Non Terminal Deletion
4.2.4 Binary Comparison: Added Terminal
4.2.5 Binary Comparison: Activity Rate
4.2.6 Binary Comparison: Misplacement Rate
4.3 Ternary Comparison
4.3.1 Ternary Comparison: Strong Disagreement on prosodic breaks
4.3.2 Ternary Comparison: Partial Consensus
4.3.3 Ternary Comparison: Total Agreement
4.3.4 Ternary Comparison: Global Disagreement on prosodic breaks
4.3.5 Ternary Comparison: Consensus in the Disagreement
4.4 Kappa coefficients
4.5 Summarizing Table
4.5 Discussion of results
5 Conclusions
I Introduction
The C-ORAL-ROM multilingual resource provides a comparable set of corpora of spontaneous
spoken language of the main romance languages, namely French, Italian, Portuguese and Spanish.
The resource is the result of the C-ORAL-ROM project, undertaken by a European
consortium co-ordinated by the University of Florence and funded within the Fifth EU
framework programme (C-ORAL-ROM IST 2000-26228).
C-ORAL-ROM consists of 772 spoken texts and 123:27:35 hours of speech. Four comparable
recording collections of Italian, French, Portuguese and Spanish spontaneous speech sessions
(roughly 300,000 words for each language) have been delivered respectively by the following
providers:
University of Florence (LABLITA, Laboratorio linguistico del Dipartimento di italianistica);
Université de Provence (DELIC, Description Linguistique Informatisée sur Corpus);
Centro de Linguística da Universidade de Lisboa (CLUL);
Universidad Autónoma de Madrid (Departamento de Lingüística, Laboratorio de Lingüística Informática). [1]
The main objective of C-ORAL-ROM is to provide human language technologies (HLT) based
on spoken language interfaces with challenging language resources (LRs) that represent
spontaneous speech in real environments. The resource aims to represent the variety of speech
acts performed in everyday language and to enable the induction of prosodic and syntactic
structures in the four romance languages, from both a quantitative and a qualitative point of
view. More specifically, the representation of the significant variations found in spontaneous
speech performances in natural environments allows the use of C-ORAL-ROM for comparable
spoken language modelling of the main romance languages, both for linguistic studies and for
language technology purposes. The resource can also be used for testing speech recognition tools
in critical contexts.
The recording conditions and the acoustic quality of the sessions collected in C-ORAL-ROM
are variable. The speech files of the acoustic database are rated on a quality scale (recording,
volume, voice overlapping and noise) and are comparable with respect to it. The quality scale
extends from the highest level of clarity of the voice signal to low levels of acoustic quality.
The quality is gauged spectrographically and is always annotated in the metadata of each session.
See §8 in Chapter II. [2]
In order to ensure a significant representation of the spontaneous speech universe, the corpus
design of the four resources foresees recordings in natural environments in a variety of different
contexts. The contextual variation is controlled on the basis of a strictly defined set of parameters
whose significance has been recognized in the linguistic tradition. As a consequence of this
sampling strategy, the four language resources are comparable insofar as they fit the same corpus
design scheme. See §1 in Chapter III.
Each recorded session is stored in wav files (Windows PCM, 22,050 Hz, 16 bit) and is delivered in
a multimedia corpus with the following main annotations:
a. Session metadata;
b. The orthographic transcription, in CHAT format [3], enriched by the tagging of terminal and
non-terminal prosodic breaks, in .txt files;
c. The text-to-speech synchronization, based on the alignment to the acoustic source of each
transcribed utterance, in .xml files.

This resource is stored on DVDs and is accompanied by the WinPitch Corpus speech software
(© Pitch France) [4]. WinPitch Corpus allows the direct and simultaneous exploitation of the
acoustic and textual information by loading the .xml alignment files, compiled in accordance with
the C-ORAL-ROM alignment DTD.

[1] Other partners in the C-ORAL-ROM Consortium: Pitch France, for the speech software; European Language
Distribution Agency, France (ELDA), for distribution of the resource in the industrial market sector; Instituto
Cervantes, Spain (IC), for dissemination and exploitation for language acquisition purposes. Istituto Trentino di
Cultura (ITC-Irst), Italy, tested the resource in present multilingual speech recognition technologies.

[2] The C-ORAL-ROM databases are anonymous. All speech segments that might offend the user for decency reasons
have been erased and substituted with a beep in the audio signal. Speakers authorized each provider to use the recorded
data for all ends foreseen in the C-ORAL-ROM project, including publication and language technology applications.
The authorization models are available at http://lablita.dit.unifi.it/coralrom/authorization_model. The use of radio and
TV emissions has been authorized by the broadcasting companies, which also provided the raw data for the database.
Acknowledgements of all companies are given in the Copyright section of the DVDs. The authorization databases
have been checked by ELDA (European Language Resource Distribution Agency).
Metadata are defined following an explicit set of rules and contain essential information regarding
the speakers, the recording situation, the acoustic quality, the source and the content of each
session; they ensure a clear identification of the various speech types documented in the resource
(see §3.1 in Chapter III).
The corpora are orthographically transcribed in standard textual format (CHAT format;
MacWhinney 1994), with the representation of the main dialogue features: speakers' turns, the
non-linguistic and paralinguistic events that occur, prosodic breaks, and the segmentation of the
speech flow into discrete speech events (see §3.2 in Chapter III).
In C-ORAL-ROM's implementation, the textual string is divided into utterances following the
annotation of perceptually relevant prosodic breaks, which are discriminated in the speech flow
through perceptive judgments [5] (see §3.3 in Chapter III).
The annotated transcripts are aligned to their acoustic counterpart through WinPitch Corpus.
Segments deriving from the alignment are defined on independent layers, with automatic generation
of the corresponding database.
This multimedia storage ensures a natural and meaningful text/sound correspondence that may
be considered one of the main added values of the resource. Each utterance is aligned to its acoustic
counterpart, generating the database of all the utterances in the resource (roughly 134,000 in the
multilingual corpus) [6] (see §3.6 in Chapter III).

Besides text-to-speech and speech-to-text alignment, WinPitch Corpus allows an easy and
efficient acoustic analysis of speech: real-time fundamental frequency tracking, spectrographic
display, re-synthesis after editing of prosodic parameters, etc.
The multimedia C-ORAL-ROM resource is integrated with additional label files in various
formats, which ensure a multitask exploitation of the resource:

o TXT files with the resource metadata in CHAT format
o XML files with the resource metadata in IMDI format
o The textual resource without alignment information, in TXT files
o The textual resource with automatic Part of Speech (PoS) and lemma tagging of each form
o The textual resource in XML format, according to the C-ORAL-ROM DTD
The consortium ensures maximum accuracy in the transcripts, which have been compiled by
PhDs and PhD students in linguistics. The original transcripts have been revised by at least two
transcribers. The orthographic accuracy of the transcripts has been checked automatically through
a word spelling check and through automatic PoS tagging.

The reliability of the prosodic tagging has been evaluated by an industrial user in the speech
technology sector, external to the consortium, with a detailed analysis of the consensus reached for
each speech style represented in each language corpus. The level of consensus in terms of the
K-statistics index is always over 0.8 (see the evaluation report in Appendix 4).
[3] http://childes.psy.cmu.edu/manuals/CHAT.pdf

[4] Minimal configuration required: Pentium III, 1 GHz, 256 MB RAM, Sound Blaster or compatible sound card,
running under Windows 2000 or XP only. http://www.winpitch.com

[5] The level of inter-annotator agreement on prosodic tag assignment has been evaluated by an external institution
(LOQUENDO); see Appendix 4.

[6] In the French resource terminal breaks are annotated, but the alignment mainly follows pauses in the speech flow.
The level of accuracy of the automatic PoS tagging of each language resource has been
evaluated by the providers and is reported in §3.7.5. The percentage of exact recognition ranges
from 90% to 97%, according to the language resource.

The format accuracy of both metadata and transcripts has been double-checked automatically
through a conversion of the original plain text label files into XML files. More specifically, the
metadata format has been validated through two conversion scripts:

o conversion into XML files according to the C-ORAL-ROM DTD
o conversion into XML according to the IMDI format

The distributor will provide additional quality assurance; see the VALREP.DOC file.
II General Information
1. Contact person
Prof. Emanuela Cresti
Co-ordinator of the C-ORAL-ROM project
Italian Department
University of Florence
Piazza Savonarola, 1
50132 Firenze
Italy
phone: +39 055 5032486
fax: +39 055 503247
e-mail: elicresti@unifi.it
web: http://lablita.dit.unifi.it
2. Distribution media
The C-ORAL-ROM resource is distributed on DVD-5 discs.
3. Content
The C-ORAL-ROM multilingual resource of spontaneous speech for Italian, French, Portuguese
and Spanish comprises three components:
a) Multimedia corpus;
b) Speech software;
c) Appendix.
C-ORAL-ROM is delivered on 9 DVDs, with the following content:

o DVDs 1 to 8 contain the Multimedia corpus: respectively the Italian collection, the French collection,
the Portuguese collection and the Spanish collection.
o DVD 9 contains a set of Appendixes to the multimedia corpus (Speech Software, Textual resources in
various formats, and additional documentation) that are delivered to allow a more efficient multitask
exploitation of the resource.
4. Format of speech and label file

For each spontaneous speech recording session, the following is delivered in the folders of the
multimedia corpus:

1. Speech files: uncompressed files (Windows PCM: 22,050 Hz; 16 bit [7]) with ".wav" extension
2. Transcripts in CHAT format [8], enriched by the annotation of terminal and non-terminal prosodic breaks and the
alignment information, in plain text files with ".txt" extension
3. The text-to-speech alignment files: XML files in WinPitch Corpus format with ".xml" extension [9]
4. DTD of the WinPitch Corpus alignment format (coralrom.dtd)
[7] The resource comprises a sub-corpus of private telephone conversations and human-machine interactions obtained
through a phone call service. The human-machine interactions are sampled in Windows PCM files at 8,000 Hz, 16 bit.
Private telephone conversations have been sampled at 22,050 Hz or at 8,000 Hz.

[8] http://childes.psy.cmu.edu/manuals/CHAT.pdf

[9] Human-machine files were not aligned, so XML files are not present in the relative folders.
For each session, the following files are also delivered in the Appendix to allow a multitask
exploitation of the resource:
5. The transcription of each session in CHAT format in plain text files (without the alignment information) with “.txt”
extension
6. The C-ORAL-ROM transcription of each session in XML files with “.xml” extension
7. DTD of the C-ORAL-ROM textual format (coralrom.dtd)
8. Metadata in CHAT format (plain text files with “_META.txt” extension)
9. Metadata in IMDI format (XML files with “_META.imdi” extension)
10. The C-ORAL-ROM transcription of each session with Part of Speech annotation and Lemma annotation for each
form, in plain text files with "_PoS.txt" extension [10]
In addition, the following files are delivered for each language resource:
11. The tag set adopted, in plain text files (tagset_french.txt, tagset_italian.txt, tagset_portuguese.txt, tagset_spanish.txt)
12. Frequency lists of lemmas and frequency lists of forms, in plain text files
13. Measurements of the language values recorded in each text, in the Excel files "measurements_language.xls"
14. Line diagrams presenting the trend observed with regard to the standard text variation parameters along the corpus
structure's nodes, in the Excel file "multi-lingual_graphics.xls"
15. A set of Excel files containing statistics on the metadata: a) metadata_session.xls: statistics regarding the
completeness of session metadata in the four collections; b) participants_records.xls: statistics regarding the
completeness of the main speaker metadata in the four collections; c) french_speakers.xls, italian_speakers.xls,
portuguese_speakers.xls, spanish_speakers.xls: lists of the speakers recorded in each corpus, in anonymous form, with
their main metadata, including geographical origin, sex and age
16. A set of multimedia samples referred to in the resource documentation
Standard character set used for transcription and annotation: ISO-8859-1
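As a practical illustration (not part of the resource), the declared audio format can be verified directly from a file's wav header with Python's standard wave module; the session filename below is hypothetical:

import wave

# Minimal sketch: check that a C-ORAL-ROM speech file matches the declared
# format (Windows PCM wav, 22,050 Hz, 16 bit, mono). Telephone material may
# be sampled at 8,000 Hz instead (see footnote 7).
def check_wav_format(path, expected_rates=(22050, 8000)):
    with wave.open(path, "rb") as w:
        return (w.getframerate() in expected_rates and
                w.getsampwidth() == 2 and   # 16 bit = 2 bytes per sample
                w.getnchannels() == 1)      # mono

# "ifamdl01.wav" is a hypothetical session filename used for illustration.
print(check_wav_format("ifamdl01.wav"))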
5. Layout of the disk-file system
DVDs 1 to 8 of the Multimedia corpus have the following structure:
Italian collection: DVDs CORALROM1II; CORALROM2IF
French collection: DVDs CORALROM3FI; CORALROM4FF
Portuguese collection: DVDs CORALROM5PI; CORALROM6PF
Spanish collection: DVDs CORALROM7SI; CORALROM8SF
Each language collection has the same folder structure, which mirrors the C-ORAL-ROM corpus
design (no language reference is indicated in the folder structure; this information can be retrieved
from the DVD name):
/<Category>/<Context>/<Domain>

where:

Category
    INFORMAL
    FORMAL

Context
    If category = INFORMAL:
        family_private
        public
    If category = FORMAL:
        natural_context (formal in natural context)
        media
        telephone

Domain
    If category = INFORMAL:
        monologues
        dialogues
        conversations
    If category = FORMAL:
        If context = natural_context:
            business
            conference
            law
            political_debate
            political_speech
            preaching
            professional_explanation
            teaching
        If context = media:
            interviews
            meteorology
            news
            reportages
            scientific_press
            sport
            talk_show
        If context = telephone:
            private_conversations
            human-machine (not applicable to the Portuguese corpus)

[10] The human-machine interaction txt files of the French and Spanish collections have not been PoS-tagged.
The following is the directory structure of each language database in the C-ORAL-ROM
multimedia resource:

INFORMAL/
INFORMAL/family_private
INFORMAL/family_private/monologues
INFORMAL/family_private/dialogues
INFORMAL/family_private/conversations
INFORMAL/public
INFORMAL/public/monologues
INFORMAL/public/dialogues
INFORMAL/public/conversations
FORMAL/
FORMAL/natural_context
FORMAL/natural_context/business
FORMAL/natural_context/conference
FORMAL/natural_context/law
FORMAL/natural_context/political_debate
FORMAL/natural_context/political_speech
FORMAL/natural_context/preaching
FORMAL/natural_context/professional_explanation
FORMAL/natural_context/teaching
FORMAL/media
FORMAL/media/interviews
FORMAL/media/meteorology
FORMAL/media/news
FORMAL/media/political_debate
FORMAL/media/reportage
FORMAL/media/scientific_press
FORMAL/media/sport
FORMAL/media/talk_show
FORMAL/telephone
FORMAL/telephone/private_conversations
FORMAL/telephone/human-machine
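Since the folder layout encodes the corpus design, sessions can be enumerated programmatically. The following is a minimal sketch under the assumption that a language DVD has been copied to a local directory; the root path is hypothetical:

import os

# Walk the <Category>/<Context>/<Domain> tree and yield, for every session,
# its classification (taken from the folder path) together with the wav file
# and the matching transcript (same filename, ".txt" extension).
def list_sessions(corpus_root):
    for dirpath, _dirs, filenames in os.walk(corpus_root):
        for name in sorted(filenames):
            if name.lower().endswith(".wav"):
                rel = os.path.relpath(dirpath, corpus_root)
                classification = rel.split(os.sep)  # e.g. ['INFORMAL', 'family_private', 'dialogues']
                txt = os.path.splitext(name)[0] + ".txt"
                yield classification, name, (txt if txt in filenames else None)

# Hypothetical local copy of the Italian DVD.
for classification, wav, txt in list_sessions("/media/CORALROM1II"):
    print("/".join(classification), wav, txt)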
DVD 9 (CORALROM_AP) contains a set of Appendixes to the multimedia corpus (Speech
Software, Textual resources in various formats, and additional documentation). The content of
CORALROM_AP is structured into folders as follows:

\
    Measurements
    Metadata
    Textual Corpus
    WinPitchCorpus
    Utilities

\Textual Corpus\
    Frequency Lists
    PoS-Tagging
    txt
    xml

\Textual Corpus\txt\
    French_txt
    Italian_txt
    Portuguese_txt
    Spanish_txt

\Textual Corpus\xml\
    French_xml
    Italian_xml
    Portuguese_xml
    Spanish_xml

\Textual Corpus\PoS-Tagging\
    French
    Italian
    Portuguese
    Spanish
    tag_sets

\Textual Corpus\PoS-Tagging\Italian\
    human-machine_interaction

\Textual Corpus\Frequency Lists\
    French
    Italian
    Portuguese
    Spanish

\Metadata\
    CHAT_Metadata
    IMDI_Metadata

\Metadata\CHAT_Metadata\
    French
    Italian
    Portuguese
    Spanish

\Metadata\IMDI_Metadata\
    French
    Italian
    Portuguese
    Spanish

\Utilities
    Prosodic_breaks
    Specifications
In addition to the previous structures, the following directories are used to store those files which
are not part of the database:

\ (root)    README.TXT (short description of the database and the files), COPYRIGH.TXT, DISK.ID
\DOC        documentation
\INDEX      index files, i.e. contents file

The following files are in \DOC:

DESIGN.DOC    the main documentation file
VALREP.DOC    the validation report created by the validation center

The root directory contains the files:

README.TXT: ASCII text file containing a description of the files in the database and a copyright statement
DISK.ID: 11-character string with the volume name

The following file is in \INDEX:

CONTENTS.LST: a plain text file containing the list of the files on the relative DVD. The list shows each folder, each
file and their size in bytes. At the end of the file list of each folder, the number of files placed in it and the size of the
folder in bytes are shown.
6. Hardware, Software and Recording Platforms
6.1 Hardware
Computers:
Processor: Intel Pentium III 500MHz or higher
RAM: 256MB or more
Sound Card: Sound Blaster compatible with S-PDIF input connection
Hard Disk: IDE-ATAPI 10GB or higher
Video Card: SVGA compatible
6.2 Recording Platforms and Software
ITALIAN
Recording platforms:
• DAT recorder TASCAM© DA-P1
• Analogue recorder Sony© TCD5M
Microphones
Unidirectional radio microphones Sennheiser© MKE 40 (dialogue and monologues marked A).
Omnidirectional microphones Sennheiser© MKE2 or equivalent.
Software
O.S.: Microsoft© Windows© 2000
Sound Editor: Syntrillium© Cool Edit 2000
Text Editor: Microsoft© Word© 2000
Text to Speech Alignment Editor: WinPitchCorpus – Pitch France©
PoS tagging: Pi-system CNR©
FRENCH
Recording platforms:
• DAT recorder TASCAM© DA-P1
• Analogue recorder Sony© TCD5M
Microphones
Unidirectional radio microphones Sennheiser© MKE 40 (dialogue and monologues marked A).
Unidirectional microphones Sennheiser© MKE2 or equivalent.
Software
O.S.:
Microsoft© Windows© 2000 & XP
Linux
General
OpenOffice
MSOffice
Speech Software
Various tools developed at Université de Provence
Transcriber (developed by DGA - freeware)
WinPitchCorpus – Pitch France©
Syntrillium© Cool Edit 2000
MES (developed at Université de Provence)
Programming languages
Perl, C++
Morphosyntactic software
Cordial (tagger)
Multiple tools developed internally at Université de Provence
Concordance program: Contextes (developed internally at Université de Provence)
PORTUGUESE
Recording platforms:
• DAT recorder TASCAM DA-P1;
• A TEAC W-580R Double Auto Reverse Cassette Deck UR audio tape recorder;
Microphones AKG UHFPT40;
O.S.: Microsoft© Windows© 2000 and XP
Sound Editor: Syntrillium© Cool Edit 2000
Text Editor: Microsoft© Word© 2000
Text to Speech Alignment Editor: WinPitchCorpus – Pitch France©
PoS tagging: Brill tagger ©
SPANISH
Recording platforms:
DAT recorder TASCAM© DA-P1
Unidirectional radio microphones E-Voice© MC-150 (for all sessions).
Software
O.S.: Microsoft© Windows© 2000
Sound Editor: Creative© Wave Studio© v 4.11
Text Editor: Microsoft© Word© 2000
Text to Speech Alignment Editor: WinPitchCorpus – Pitch France©
PoS tagging: Grampal©
7. Number of recordings and general corpus data

Speech files of the recorded sessions and the corresponding transcription files are in one-to-one
correspondence. [11] The following is the general table of the main values recorded in the
C-ORAL-ROM multimedia corpus:

Language     wav files   GB     Duration   Utterances   Words    Speakers   Male   Female
French       206         3,77   26.21.43   21010        295803   305        154    150
Italian      204         5,19   36.16.10   40402        310969   451        276    175
Portuguese   152         4,43   29.43.42   38855        317916   261        144    117
Spanish      210         4,56   31.06.00   35588        333482   410        247    163
[11] The one-to-one correspondence is partial for sessions with more than 8 participants, which are split into more
than one wav file in the multimedia collection. Each such session is reported in a wav file which records the whole
session and is counted in the table above. However, it is also split into more than one wav file, following the
convention filename_1, filename_2, etc., each one containing no more than 8 participants (not counted in the number
of sessions). The transcription of a split session is delivered in one file only in the textual resource, while it is
delivered split into more than one file in the multimedia collection.
8. Acoustic quality

C-ORAL-ROM is oriented towards the collection of corpora in natural environments, despite
the fact that this necessarily lowers the acoustic quality of the resource. Moreover, C-ORAL-ROM
has exploited, in the frame of a new multilingual work, the rich contents of the archives set up by
the providers during years of research on spoken language; therefore the acoustic quality and the
recording conditions of the resource are variable.

The following are the requirements for the acoustic format and for the recording apparatus:

Format: mono wav files (Windows PCM), sampling frequency 22,050 Hz, 16 bit [12]

Recording and storing process for old analogue recordings: directly derived into wav files (22,050 Hz, 16 bit) from
the original analogue tapes through a standard sound card (Sound Blaster Live or compatible) with a professional
sound editor, Cool Edit 2000®.

Recording and storing process for new recordings:
a) dialogues: stereo DAT or minidisk recording (44,100 Hz) with unidirectional microphones, converted into
mono .wav files (Windows PCM, 22,050 Hz, 16 bit) via the S/PDIF port of a standard sound card (Sound Blaster
Live or compatible) with a professional sound editor;
b) conversations with more than two participants: mono DAT or minidisk recording with cardioid or
omnidirectional microphones, converted into mono .wav files via the S/PDIF port of a standard sound card
(Sound Blaster Live or compatible) with a professional sound editor.

The speech files of the acoustic database are rated on a quality scale (recording, volume, voice
overlapping and noise). The quality scale extends from the highest level of clarity of the voice
signal to low levels of acoustic quality:
1) Digital recordings with DAT or minidisk apparatus and unidirectional microphones, or analogue recordings of
very high quality

2) Digital recordings with poorer microphone response, or analogue recordings with:
• good microphone response;
• low background noise;
• low percentage of overlapped utterances;
• F0 computing possible in most of the file

3) Low quality analogue recordings with:
• poor microphone response;
• background noise;
• average percentage of overlapped utterances;
• F0 computing possible in many parts of the file
The quality is gauged spectrographically. Sessions in which F0 analysis is not significant are
excluded from sampling. The acoustic quality of each recording and the most relevant data on the
recording conditions are always recorded in the metadata of each text.
[12] See footnote 7.
III C-ORAL-ROM corpus
1. Corpus Design
1.2 Sampling Parameters
Spontaneous speech events are those communication events in which the programming of speech is
simultaneous with its execution by the speaker; i.e. the speech event is non-scripted or only partially
scripted. The C-ORAL-ROM resource offers a representation of the spontaneous speech universe
in the four main romance languages (Italian, French, Portuguese and Spanish) with regard to the
following main parameters, which define speech variation:
A. Communication context
B. Language register
C. Speaker
A. The communication context of a speech event is defined by the following features, each
specified by a closed vocabulary [13]:

Channel: the means by which the signal transmission is achieved.
    Face to face communication: speech event among participants in the same unity of space
    and time, with reciprocal direct multi-modal perception and interaction;
    Broadcasting: unidirectional speech emission to an undefined audience by devices that
    ensure, at least, the perception of voice;
    Telephone: bi-directional speech event by means of a telephone.

Structure of the communication event: role and nature of the participants in the speech event.
    Monologue: speech event with only one participant performing a main communication
    task [14];
    Dialogue: speech event with two participants;
    Conversation: speech event with more than two participants;
    Human-machine interaction: speech event between a human being and an electronic device;
    Non-natural format: other, i.e. the format of broadcast emissions (undefined in this
    resource).

Social context: organization level of the society to which the speech event belongs.
    Family/private: speech event within the family or a private social context;
    Public: speech event within a public social context.

Domain of use: semantic field defined by the content of the speech event.
B. Language register
    Informal: un-scripted low variety of language, used for everyday interactive purposes;
    Formal: partially-scripted, task-oriented high variety of language.

C. Speaker: the main qualities of the speaker that may influence their speech production.
    Sex: sex of the speaker
    Age: speaker's age
    Education: speaker's schooling degree
    Occupation: speaker's employment field
    Geographical origin: speaker's place of origin
[13] See IMDI Metadata Elements for Session Descriptions, Version 3.0.3, at http://www.mpi.nl/IMDI/

[14] The monologue context fits the definition of "only one participant" exactly only in the formal use of language. In
this case the social rules governing the interaction among participants may ensure the execution of the communication
task by a sole speaker. On the contrary, in the everyday informal use of language, although only one speaker performs
the main communication task, other participants may interact in the communication event with a low informative
contribution.
1.3. Sampling strategy and Corpus design
The corpus design of the C-ORAL-ROM resource is the result of the application of two criteria:
- a sampling strategy, which defines how to apply the set of relevant parameters for the
representation of the universe through recording sessions
- a definition of the size of the resource and of each session
Given the variation parameters in §1.2, the sampling strategy adopted for the representation of
speech variability in the four C-ORAL-ROM corpora is a function of the following principles:

• definition of the size of the resource: given the cost of spoken resources, a content of around
1,200,000 words (300,000 words for each language collection) was fixed;
• sampling of the universe with reference to context variation and to language register variation,
leaving speaker variation random;
• distinction between formal speech (50%) and informal speech (50%), thus ensuring a sufficient
representation of dialogical informal speech (which is the part of the resource with the highest
added value);
• selection of different criteria for sampling the formal and the informal parts of the corpus;
• definition of the text weight in terms of information units (words);
• definition of a text weight that ensures both the possible appreciation of macro-textual
properties and a sufficient representation of the universe in each 300,000-word corpus;
• representation of a variety of possible recording situations within the range of perception and
intelligibility of the human ear;
• recording, as part of the metadata, of: a) speaker characteristics (gender, age, geographical
region, education and occupation); b) the acoustic quality of the text.
In a multilingual collection of limited size, strong diatopical limits for each language must
be established. C-ORAL-ROM does not represent in a systematic way the diatopical phonetic
variations due to the geographical origin of the speakers. [15] Given the relatively small size of the
resource, the corpus design strategy concentrates on those variation parameters that are, in
principle, most relevant to the documentation of speech variation, and tries to maximize the
significance of the sampling with regard to the probability of occurrence of different types of
speech acts and syntactic constructions. To this end, as many different types of context of use and
possible language tasks as possible are represented, together with the speech acts and modes of
language construction most typical of those contexts.
The use of the formal/informal parameter in the corpus design scheme allows a restriction of
the number of significant parameters with regard to context variation. More specifically, while it
can be assumed that in western societies the formal use of language is applied in a closed series of
typical domains, the same does not hold for the informal use of language. The list of possible
domains of use for informal language is by definition open, and no domain can in principle be
considered more typical than others.
Under this assumption, the identification of the main domains of use of formal language
maximizes the probability of representing the significant variations in this language variety, and is
therefore the best strategy. On the contrary, if significant variations of informal spontaneous speech
are to be considered, the same strategy would reduce their probability of occurrence.
Therefore, the definition of a finite list of typical domains of use is the main criterion applied in
documenting the formal uses of the four romance languages, while variations in dialogue structure
are not controlled (the social context of use is generally public, and the most frequent dialogue
structure is the monologue). On the contrary, variations in social context of use and in dialogue
structure are the parameters systematically adopted for the documentation of the informal part,
while the choice of the specific semantic domain of use is left random.

[15] Corpora are mainly collected in Continental Portugal, Central Castile (Spain), Southern France and Western
Tuscany, and are intended to represent a possibly accepted standard, rather than all the varieties of pronunciation,
which would need collections of inter-linguistic corpora with a wide diatopical variability. This limitation is quite
severe for Italian, where local varieties may strongly diverge from the standard (De Mauro et al., 1993). See §3.1.7.2
for the statistics on the geographical origin of speakers and the detailed Excel tables in the Utilities directory on DVD 9.
The significance of the text weight strategy also varies between the formal and the informal
use of language. The formal use of language generally features long textual structures, while in
informal use the length of syntactic constructions is limited. Therefore, in order to ensure the
probability of occurrence of typical structures, the text length for the formal samples must be
significantly longer.

The above variation parameters and sampling strategy are projected into the corpus design matrix
presented in the following paragraph.
1.4 Comparability
By definition of spontaneous speech, comparability cannot be obtained through the use of parallel
corpora. Each resource of the C-ORAL-ROM corpus is comparable with the others insofar as it
satisfies the conditions on the corpus design stated in the following matrix, which reflects the
variation parameters defined in §1.2.

INFORMAL section (MANDATORY) [-Partially scripted]*: 150,000 words (MANDATORY [16])

    Context (MANDATORY):
        Family/Private context [-Public; -Partially scripted]*: 124,500 words
        Public context [+Public; -Partially scripted]* or [-Public; +Partially scripted]*: 25,500 words

    Domain (MANDATORY):
        Monologues: 48,000 words
        Dialogues/Conversation: 102,000 words

FORMAL section (MANDATORY) [+Public; +Partially scripted]*: 150,000 words (MANDATORY [16])

    Context and Domain (MANDATORY):
        Formal in natural context (political speech; political debate; preaching; teaching; professional
        explanation; conference; business; law): 65,000 words
        Media (news; sport; interviews; science; meteo (weather forecast); scientific press; reportage;
        talk show [17]): 60,000 words
        Telephone (private conversation; human-machine interactions [18]): 25,000 words

* Additional feature used in C-ORAL-ROM for the classification of sessions when the definitions of the parameters
in §1.2 turn out not to be sufficient.
[16] No upper limit; 5% variation allowed. This limit is not to be considered strictly mandatory in the case of the
Formal in Natural Context sub-field.

[17] Talk shows may belong to various typologies, e.g. political debate, thematic discussions, culture, etc.

[18] 10,000 words. Multilingual service for train information, accomplished in the C-ORAL-ROM project by
ITC-IRST. Field not present in the Portuguese corpus.
TEXT LENGTH REQUIREMENTS

In the informal section:
Short texts: at least 64 texts of around 1,500 words each. Up to 20% of this part may be constituted by texts of
different length. [19]
Long texts: from 8 to 10 texts of around 4,500 words each.

In the formal section, the text length is defined according to the following rules:
For Formal in natural context: 2 or 3 samples for each domain, of 3,000 words on average.
For Media: at least one short sample for weather forecasts and news; at least 2 or 3 samples for each of the other
domains, which must be represented with at least 6,000 words; samples of 1,500 or 3,000 words on average.
For Telephone: text length not defined (by preference a 1,500-word upper limit, no lower limit). The human-machine
interactions domain should contain 10,000 words. This domain is not present in the Portuguese corpus.

[19] In order to allow a better exploitation of the archives existing prior to C-ORAL-ROM, the use of smaller samples,
and of samples obtained by splitting longer sessions, has been tolerated for up to 20% of the informal part.

The term "word" refers to all graphic elements in the text delimited by spaces and corresponding
to the orthographic transcription of speech. Signs in the textual files corresponding to metadata,
dialogue representation format and tagging elements are not counted as words.
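As an illustration of this definition, word counting can be automated. The following is a rough sketch only, under stated assumptions about the CHAT layout (header lines start with "@", speech turns with a "*XXX:" speaker label, dependent lines with "%") and treating any token that carries no letter as an annotation sign; the filename is hypothetical:

import re

# Rough word count per the definition above: space-delimited graphic
# elements on the speech lines, excluding the speaker label and any token
# without a letter (prosodic and dialogue annotation signs).
def count_words(path):
    n = 0
    with open(path, encoding="iso-8859-1") as f:
        for line in f:
            if not line.startswith("*"):      # skip "@" headers and "%" dependent lines
                continue
            speech = line.partition(":")[2]   # drop the "*XXX:" speaker label
            n += sum(1 for tok in speech.split()
                     if re.search(r"[^\W\d_]", tok))
    return n

# Hypothetical transcript filename, for illustration.
print(count_words("efamdl01.txt"))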
Tolerance

The requirements in the corpus design matrix concerning section, context, domain structure and text
length are mandatory. The target number of words requested for each field in the matrix has been
approximated in each language collection as in the following table:

Section / Context / Domain              Target   Italian   French   Spanish   Portuguese
INFORMAL                                150000   154967    152385   168868    165436
  Family-private                        124500   128676    124886   131056    132887
    Monologue                            42000    45212     47702    42082     45937
    Dialogue-Conversation                82500    83464     77184    88974     86950
  Public                                 25500    26291     27499    37812     32549
    Monologue                             6000     6050      6960     6116      7696
    Dialogue-Conversation                19500    20241     20539    31696     24853
FORMAL                                  150000   156002    143418   164614    152480
  Natural context (see the above list)   65000    68328     57319    72268     66140
  Media (see the above list)             60000    61759     57143    62739     62018
  Telephone (see the above list)         25000    25915     28956    29607     24322
TOTAL                                   300000   310969    295803   333482    317916
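As a quick consistency check (an illustration, not part of the resource), each language's INFORMAL and FORMAL word counts in the table can be verified to sum to its TOTAL, e.g. for Italian 154967 + 156002 = 310969:

# Cross-check of the tolerance table: INFORMAL + FORMAL = TOTAL per language.
informal = {"Italian": 154967, "French": 152385, "Spanish": 168868, "Portuguese": 165436}
formal   = {"Italian": 156002, "French": 143418, "Spanish": 164614, "Portuguese": 152480}
total    = {"Italian": 310969, "French": 295803, "Spanish": 333482, "Portuguese": 317916}

for lang in total:
    assert informal[lang] + formal[lang] == total[lang], lang
print("word totals are consistent")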
The corpus design and the sampling criteria of C-ORAL-ROM ensure:
- The production of corpora which offer, in principle, a representation of the wide variety of
syntactic and prosodic features of spontaneous speech
- The production of comparable multilingual corpora as a basis for the induction of linguistic
generalizations regarding spontaneous speech in the four romance languages
2. Filename conventions

Filenames of the C-ORAL-ROM resources bear information of three types, in order:
a) the represented language;
b) the text type, i.e. the field and sub-field to which each text belongs in the corpus structure;
c) the serial number identifying each text in its sub-field.
The following are the conventions adopted in each C-ORAL-ROM language collection:
1) Language
Country code: “f” (French), “i” (Italian), “p” (Portuguese), “e” (Spanish),
2) Text type
Informal section:
context: “fam” (family_private), “pub” (public)
domain: “mn” (monologues), “dl” (dialogues), “cv” (conversations),
Formal section:
context: “nat” (natural_context)
domain: “ps” (political_speech), “pd” (political_debate), “pr” (preaching), “te” (teaching), “pe”
(professional_explanation), “bu” (business), “co” (conferences), “la” (law)
context: “med” (media)
domain: “nw” (news), “mt” (meteorology), “in” (interviews), “rp” or “rt” (reportages), “sc” (scientific_press),
“sp” (sport), “ts” or “st” (talk_show)
context: “tel”
sub-field: “pv” or “ef” (private_conversations), “mm” (human-machine)
3) Text identification number:
A two-digit serial number identifying the text's progressive position within its sub-field;
e.g.:
efamdl01 (Spanish, family_private, dialogues, 01)
efammn02 (Spanish, family_private, monologues, 02)
4) Split session convention:
Sessions with more than 8 participants cannot be aligned by WinPitch Corpus, which in the present version is limited
to 8 layers (one layer per speaker). For this reason, in the multimedia corpus, although the metadata refer to the whole
session, those sessions were stored in more than one alignment, sound and transcription file, in accordance with the
following convention: an underscore ("_") and a digit (starting from "1") are added at the end of the filename;
e.g.:
pnatpd01_1.xml, pnatpd01_2.xml, ...
pnatpd01_1.wav, pnatpd01_2.wav, ...
pnatpd01_1.txt, pnatpd01_2.txt, ...

In the French corpus every minor speaker exceeding the layer number limit has been aligned on the same layer and
with the same participant short name ("ZZZ"); therefore the problem does not arise.

The split file convention above regards the Multimedia Corpus only and does not apply to files in the Textual Corpus
folder, where split sessions are delivered in one file only. The wav file of the whole session is also delivered in the
folders of the multimedia corpus.
3. Annotation information
For each recorded session, C-ORAL-ROM provides the following set of annotations:
1. Metadata
2. Transcription and Dialogue Representation
3. Prosodic tagging of terminal and non-terminal breaks
4. Alignment
5. Part of speech (PoS) and lemma tagging of each transcribed form
3.1 Meta-Data
For each session, an ordered set of meta-data is recorded in four modalities: 1) as header lines in CHAT format before the orthographic transcription of each session; 2) as independent files in CHAT format; 3) in IMDI format; 4) within the XML file according to the C-ORAL-ROM DTD.
The following are the rules for marking the metadata of C-ORAL-ROM in CHAT format, from which the annotations in the other formats derive:
• Each metadata type is introduced by “@”, immediately followed by a label, followed by “:” and an empty space.
• Metadata are listed in a closed set of types regarding: the session; its size; the speakers; the acoustic quality; the source; the person who can provide information on the session.
• Metadata fields or sub-fields that cannot be filled for lack of information are filled with an 'x' (capital or lower case).
Label Type          Description
@Title:             One or two words in the object language that help to recognize the text
@File:              Filename without extension (the names of the audio file and the text file differ only in the extension)
@Participants:      Three capital letters identifying each speaker, followed by the corresponding proper name (first name), plus a sub-field with an ordered set of information on the speaker
@Date:              Date of the recording: day/month/year, separated by slashes; e.g. 20/06/2001. Unknown fields are filled with '00'; e.g. 00/00/2001 or 00/00/00
@Place:             Name of the city where the recording session took place
@Situation:         Ordered set of information: genre and role of the participants in the situation, environment, main actions performed, recording conditions, according to the rules specified below; e.g. gossip between friends at home during dinner, not hidden, researcher participant
@Topic:             The main argument dealt with in the speech event (max 50 characters); e.g. traffic problems
@Source:            Name of the collection leading to a copyright holder; e.g. LABLITA_CORPUS; CORPAIX
@Class:             The set of fields on the text class in accordance with the C-ORAL-ROM corpus structure (separated by commas)
@Length:            Length of the transcribed audio file in minutes (’) and seconds (”); e.g. 12’ 15”
@Words:             Number of words in the text file
@Acoustic_quality:  The acoustic quality of the recording, in accordance with specific criteria (A, B or C)
@Transcriber:       Name of the person responsible for the text, who can provide further information
@Revisors:          Names of the revisors
@Comments:          Transcriber's comments on the text

• Each metadata type is filled with PCdata (closed or open vocabulary) ending with “enter” and can be specified in accordance with the rules.
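As an illustration, the header lines can be collected mechanically. The following is a minimal sketch in Python (the function is hypothetical and not part of the resource); it reads the “@Label: value” lines that precede the transcription:

def read_chat_headers(path):
    """Collect the CHAT header lines ('@Label: value') that open a session file."""
    headers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.startswith("@"):
                break                       # headers precede the transcription
            label, _, value = line.rstrip("\n").partition(": ")
            headers[label.lstrip("@").rstrip(":")] = value
    return headers

# e.g. read_chat_headers("ifamdl01.txt").get("Title")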
3.1.2 Rules for the Participant field
Ordered set of sub-fields for each participant (in parentheses, separated by a comma and an empty space).

Type                      Description                            Vocabulary                                            O(ptional)/M(andatory)
Sex                       Sex of the speaker                     Closed vocabulary: Man (Male); Woman (Female);        M
                                                                 X (unknown)
Age                       Age of the speaker                     Closed vocabulary. One of the following               M
                                                                 conventional capital letters for each range of
                                                                 age between brackets: A (18-25); B (26-40);
                                                                 C (41-50); D (over 60); X (unknown)
Education                 The level of education according       Closed vocabulary. One number according to the        M
                          to the schooling degree                degree between brackets: 1 (primary school or
                                                                 illiteracy); 2 (high school); 3 (graduated or
                                                                 university students); X (unknown)
Profession                Usual profession of the speaker        Open vocabulary. Name of the profession (e.g.         M
                                                                 professor; secretary; student) or X (unknown)
Role                      Role in the recorded event (even       Open vocabulary. Name of the role (e.g. father;       M
                          if it is equal to the profession)      professor) or X (unknown)
Geographical origin/      Name of the region from which          Open vocabulary (e.g. Ile de France; Castile)         M
linguistic influence      the speaker originates                 or X (unknown)
3.1.3 Rules for the Class field

informal
Type:
 family/private
 public
Sub-type:
 monologue
 dialogue
 conversation

formal
Type:
 formal in natural context
Sub-type:
 political speech
 political debate
 preaching
 teaching
 professional explanation
 conference
 business
 law (through media)
Sub-sub-type (optional):
 monologue
 dialogue
 conversation

formal
Type:
 media
Sub-type:
 news
 sport
 interviews
 meteo
 scientific press
 reportage
 talk_show

formal
Type:
 telephone
Sub-type:
 private conversation
 human-machine interactions
3.1.4 Rules for the Situation field
The situation field is a set of information, reported in a discursive manner, that helps to identify the context in which the language event takes place. The following are the guidelines used in C-ORAL-ROM to define the set of possible relevant information for the situation field*:

Type                   Description                                      Vocabulary
Genre                  Information that helps to define the genre       Open vocabulary; e.g. gossip, chat, quarrel, discussion,
                       of the linguistic event                          narration, claim, etc. The neutral case is "talk". The
                                                                        information in the Class field (dialogue, conversation,
                                                                        etc.) is not repeated.
Reciprocal role        The reciprocal role of the participants          Open vocabulary (e.g. friends, colleagues, relatives,
                                                                        citizens)
Ambience               The kind of surroundings where the               Open vocabulary (e.g. in a silent studio; on the street;
                       recording took place                             at home; in a shop; at school; in an office, etc.)
Action                 Main action performed during the speech          Open vocabulary (e.g. while ironing, during depilation)
                       event (if any)
Recording conditions   Status of the recording with respect to the      Closed vocabulary. A choice of one alternative in each of
                       “Observer paradox” in spontaneous speech         the following two sets: 1) hidden vs. not hidden;
                       resources                                        2) participant researcher vs. observant researcher vs.
                                                                        researcher not present.

* For media corpora the situation field is filled with the name of the program.
3.1.5 Rules for marking Acoustic quality
Texts in the collections are MANDATORILY labeled with respect to the acoustic quality of the sound source*:

Type                   Properties                                               Label
Digital recordings     Digital recordings with DAT or minidisk apparatus        A
                       and unidirectional microphones, or analogue
                       recordings of very high quality
Analogue recordings    Digital recordings with poorer microphone response,      B
                       or analogue recordings with:
                        • good microphone response
                        • low background noise
                        • low percentage of overlapped utterances
                        • F0 computing possible in most of the file
Analogue recordings    Low quality analogue recordings with:                    C
                        • mediocre microphone response
                        • background noise
                        • average percentage of overlapped utterances
                        • F0 computing possible in many parts of the file

* Sessions in which F0 analysis is not significant are labeled D and excluded from sampling.
3.1.6 Quality assurance on metadata format
The format correctness of the metadata has been double-checked automatically through conversion of the original label .txt files into XML files.
More specifically, the metadata format has been validated through two conversion scripts:
o conversion into XML files according to the C-ORAL-ROM DTD (see §4 below)
o conversion into XML according to the IMDI format (see C-ORAL-ROM metadata in IMDI format at http://lablita.dit.unifi.it/coralrom/imdi/)
3.1.7 Statistics from C-ORAL-ROM metadata
A set of statistics on the C-ORAL-ROM metadata is reported in Excel files on DVD9; the following are the main figures regarding speaker metadata and session metadata.
3.1.7.1 Number of speakers
Given that each database is substantially anonymous, the number of speakers is estimated on the metadata set. More specifically, a speaker is identified in each collection by the identity of the short name, together with the long name, sex, age and geographical origin, when these fields are filled with positive data.20
The number of speakers is given below, in total and separately for each main field in the corpus design.21
ITALIAN
Total Speakers: 451
Informal: 209
Formal in Natural Context: 57
Media: 173
Telephone: 27

FRENCH
Total Speakers: 305
Informal: 164
Formal in Natural Context: 51
Media: 75
Telephone: 18

PORTUGUESE
Total Speakers: 261
Informal: 106
Formal in Natural Context: 78
Media: 64
Telephone: 28

SPANISH
Total Speakers: 410
Informal: 164
Formal in Natural Context: 51
Media: 184
Telephone: 20
3.1.7.2 Distribution of speakers per geographical origin
The following is the distribution of speakers per geographical origin according to the metadata. The high number of unknown speakers is mainly due to the media collections, where this information is not available (see the distribution of absent speaker metadata below).
FRENCH                          speakers
Provence and South of France    103
Poitiers and West France        28
Paris                           26
Other regions                   17
Centre of France                12
Other countries                 5
Other Francophone countries     2
Unknown                         112
Total                           305
20 This restriction is intended not to overestimate the number of different speakers, given that some information about the same speaker might not be available to every transcriber.
21 The number of different speakers in human-machine interaction cannot be computed.
ITALIAN                     speakers
Tuscany                     188
South and Isles             38
Other countries             20
Central Italy               16
North - Various Regions     14
Unknown                     175
Total                       451
PORTUGUESE                    speakers
Lisbon and Center Portugal    77
North Portugal                19
South Portugal                16
Açores and Madeira            10
Overseas                      8
Other Regions                 7
Other countries               5
Unknown                       119
Total                         261
SPANISH                 speakers
Madrid and Castile      188
South America           19
Andalusia               11
Extremadura             11
Other Regions           11
Catalonia               7
Other countries         2
Unknown                 161
Total                   410
3.1.7.3 Completeness of speaker features in the metadata records
The following set of statistics is given regarding the completeness of the information on the speakers reported in the metadata records of each session. The statistics refer only to the main features, i.e. sex, age, education and geographical origin, which may be significant for the exploitation of the resource.22
The number of records where the information is complete is identified for each language corpus and for each main field of the corpus design.23 Moreover, the percentage of records in which each of the main features is unknown is also reported for each language corpus and for each main field in the corpus design.24
French
                           TOTAL    INFORMAL   NATURAL CONTEXT   MEDIA    TELEPHONE PRIVATE
RECORDS                    456      243        64                91       58
SEX-AGE-ORIGIN-EDUCATION   57,89%   75,72%     48,44%            9,89%    68,97%
NO SEX                     0,00%    0,00%      0,00%             0,00%    0,00%
NO AGE                     24,56%   13,17%     37,50%            42,86%   29,31%
NO ORIGIN                  36,40%   16,87%     48,44%            84,62%   29,31%
NO EDUCATION               24,78%   10,29%     18,75%            63,74%   31,03%

Italian
                           TOTAL    INFORMAL   NATURAL CONTEXT   MEDIA    TELEPHONE PRIVATE
RECORDS                    596      273        70                214      39
SEX-AGE-ORIGIN-EDUCATION   46,14%   75,09%     40,00%            5,14%    79,49%
NO SEX                     0,00%    0,00%      0,00%             0,00%    0,00%
NO AGE                     24,50%   12,09%     30,00%            42,99%   0,00%
NO ORIGIN                  37,58%   3,66%      41,43%            84,58%   10,26%
NO EDUCATION               36,24%   20,51%     20,00%            66,36%   10,26%

Portuguese
                           TOTAL    INFORMAL   NATURAL CONTEXT   MEDIA    TELEPHONE PRIVATE
RECORDS                    437      188        103               109      37
SEX-AGE-ORIGIN-EDUCATION   54,92%   97,34%     16,50%            2,75%    100,00%
NO SEX                     0,00%    0,00%      0,00%             0,00%    0,00%
NO AGE                     40,96%   0,00%      75,73%            92,66%   0,00%
NO ORIGIN                  43,48%   2,66%      76,70%            97,25%   0,00%
NO EDUCATION               34,78%   0,00%      55,34%            87,16%   0,00%

Spanish
                           TOTAL    INFORMAL   NATURAL CONTEXT   MEDIA    TELEPHONE PRIVATE
RECORDS                    553      204        58                269      22
SEX-AGE-ORIGIN-EDUCATION   51,36%   100,00%    65,52%            7,43%    100,00%
NO SEX                     0,00%    0,00%      0,00%             0,00%    0,00%
NO AGE                     36,71%   0,00%      5,17%             74,35%   0,00%
NO ORIGIN                  44,85%   0,00%      34,48%            84,76%   0,00%
NO EDUCATION               0,18%    0,00%      0,00%             0,37%    0,00%
22 Profession and role fields are considered additional information and are not the object of this statistical evaluation.
23 The number of records is higher than the number of speakers, given that the same speaker may appear in different sessions with more or less metadata information available.
24 This information is never available for speakers of human-machine interactions and for speakers of mixed turns, which therefore have not been counted.
The statistics show that only “sex” is always filled. However, the relevant information regarding the speakers is complete in all or most of the Informal and Telephone sub-corpora, which are the most significant contexts for this type of information.
It must be considered that the information in question is usually not available for media sessions and only occasionally available in Formal in natural context sessions.
3.1.7.4 Completeness of session metadata
All session metadata are filled with real information. The “Place” information is absent only in 5 sessions and in human-machine interactions, where the place is necessarily unknown.
The statistics below record the number of session metadata fields that are filled with empty information (x):
French
                     Title  Name  Date  Place  Topic  Situation  AQ
TOTAL                0      0     0     46     0      0          0
INFORMAL             0      0     0     1      0      0          0
NATURAL CONTEXT      0      0     0     0      0      0          0
MEDIA                0      0     0     3      0      0          0
TEL. PRIVATE         0      0     0     0      0      0          0
TEL. HUMAN-MACHINE   0      0     0     42     0      0          0

Italian
                     Title  Name  Date  Place  Topic  Situation  AQ
TOTAL                0      0     0     51     0      0          0
INFORMAL             0      0     0     0      0      0          0
NATURAL CONTEXT      0      0     0     0      0      0          0
MEDIA                0      0     0     0      0      0          0
TEL. PRIVATE         0      0     0     0      0      0          0
TEL. HUMAN-MACHINE   0      0     0     51     0      0          0

Portuguese
                     Title  Name  Date  Place  Topic  Situation  AQ
TOTAL                0      0     0     1      0      0          0
INFORMAL             0      0     0     0      0      0          0
NATURAL CONTEXT      0      0     0     1      0      0          0
MEDIA                0      0     0     0      0      0          0
TEL. PRIVATE         0      0     0     0      0      0          0
TEL. HUMAN-MACHINE   0      0     0     0      0      0          0

Spanish
                     Title  Name  Date  Place  Topic  Situation  AQ
TOTAL                0      0     0     41     0      0          0
INFORMAL             0      0     0     0      0      0          0
NATURAL CONTEXT      0      0     0     0      0      0          0
MEDIA                0      0     0     0      0      0          0
TEL. PRIVATE         0      0     0     0      0      0          0
TEL. HUMAN-MACHINE   0      0     0     41     0      0          0
3.2. Transcription and Dialogue representation
The C-ORAL-ROM dialogue representation is defined as an implementation of the CHAT architecture (MacWhinney, 1994) (http://childes.psy.cmu.edu/manuals/CHAT.pdf) and has the following structure:
1 Text lines: orthographic transcription of the speech information divided:
a) Vertically, in dialogic turns (introduced by a speaker label)
b) Horizontally, by prosodic parsing and utterance limit, representing terminal and non terminal
prosodic breaks of the speech continuum
2. Dependent tiers: contextual information.
3.2.1 Basic Concepts for dialogue representation

Concept         Definition
Dialogic Turn   Continuous set of speech events by only one speaker's voice. The dialogic turn changes if, and only if, a speech event by another speaker occurs.
Session         Set of dialogic turns corresponding to one meta-data set.
Utterance       The minimal speech event by a single speaker such that it can be pragmatically interpreted as a speech act.25
Word            A speech event perceived as a phonetic unit, such that it conveys a meaning.
3.2.2. Turn representation.
A dialogic turn by one speaker is expressed by “*” immediately followed by three capital letters
identifying the speakers in the metadata, then followed by “:” and one space before the transcription
of the speech event. Each dialogic turn ends with an “enter”.
Convention26           Description
^\*[A-ZÑ]{3}:\s{1}     Dialogic turn of a given speaker
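The convention can be applied directly as a regular expression. A minimal sketch in Python, using the pattern given above (the variable names are illustrative):

import re

# the turn convention above; MULTILINE so it matches at the start of each line
TURN = re.compile(r"^\*([A-ZÑ]{3}):\s", re.MULTILINE)

sample = "*MOR: I'm going home // I'm tired //\n*PIL: bye bye //\n"
for m in TURN.finditer(sample):
    print(m.group(1))    # speaker codes: MOR, PIL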
3.2.3 Utterance representation.
Each utterance is represented by a series of transcribed words ending with the termination symbol
“//” or other symbols having a termination value (see below).
E.g.:
*MOR: I’m going home // I’m tired //
*PIL: bye bye //
A dialogic turn can also be filled with non-linguistic or paralinguistic material, according to the transcription conventions (see below):27
*MAX: are you sure ?
*PIL: hhh
25 See the operative definition of Speech Act in § 3.4.3 and references therein.
26 Expressed in regular expression notation.
27 In this case the dialogic turn is filled by a communication event instead of a speech event. The relation between the two concepts is left undefined in C-ORAL-ROM; the resource marks the main communication events in a dialogue, but is not specifically devoted to the study of such events, which must be considered in a multimodal framework.
3.2.4. Word representation.
Each word is transcribed as a continuous sequence of characters between two empty spaces, in
accordance with the orthographic convention of each language.
3.2.5 Transcription
1. The transcription of a dialogic turn expresses, horizontally, the sequence of speech events (utterances) that occur in each dialogic turn; i.e., in principle, no “enter” can occur within an utterance. This may not be the case in overlapping and other cross-over dialogue phenomena (see below).
2. The transcription follows the standard orthography of each language in the C-ORAL-ROM resource and is integrated with special signs devoted to handling spoken language phenomena. No phonetic transcription is scheduled. In the absence of any previous orthographic tradition, non-standard regional expressions are normalized. The orthographic choices made in each language collection are reported in APPENDIX 3.
Concept   Definition
Text      Each recorded session is an ordered collection of transcribed dialogic turns referring to a given metadata set.
3.2.6 Overlapping
The speech of one speaker in spontaneous dialogues is frequently overlapped by the speech of
another speaker, who may insert his dialogic turn in the first speaker’s turn.
The overlapping therefore determines a relation of temporal equivalence between two or more
speech portions in different dialogic turns.
In C-ORAL-ROM, overlapping is represented through the conventions reported below.
The overlapped text in both dialogic turns is placed between brackets < >.
Moreover, the sign [<] may optionally appear at the beginning of the second turn, immediately
before the overlapped text, to mark that an overlapping relation holds between the first two pieces
of text between brackets, e.g.:
*ABA: María mi ha detto che [/] <che non viene> // [Maria told me that /<that she’s not coming>]
*BAB: [<] <che non viene> //
In C-ORAL-ROM overlapping is marked only when at least two words in two different turns are
concerned; this means that the overlapping of syllables is left unmarked or reported to word
boundaries.
When, due to the simultaneous occurrence of more than one dialogic turn, it is impossible to
attribute the speech to a speaker, a fixed variable is used to mark a mixed-turn, e.g.:
*XYZ: chi è va bene //28 [who is it all right]
OVERLAPPING
Symbol   Description
<>       Brackets which mark the beginning and the end of the overlapped text of a given speaker
[<]      This symbol specifies the overlapping relation between two bracketed textual strings belonging to two speakers
*XYZ:    Turn of overlapped speech by non-identified speakers

28 In these cases the transcription may not be reliable for perceptual reasons.
3.2.7 Cross over dialogue convention
In spontaneous spoken language, the event of an intersection of dialogic turns by different speakers
occurs frequently; this means that a dialogic turn may arise before the end of the turn that
immediately precedes it.
Therefore, in those cases, the representation of dialogue as a vertically ordered collection of dialogic turns is difficult to maintain. Because the C-ORAL-ROM format forces the sequence of turns to be represented in temporal order, a cross-over dialogue convention has been adopted.
Cross-over dialogue convention
In the C-ORAL-ROM transcripts, a slash placed at the beginning of a turn, i.e. immediately after
the turn label and before the transcribed text, is a convention expressing that the turn in object is
only virtual, while the linguistic information of the turn belongs to the preceding turn of the same
speaker 29
Configuration of symbols    Operational definition
:/                          Relation that converts the turn in which it appears into a linear sequence which includes the preceding turn of the same speaker
Three major cases of cross-over dialogue have been detected in C-ORAL-ROM:
1) overlapping;
2) interruption by the listener;
3) complete intersection of turns.
3.2.7.1 Overlapping and cross-over dialogue
Because overlapping is a relation between texts belonging to different turns, it affects the dialogue
representation, which expresses the time dimension both in vertical and horizontal relations.
In principle, the dialogue representation system requires the linguistic information which follows
vertically in a subsequent turn to be also necessarily subsequent in time to the text reported in the
previous turn. However, this cannot be the case in most overlapped sequences, where a dialogic turn
continues despite the insertion of another turn in it that partially overlaps it, e.g.:
*ABA: Maria mi ha detto che [/] <che non viene> più / al concerto // perché non si sente //
*BAB: [<] <viene> //
The above representation of the cross-dialogue phenomenon, which is possible in the traditional CHAT format, has been abandoned in the C-ORAL-ROM format for two reasons: a) practical reasons concerning the mapping of the textual format onto the text-to-speech alignment format (where only one vertical temporal order is assumed); b) the above convention does not generalize to the other cases of cross-dialogue phenomena (see below).
In this system, when the overlapped text in the upper turn continues after the overlapping, the
Cross-over dialogue convention has been applied.
It must be noted that the convention has been applied in C-ORAL-ROM in two alternative
ways:
a) by transcribing over the overlapped turn until the first terminal break
*ABA: Maria mi ha detto che [/] <che non viene> più / al concerto //
*BAB: [<] < viene> //
*ABA: / perché non si sente //
b) by interrupting the transcription of the overlapped turn when the overlapping ends
*ABA: Maria mi ha detto che [/] <che non viene>
*BAB: [<] <viene> //
*ABA: / più / al concerto // perché non si sente //

29 Each slash immediately following the speaker mark is not counted as a prosodic break.
Only the former alternative compels the system to assume the generalization that each formal turn ends with a terminal break, and allows the alignment of each utterance; the latter ensures a frame which is more consistent with the representation of cross-over dialogue (intersection) in the paragraph below.
3.2.7.2 Intersection of turns
Some cases of complete intersection of dialogic turns may occur, without interruption or overlapping; e.g. in the following example, the listener, without really interrupting the speaker, starts a brief dialogic turn while the speaker goes on with his prosodic program:30
*MAX: in linea di principio / voi dovreste /
*ELA:
ah //
*MAX: / seguire le regole di trascrizione //
3.2.7.3 Interruption and cross-over dialogue
Even if overlapping is absent, or reduced to a few milliseconds, the speaker may interrupt his
utterance in connection with the intervention of the listener. For example, in the following situation,
a speaker inserts himself in a dialogic turn, interrupting it, but the other speaker goes on with his
turn despite the interruption:
*MAX: in linea di principio / voi dovreste +
*ELA:
ah //
*MAX: / seguire le regole di trascrizione //
3.2.8 Transcription conventions for segmental features
3.2.8.1. Non-understandable words
All words that are not properly understood are reported (and counted as word occurrences in a
frequency list) as:
xxx
3.2.8.2 Paralinguistic elements
All paralinguistic elements (laughing, crying, etc.) are not counted as word occurrences in a frequency list and are indicated as:
hhh
It is possible to detail what the element is in a dependent tier (see below).
30 In this case the utterance-based alignment cannot be maintained.
3.2.8.3. Fragments
All incomplete words and/or phonetic fragments are immediately preceded by &, as in the
following incomplete utterance:31
*MAX: mio &cug [my &cous]
or in the following retracting:
*MAX: mio &cug [/] mio fratello non nuota // [my &cous [/] my brother doesn’t swim]
or in the following lengthening of the programming time:32
*MOR: credo che si chiami / &eh / Giovanni // [I think they call him/ &eh / Giovanni]
3.2.8.4. Interjections
Interjections are not fragments; they are phonetic elements with dialogical function. Interjections
are transcribed following the lexicographical tradition of each romance language. New interjections
discovered in the corpus are transcribed tentatively and their presence is reported in a glossary
added to the corpus edition.
3.2.8.5. Non standard words
Non-standard words found in the corpus are transcribed tentatively and their presence is reported in
a glossary added to the corpus’s edition.
3.2.8.6. Non-transcribed words
When a word must be cancelled for reasons concerning privacy or decency it is substituted by a
variable, “yyy”, to be counted as a word33:
*MOR: il dottor yyy è un cretino // è proprio un bello yyy //
[doctor yyy is an idiot // he’s a right yyy //]
3.2.8.7. Non-transcribed audio signal
When, for whatever reason, part of the audio cannot be transcribed, a single variable “yyyy” is
inserted in the transcripts, not depending on the length of the signal. Said variable may be subject to
alignment, but will not be counted as a word34:
Symbol   Description
&        Mark for speech fragments
hhh      Paralinguistic or non-linguistic element
xxx      Non-understandable word
yyy      Non-transcribed word
yyyy     Non-transcribed audio signal
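A minimal sketch of the counting rules implied by this table and by the @Words metadata field: “xxx” and “yyy” count as word occurrences, “hhh” and “yyyy” do not, and prosodic-break symbols are never words. The Python implementation below is an assumption built on these rules only, and is meant for simple turns without overlap brackets:

NON_WORDS = {"hhh", "yyyy", "//", "/", "?", "...", "…", "+", "[/]", "[//]", "[///]", "[<]", "#"}

def count_words(turn_text):
    """Count word occurrences in one dialogic turn, per the conventions above."""
    return sum(1 for tok in turn_text.split() if tok not in NON_WORDS)

print(count_words("il dottor yyy è un cretino // è proprio un bello yyy //"))   # 11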
31 Incomplete words are never subject to rebuilding, except, of course, for systematic phonetic phenomena (elision, breaking off of the last syllable, etc.). Those phenomena are mirrored (or not) in the transcription, following the orthography of each language and the particular traditions in editing oral texts. Such choices must be detailed in the notes to the corpus edition.
32 Note that the lengthening of syllables, which is quite a common and perceptively relevant phenomenon in spoken language, is not marked in this system. However, following the philosophy of marking prosodic breaks, the system assumes the generalization that lengthening necessarily causes a prosodic break.
33 When a word is not transcribed in the text, it is substituted with a beep of similar length in the acoustic signal.
34 Music or advert fragments in media may be left untranscribed and substituted by a variable (which may or may not be aligned). In the case of very long music or advert fragments in a media corpus, the fragment can be cut and the cut noted in a dependent line.

3.2.9 Transcription conventions for human-machine interaction
Speech recognition systems fail to recognize a fair number of words. In order to insert 10,000 words of man-machine interactions in each C-ORAL-ROM corpus, the transcripts of the callers' real speech are provided. The following are the basic requirements for the transcription of human-machine interaction in accordance with the format. An alignment between the speech recognition output and the real data is provided:
i. Main line
*MAC: what the machine said, with the prosodic annotation
Dependent line (immediately below the *MAC line)
%alt: the synthesized text, i.e. the input text file from ITC-IRST
ii. Main line
*WOM or *MAN (if the speaker is a woman or a man): the standard C-ORAL-ROM transcription of what the speaker really said
Dependent line (immediately below the *MAN or *WOM line)
%alt: what the speech recognition system recognized; that is, what is in the ITC-IRST text file

The following are general requirements:
• In the metadata for speaker's parameters, in the field dedicated to the machine's artificial voice, the value “Woman” or the value “x” is reported.
• In the @Comments headers, the name of the original file in the ITC-IRST collection is reported.
• An alignment of human-machine interactions is not scheduled.
3.2.10 Quality assurance on format and orthographic transcription
The format correctness of both metadata and transcripts has been checked automatically through conversion of the original label .txt files into C-ORAL-ROM .xml files through the script detailed in 4.1.
The consortium ensures maximum accuracy in the transcripts, which have been compiled by PhDs and PhD students in linguistics. The original transcripts have been revised by at least two transcribers. The orthographic correctness of the transcripts has been checked automatically through a word spell check and through the automatic PoS tagging procedure of each corpus, which highlights all forms found in the corpora that are not consistent with the dictionary.
The orthographic conventions adopted in each language corpus according to the language tradition, and the non-standard forms, have been registered in Appendix 3 of these specifications.
3.3. Prosodic annotation scheme
3.3.1 Principles
C-ORAL-ROM's prosodic tagging is informed by the following series of principles:
• The prosodic tagging specifies each perceptively relevant prosodic break in the speech continuum.
• All positions between two words are considered possible positions for a prosodic tag. No within-word prosodic breaks are marked in C-ORAL-ROM.
• Prosodic breaks are distinguished according to two main qualities: terminal vs. non-terminal.
• Each between-words position necessarily has one of the following values with respect to the prosodic tagging of the resource: no break; terminal break; non-terminal break.
• Prosodic breaks are always tagged and reported according to the perceptual judgments of the transcribers, within the process of corpus revision and transcription accuracy.
• Prosodic tagging is part of the transcription and is reported within the text lines.
• The criterion for the segmentation of the speech flow into utterances is prosodic: each prosodic break qualified as terminal defines an utterance limit in the speech flow.
3.3.2 Concepts

Concept                        Definition
Prosodic break                 Perceptively relevant prosodic variation in the speech continuum such that it causes the parsing of the continuum into discrete prosodic units.
Terminal prosodic break        Given a sequence of one or more prosodic units, a prosodic break is known as terminal if a competent speaker assigns to it the quality of concluding such a sequence.
Non-terminal prosodic break    Given a sequence of one or more prosodic units, a prosodic break is known as non-terminal if a competent speaker assigns to it the quality of being non-conclusive.
Prosodic pattern (Utterance)   Each sequence of prosodic units (≥ 1) ending with a terminal prosodic break.35
3.3.3 Theoretical background
At the theoretical level, it has been noted that perception is highly sensitive to voluntary F0 variation (’t Hart et al., 1990). In accordance with this theoretical framework, the melodic pattern which scans the speech flow is an object of perception. Each tone unit of a prosodic pattern corresponds to a perceptually relevant pitch movement. A prosodic pattern may be simple (composed of a single tone unit) or complex (in which case it is made up of two or more tone units melodically linked together).
From another point of view, according to the speech act theory tradition, every utterance in
spoken language is the voluntary accomplishment of a speech act (Austin, 1962).
The background theory of the transcription format (Cresti, 1994, 2000) links the two
properties: voluntary F0 variations do not simply scan the utterance, but rather express the
information necessary to the accomplishment of speech acts. For this reason, the selection of textual
units corresponding to an utterance can be based on prosodic properties.
35 Heuristic definition of utterance in C-ORAL-ROM.
More specifically, it is possible to identify an utterance each time prosody enables the perception of the completion of a speech act; i.e. intonation permits the pragmatic interpretation of the text (Illocutionary criterion: Cresti, 1994, 2000).
In the transcription format, the identification of utterances in the sound continuum is linked
to the detection of perceptively relevant F0 movements with a terminal value. It is assumed that
there is no such thing as an utterance without a profile of terminal intonation (Karcevsky, 1931;
Crystal, 1975). Non-terminal tone units correspond to the scanning of an utterance by means of a
complex pattern.
In other words, the systematic correlation between terminal breaks and utterance limits is the heuristic method for the segmentation of speech into utterances, that is, the segmentation of the linguistic information in the resource with respect to the specific unit of analysis of spontaneous speech (see Miller & Weinert, 1998; Quirk et al., 1985; Biber et al., 1999; Cresti, 2000).
3.3.4 Conventions for prosodic tagging in the transcripts: types of prosodic breaks
Discriminating between terminal and non-terminal breaks is mandatory in all the C-ORAL-ROM transcripts. However, the C-ORAL-ROM format allows the prosodic tagging to be displayed at two hierarchical levels, with greater or lesser attention to:
• the annotation of the types of terminal breaks;
• fragmentation phenomena in the speech performance.
3.3.4.1 Terminal breaks (utterance limit)
A signal is inserted in the transcription each time a prosodic break is perceived as terminal by a
competent speaker. Each terminal break indicates the prosodic completion of the utterance.
At poor levels of transcription, interrogatives and intentionally suspended utterances are
marked with the generic terminal break “//”, with no supplementary specification.
At richer levels of transcription, specifications are optionally added by distinguishing
different types of terminal breaks, in accordance with the following closed list of values of the
utterance in object:
Value                               Description                                                                                         Symbol
all possible illocutionary values   Concluding prosodic break                                                                           //
interrogatives                      Concluding prosodic break such that the utterance has an interrogative value                        ?
intentional suspensions             Concluding prosodic break such that the utterance is left intentionally suspended by the speaker    …
3.3.4.2 Non-terminal breaks
The symbol “/” (single slash) is inserted in the transcription to mark the internal prosodic parsing of a textual string which ends with a terminal break; it is inserted in the position where a prosodic break that is not perceived as terminal is detected in the speech flow by a competent speaker.

Value           Description                      Symbol
Non-terminal    Non-conclusive prosodic break    /
3.3.5 Fragmentation phenomena
The annotation scheme embodies the generalization that a prosodic break always occurs when a fragmentation of the linguistic information arises in the speech performance; that is, in spontaneous speech, fragmentation always breaks the prosodic unit in which it arises.
When a prosodic break occurs in connection with a fragmentation phenomenon, at the
richer level of transcription, the prosodic tagging is specified in accordance with the complete set of
the following alternatives:
Symbol              Description                                                                                                            Type of break
+                   Concluding prosodic break such that the utterance is interrupted by the listener or by the speaker himself             Terminal
[/] (optional)      Non-conclusive prosodic break caused by a false start                                                                  Non-terminal
[//] (optional)     Non-conclusive prosodic break caused by a false start (retracting) such that the linguistic material is only           Non-terminal
                    partially repeated
[///] (optional)    Non-conclusive prosodic break caused by a false start (retracting) such that the linguistic material is not repeated   Non-terminal
3.3.5.1 Interruptions
The interruption (non-completion) of an utterance may be due to any reason: a change of the linguistic programming by the speaker, an interruption caused by the listener or by other events in
the environment. Interruptions may be accompanied by word fragmentation (interruption before the
end of the last word of the utterance) or, as is more frequently the case, may not feature any word
fragmentation.
interruption mark: +36
The interruption mark is counted as a kind of terminal break. The sign is inserted in the
transcription in the position where the utterance is interrupted because of an interruption made by
the listener, or because of a change in programming by the speaker (Examples in Appendix 1)
3.3.5.2 Retracting and/or restart and/or false start(s)
The retracting phenomenon (or false start) is the most frequent fragmentation phenomenon in
spontaneous speech. The speaker hesitates while trying to find the best way to express himself and
retracts his speech before choosing between two alternatives. This phenomenon is, generally,
clearly distinguishable from interruptions or changes in programming, reported above, which do not
feature speaker’s hesitations. Contrary to interruptions, the retracting phenomenon is almost always
accompanied by the repetition (complete or partial) of the linguistic material and clearly causes a
loss of the informational value of the retracted material, which is abandoned by the speaker in favor
of the chosen alternative. As in the case of interruption, in retracting phenomena the change of
prosodic envelope is again necessary. In other words, the retracting between two elements cannot be
accomplished in the same prosodic envelope. Therefore, retracting is always accompanied by a
prosodic break marked with the symbol “[/]”37
36 No distinction connected to possible causes of interruption is considered in this frame (e.g. the CHAT format marks when the interruption is caused by the listener). On the contrary, the format explicitly marks the distinction between interruption and intentional suspension, which frequently occurs at the end of utterances. Intentional suspension must be marked as a generic utterance limit “//”, or specified as “…”.
37 Retracting with complete or partial repetition can both be expressed by this symbol; therefore, in principle, all traditional CHAT [//] should be simplified to [/] in this system, although the use of both traditional CHAT symbols is tolerated in the C-ORAL-ROM format.
Retracting breaks are considered a type of non-terminal break and are highlighted only at richer levels of transcription. The symbol is inserted in the transcription after each set of fragments, in the position where a restart begins.
At poor levels of transcription, retracting phenomena are not treated as a special kind of prosodic break caused by fragmentation, and only the generic non-terminal break sign is used after each set of fragments in the position where a restart begins. Examples of retracting are available in Appendix 1.

3.3.5.3 Retracting/interruption ambiguity
In some cases it is hard to decide whether a fragmentation phenomenon fits the definition of “restart” or “interruption”. This can be the case when an alternative to the locution in object is realized, but no repetition is involved. In this case a supplementary sign, “[///]”, can optionally be used, marking the fact that a probable retracting phenomenon occurs with neither partial nor complete repetition of linguistic material. The ambiguous mark is counted as a non-terminal break.
3.3.6 Summary of prosodic break types

Symbol              Description                                                                                                            Type of break
//                  Conclusive prosodic break                                                                                              Terminal
? (optional)        Conclusive prosodic break such that the utterance has an interrogative value                                           Terminal
… (optional)        Conclusive prosodic break such that the utterance is left intentionally suspended by the speaker                       Terminal
+                   Conclusive prosodic break such that the utterance is interrupted by the listener or by the speaker himself             Terminal
/                   Non-conclusive prosodic break                                                                                          Non-terminal
[/] (optional)      Non-conclusive prosodic break caused by a false start                                                                  Non-terminal
[//] (optional)     Non-conclusive prosodic break caused by a false start (retracting) such that the linguistic material is only           Non-terminal
                    partially repeated
[///] (optional)    Non-conclusive prosodic break caused by a false start (retracting) such that the linguistic material is not repeated   Non-terminal
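Since terminal breaks define utterance limits, a transcribed turn at the poor level of transcription (i.e. without the optional retracting marks) can be segmented into utterances mechanically. A minimal sketch in Python, assuming only the terminal symbols of the table above:

import re

TERMINAL = re.compile(r"//|\?|…|\+")    # terminal break symbols from the table above

def utterances(turn_text):
    """Split one dialogic turn into utterances at terminal prosodic breaks."""
    return [p.strip() for p in TERMINAL.split(turn_text) if p.strip()]

print(utterances("I'm going home // I'm tired //"))    # ["I'm going home", "I'm tired"]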
3.3.7 Pauses
Pauses in the speech flow are indicated with “#” and are reported in the transcription only if
clearly perceived as a significant interruption of speech fluency. Other pauses can be reported
optionally. No distinction is made with respect to the length of the pause.
The “#” symbol is not a sign of prosodic parsing and never substitutes the marks dedicated to prosodic breaks.

Symbol   Description                Definition
#        Pause in the speech flow   A perceptively relevant silence in the speech continuum or, in any case, a silence longer than 250 ms.
3.4 Quality assurance on prosodic tagging
In C-ORAL-ROM each position between two words is considered a possible position for a prosodic
break. The prosodic tagging is based only on perceptual judgments and does not require any
specific linguistic knowledge, although the notion of speech act is always familiar to the
transcribers who annotated the C-ORAL-ROM corpus. The annotation of terminal and non-terminal
breaks has been accomplished by expert transcribers (PhDs and PhD students) with the following procedure:
1) Tagging of prosodic breaks simultaneously to the transcription by a first labeler
2) Revision of tagging by a different labeler, in connection with the revision of transcripts
3) Revision of tagging, in connection with the alignment, by a third labeler: After the
definition of each alignment unit, the labeler always challenges the presence of a terminal
break
This process ensures control on the inter-annotator relevance of tags and maximum accuracy in the
detection of terminal breaks. The accuracy with respect to non-terminal breaks is by definition
lower.
The level of inter-annotator agreement on prosodic tag assignment has been evaluated by an
external institution (LOQUENDO) on a statistically significant sampling of the C-ORAL-ROM
corpus. The evaluation report is attached in Appendix 4.
3.5. Dependent lines
Information of three types, regarding the text reported in a dialogic turn, is optionally given in non-textual lines following a dialogic turn (dependent lines in the CHAT tradition):
a) alternatives proposed for the transcription of the text;
b) comments regarding the pragmatic context and the visual modality of communication;
c) other comments.
Sign marking the dependent line    Definition of the type of information
%act:    Actions of a participant while speaking
%sit:    Events or states of affairs occurring in the speech situation
%add:    The participant to whom the speech is addressed
%par:    Gestures or paralinguistic aspects of the speaker
%exp:    Explanations necessary for the understanding of the turn, or of signs in the text (hhh)
%amb:    Description of the Setting in a media emission
%sce:    Description of the Scene in a media emission
%alt:    Alternative transcription
%com:    Transcribers' comments
The link between the information in the dependent line and a word, or series of words, in the
dialogic turn can be specified through a serial number, which indicates the position in the dialogic
turn of the word referred to.38 In the following example, the alternative reported refers to the third
word of the utterance
*MAX: voglio mangiare // pasta //
%alt: (3) basta
38 In long monologues, where the serial number of a word is high, the “: /” convention can be used for the insertion of a dependent line immediately following the commented item.

3.6. Alignment
Alignment is the tagging of each textual string of a transcribed session with two tags corresponding to temporal information in the speech file:
a) start of the alignment unit: the temporal point in the speech file corresponding to the start of the transcribed information
b) end of the alignment unit: the temporal point in the speech file corresponding to the end of the transcribed information
In C-ORAL-ROM, the alignment information is stored in an .xml file placed in the same directory as the text file and the audio file.
3.6.1 Annotation procedure
The alignment of C-ORAL-ROM texts is performed after textual transcription and prosodic tagging. The alignment of the transcribed texts is achieved by an expert operator with the assistance of the WinPitch Corpus alignment tool.
The alignment tagging task consists in the insertion of a tag ($) in the text after each terminal break
annotated in the transcript while the audio is played (at a reduced speed).
After loading the text and the sound files to be aligned, the operator merely listens to the
slow rate speech playback (between 1 and 7 times real-time), and clicks on the text segments as
they are perceived, in accordance with the general choice adopted in the project in order to define a
significant alignment unit (each string ending with a terminal break. See below).
Automatic dispatching of speaker turns on alignment layers is provided. The editing of
segment edges is achieved through user-friendly commands using a mouse, and many other features
such as the automatic scrolling of text and the dynamic adjustment of playback speed.
As output of this process, the system assigns two temporal units to each alignment unit:
a) end of the alignment unit: the time in the sound file at the instant in which the tag is inserted
b) start of the alignment unit: the time which marks the end of the previous segment
3.6.2. Prosodic tagging and the alignment unit
The Alignment of C-ORAL-ROM relies on two choices:
a) Specification of the Alignment unit at utterance level (as previously defined)
b) Rough equivalence between terminal breaks and utterance limit
Each text is aligned with respect to perceptively relevant terminal breaks annotated in the original
transcripts.
When the alignment conforms to the previous requirements, the alignment file corresponds to the acoustic database of all utterances of each speaker in the recorded session, each labeled with the transcribed utterance.
The French corpus has been aligned through pauses, each text string being surrounded by two
pauses of more than 200 ms (automatically detected).
3.6.3 Quality assurance on the alignment
The expert operator in charge of the alignment (a PhD or PhD student) always considers whether the accomplished alignment unit truly corresponds, in his perception, to a speech segment ending with a terminal break. The operator may add or delete the terminal breaks annotated in the original transcripts in accordance with his perception of the speech signal, thus improving the quality of the annotation.
The correspondence of an aligned segment to perceptually relevant breaks is revised immediately after the tag insertion: the same operator revises the perceptual relevance of the aligned segments and adjusts the edges if necessary.
The alignment of overlapped strings is achieved with lower accuracy.
3.6.4 WinPitch Corpus
WinPitch Corpus is an innovative software program for the computer-aided alignment of large corpora. It is built around a general-purpose speech analyzer program, allowing
real-time display of spectrographic and prosodic data. It provides an easy and precise method of
selection of alignment units, ranging from syllables to whole utterances up to dialogic turns, in a
hierarchical aligned data storing system.
The method is based on the ability to visually link a moving target with the perception of the
corresponding speech sound, played back at a rate reduced by at least 30%. Listening to slower
speech, an operator is able to highlight with a mouse click segments of text corresponding to the
speech sound perceived, and to generate bidirectional speech-text pointers defining the alignment.
This method has the advantage, over emerging automatic processes, of being effective even for poor
quality speech recordings, or in case of speakers’ voice overlaps. Existing text transcriptions can be
quickly aligned, and text entering and editing is also possible on the fly.
Speech playback speed variability is implemented by a streaming modified PSOLA-type
synthesizer, which performs, in real-time, the necessary fundamental frequency tracking, period
marking (for voiced segments) and additive synthesis required for good quality playback.
Segments deriving from the alignment can be defined on eight independent layers, with
automatic generation of the corresponding database, which can be saved directly in both XML and
Excel formats.
The alignment database allows direct access to the related speech segment, with automatic display of the spectrogram, waveform, fundamental frequency (F0) and intensity curves. Bi-directional access between text and sound segments is also possible.
Besides text-to-speech alignment, WinPitch Corpus has numerous features which allow an easy and efficient acoustic analysis of speech, such as real-time fundamental frequency tracking, spectrographic display, re-synthesis after editing of prosodic parameters, etc.
The F0 tracking is based on the popular and robust spectral comb method. The streaming process allows continuous operation on sound files of any length (within total computer memory limits), ensuring a very efficient use of the program.
3.6.5 DTD of WinPitch Corpus alignment files
<!-- DTD : WinPitch Corpus -->
<!-- Version : 1 -->
<!ELEMENT Alignment (TimeStamp, WinPitch, Trans, Layer1, Layer2, Layer3, Layer4, Layer5, Layer6, Layer7,
Layer8, UNIT* )>
<!ELEMENT TimeStamp (#PCDATA)>
<!ELEMENT WinPitch (#PCDATA)>
<!ELEMENT Trans (#PCDATA)>
<!ELEMENT Layer1 (#PCDATA)>
<!ELEMENT Layer2 (#PCDATA)>
<!ELEMENT Layer3 (#PCDATA)>
<!ELEMENT Layer4 (#PCDATA)>
<!ELEMENT Layer5 (#PCDATA)>
<!ELEMENT Layer6 (#PCDATA)>
<!ELEMENT Layer7 (#PCDATA)>
<!ELEMENT Layer8 (#PCDATA)>
<!ELEMENT UNIT (#PCDATA)>
<!ATTLIST TimeStamp
Value CDATA #REQUIRED
>
<!ATTLIST WinPitch
Program CDATA #REQUIRED
Version CDATA #REQUIRED
>
<!ATTLIST Trans
version CDATA #REQUIRED
creationDate CDATA #REQUIRED
audioFilename CDATA #REQUIRED
textFilename CDATA #REQUIRED
>
<!ATTLIST Layer1
Name CDATA #REQUIRED
ID CDATA #REQUIRED
Short CDATA #REQUIRED
Color CDATA #REQUIRED
>
<!ATTLIST Layer2
Name CDATA #REQUIRED
ID CDATA #REQUIRED
Short CDATA #REQUIRED
Color CDATA #REQUIRED
>
<!ATTLIST Layer3
Name CDATA #REQUIRED
ID CDATA #REQUIRED
Short CDATA #REQUIRED
Color CDATA #REQUIRED
>
<!ATTLIST Layer4
Name CDATA #REQUIRED
ID CDATA #REQUIRED
Short CDATA #REQUIRED
Color CDATA #REQUIRED
>
<!ATTLIST Layer5
Name CDATA #REQUIRED
ID CDATA #REQUIRED
Short CDATA #REQUIRED
Color CDATA #REQUIRED
>
<!ATTLIST Layer6
Name CDATA #REQUIRED
ID CDATA #REQUIRED
Short CDATA #REQUIRED
Color CDATA #REQUIRED
>
<!ATTLIST Layer7
Name CDATA #REQUIRED
ID CDATA #REQUIRED
Short CDATA #REQUIRED
Color CDATA #REQUIRED
>
<!ATTLIST Layer8
Name CDATA #REQUIRED
ID CDATA #REQUIRED
Short CDATA #REQUIRED
Color CDATA #REQUIRED
>
<!ATTLIST UNIT
speaker CDATA #REQUIRED
startTime CDATA #REQUIRED
endTime CDATA #REQUIRED
Channel CDATA #REQUIRED
Tag CDATA #REQUIRED
>
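For illustration, the UNIT elements of an alignment file conforming to this DTD can be read with standard XML tooling. A minimal sketch in Python; the file name is illustrative, and the startTime/endTime attributes (CDATA in the DTD) are assumed to hold numeric values:

import xml.etree.ElementTree as ET

def read_units(path):
    """Yield (speaker, startTime, endTime, text) for each aligned unit."""
    root = ET.parse(path).getroot()            # the <Alignment> element
    for unit in root.iter("UNIT"):
        yield (unit.get("speaker"),
               float(unit.get("startTime")),
               float(unit.get("endTime")),
               (unit.text or "").strip())

# e.g.: for spk, t0, t1, text in read_units("ifamdl01.xml"): print(spk, t0, t1, text)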
3.7. PoS tagging and lemmatization
The C-ORAL-ROM corpus comprises, for each session, a .txt file in which a “Part of
Speech” label and a “Lemma” label are assigned to each form of the transcribed text.
The C-ORAL-ROM project was not aimed at solving the puzzling questions of automatic
PoS tagging of speech. The project’s main goal is to provide a set of empirical data on spontaneous
speech tagging, so as to improve the state of the art for further analysis on spontaneous speech PoS
tagging. Nevertheless, the PoS tagged entry of the C-ORAL-ROM corpus also allows a better
exploitation of the resource for linguistic research and language technology purposes.
C-ORAL-ROM's four sub-corpora are automatically tagged with the language technologies already available for each language, and the project provides an evaluation of the level of accuracy reached by those tools on spontaneous speech.
Tagging has been accomplished using different tag sets for each language, because of the
differences among the languages and the different traditions in natural language processing systems.
Each team defined a tagging strategy for what concerns:
1) Tool
2) Tag set
3) Main choices: multi-word expression, proper names, auxiliaries and compound tenses
The four tag sets, reported below, despite some variations due to the language, the tool adopted and the tradition of each team, are all oriented towards the EAGLES standard and are for this reason highly comparable. The comparison tables are given in Appendix 2. For each language resource, details on the tool used and on the accuracy reached are reported below.
In order to ensure a level of comparability within the whole corpus, a compulsory minimal threshold of information has been established regarding:
1. Minimal Tag Set requirements
2. Tagging Format
3. Frequency Lists format
3.7.1. Minimal Tag Set requirements
1. The Part of Speech (PoS) tag is compulsory for each class of words and, if applicable, for
Locutions;
2. Specifications on Mood and Tense features are compulsory for Verbs, or, alternatively, the
feature Finite/Non-finite [± person], necessary for the detection of verb-less utterances, must be
specified;
3. Specification of the Coordinative/Subordinative feature is compulsory for Conjunctions;
4. Specification of Common/Proper feature is compulsory for Nouns;
The common level of morpho-syntactic tagging involves a distinction within the set of not-properly-linguistic elements that are peculiar to spontaneous speech texts, providing two special tags to this aim (the code of the tag depends on the specific tag set):
a) para-linguistic elements, e.g. mh, he, ehm (used for filled pauses etc.)
b) extra-linguistic elements, e.g. hhh (used for laughs, coughs etc.)
3.7.2. Tagging Format
The output of the morpho-syntactic tagged text represents both the speaker codes and the prosodic breaks, in order to allow context-bound grammatical studies (within utterances or tone units). The text is given horizontally, to preserve the legibility of dialogic turns.
The format of the tag for each lexical entry is composed of three elements divided by backslash separators (\), as follows:
- the first element is the word-form
- the second element is the LEMMA (in capital letters)
- the third element is the code which includes the PoS category (in capital letters) and
optionally the morpho-syntactic description (msd) of the word form
The result is a pattern with the following structure:
wordform\LEMMA\POSmsd
e.g.
*CAR: come\COME\B andò\ANDARE\Vs3ir / a\A\E casa\CASA\S vostra\VOSTRO\POS ?
Regarding multiwords, there are two possibilities for their tagging, with respect to the different
ways to identify locutions in each tag set:
e.g.
(a) a_priori\A_PRIORI\B
(b) a\A_PRIORI\B1 priori\A_PRIORI\B2
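A minimal sketch of how the wordform\LEMMA\POSmsd pattern can be decoded (Python; illustrative, using the Italian example above; both multiword options split the same way):

def parse_tagged_token(token):
    """Split one tagged entry into (wordform, LEMMA, POSmsd)."""
    wordform, lemma, posmsd = token.split("\\")
    return wordform, lemma, posmsd

line = "*CAR: come\\COME\\B andò\\ANDARE\\Vs3ir / a\\A\\E casa\\CASA\\S vostra\\VOSTRO\\POS ?"
for tok in line.split():
    if "\\" in tok:                    # skip speaker codes and prosodic-break symbols
        print(parse_tagged_token(tok))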
3.7.3. Frequency Lists Format
The output of the frequency lists is standardly encoded, to ensure the highest
comparability between the four romance languages. The frequency lists are presented in two
formats:
a) by lemmas, featuring 4 columns (rank, lemma, POS, frequency); e.g.

rank   LEMMA    POS   frequency
===============================
1.     IL       R     4213
2.     ESSERE   V     3840
…      …        …     …
32.    BELLO    A     352
33.    CASA     S     289
…      …        …     …
b) by word-forms, featuring 5 columns (rank, form, lemma, POSmsd, frequency); e.g.

rank   form    LEMMA    POSmsd   frequency
==========================================
1.     il      IL       R        2214
2.     è       ESSERE   Vs3ip    1587
…      …       …        …        …
39.    bello   BELLO    A        125
…      …       …        …        …
46.    bella   BELLO    A        109
…      …       …        …        …
150.   era     ESSERE   Vs3ii    56
…      …       …        …        …
589.   era     ERA      S        22
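A list in format (a) can be derived mechanically from the tagged texts. A minimal sketch in Python; the input data are illustrative, and the assumption that the PoS category is the leading capital letter of the POSmsd code is a simplification:

from collections import Counter

def lemma_frequency_list(tagged_tokens):
    """Rank (LEMMA, POS) pairs by frequency, as in format (a) above."""
    counts = Counter((lemma, posmsd[0]) for _, lemma, posmsd in tagged_tokens)
    return [(rank, lemma, pos, freq)
            for rank, ((lemma, pos), freq) in enumerate(counts.most_common(), start=1)]

tokens = [("il", "IL", "R"), ("casa", "CASA", "S"), ("il", "IL", "R")]
for row in lemma_frequency_list(tokens):
    print("{}.\t{}\t{}\t{}".format(*row))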
3.7.4 Tag sets
3.7.4.1 French tagset
=====================

Verbs
-----
VER:CON:PRE   Conditional, present
VER:IMP:PRE   Imperative, present
VER:IND:FUT   Indicative, future
VER:IND:IMP   Indicative, imperfect
VER:IND:PAS   Indicative, past
VER:IND:PRE   Indicative, present
VER:INF       Infinitive
VER:PAR:PAS   Participle, past
VER:PAR:PRE   Participle, present
VER:SUB:IMP   Subjunctive, imperfect
VER:SUB:PRE   Subjunctive, present

Nouns
-----
NOM:COM       Common
NOM:PRO       Proper

Adjectives
----------
ADJ:ORD       ordinal
ADJ:QUA       qualifying

Conjunctions
------------
CON:COO       coordination
CON:SUB       subordination

Determiners
-----------
DET:DEF       definite
DET:DEM       demonstrative
DET:IND       indefinite
DET:INT       interrogative
DET:POS       possessive

Pronouns
--------
PRO:DEM       Demonstrative
PRO:IND       Indefinite
PRO:PER       Personal
PRO:POS       Possessive
PRO:RIN       Relative/interrogative

Numerals
--------
NUM

Interjections and discourse particles
-------------------------------------
INT

Adverbs
-------
ADV

Prepositions
------------
PRE

Unclassifiable
--------------
XXX:ETR       Foreign word
XXX:EUP       Euphonic particle (-t-, l')
XXX:TIT       Title
3.7.4.2 Italian tagset
==============
Verbs (main tags)
-----------------
V       Main verb
VW      Non-main verb
V_E     Verb + clitic

Verb features (finite forms)
----------------------------
number:   s  singular;  p  plural
person:   1  first;  2  second;  3  third
mood:     i  indicative;  c  subjunctive;  d  conditional;  m  imperative
tense:    p  present;  r  past;  i  imperfect;  f  future

Verb features (non-finite forms)
--------------------------------
gender (participles):  m  masculine;  f  feminine;  n  common
number (participles):  s  singular;  p  plural
mood:     f  infinitive;  p  participle;  g  gerund
tense:    p  present;  r  past

Prepositions
------------
E       Simple
E_R     Fused (+ article)

Conjunctions
------------
CC      coordination
CS      subordination

Articles
--------
R

Demonstratives
--------------
DIM

Indefinites
-----------
IND

Personals
---------
PER

Possessives
-----------
POS

Relatives/Interrogatives
------------------------
REL

Numerals
--------
N       Cardinal
NA      Ordinal

Interjections
-------------
I

Nouns
-----
S       Common
SP      Proper

Adjectives
----------
A

Adverbs
-------
B

Non-Standard Linguistic Elements
--------------------------------
(PoS)+K     Foreign word
(PoS)+Z     Invented word
ONO         Onomatopoeia
ACQ         Acquisition

Non-Linguistic Elements
-----------------------
PLG     Paralinguistic
XLG     Extralinguistic
X       Non-understandable word
3.7.4.3 Portuguese tagset
MAIN VERBS
----------
Vpi     Present Indicative
Vppi    Past Indicative
Vii     Imperfect Indicative
Vmpi    Pluperfect Indicative
Vfi     Future Indicative
Vc      Conditional
Vpc     Present Subjunctive
Vic     Imperfect Subjunctive
Vfc     Future Subjunctive
VB      Infinitive
VBf     Inflected Infinitive
VG      Gerundive
Vimp    Imperative
VPP     Past Participles in Compound Tenses
PPA     Adjectival Past Participles

AUXILIARY VERBS
---------------
VAUXpi    Present Indicative
VAUXppi   Past Indicative
VAUXii    Imperfect Indicative
VAUXmpi   Pluperfect Indicative
VAUXfi    Future Indicative
VAUXc     Conditional
VAUXpc    Present Subjunctive
VAUXic    Imperfect Subjunctive
VAUXfc    Future Subjunctive
VAUXB     Infinitive
VAUXBf    Inflected Infinitive
VAUXG     Gerundive
VAUXimp   Imperative

NOUN
----
Np      Proper Noun
Nc      Common Noun

INDEFINITE
----------
INDi    Invariable
INDv    Variable

POSSESSIVE
----------
POS

RELATIVE/INTERROGATIVE/EXCLAMATIVE
----------------------------------
RELi    Invariable
RELv    Variable

DEMONSTRATIVE
-------------
DEMi    Invariable
DEMv    Variable

PERSONAL PRONOUN
----------------
PES

CLITIC
------
CL

NUMERAL
-------
NUMc    Cardinal
NUMo    Ordinal

INTERJECTION
------------
INT

ADVERBIAL LOCUTION
------------------
LADV

ADJECTIVE
---------
ADJ

PREPOSITIONAL LOCUTION
----------------------
LPREP

ADVERB
------
ADV

CONJUNCTIONAL LOCUTION
----------------------
LCONJ

PREPOSITION
-----------
PREP

PRONOMINAL LOCUTION
-------------------
LPRON

CONJUNCTION
-----------
CONJc   Coordinative
CONJs   Subordinative

EMPHATIC
--------
ENF

ARTICLE
-------
ARTi    Indefinite
ARTd    Definite

FOREIGN WORD
------------
ESTR

ACRONYM
-------
SIGL

DISCURSIVE LOCUTION
-------------------
LD

EXTRA-LINGUISTIC
----------------
EL

WITHOUT CLASSIFICATION
----------------------
SC

PARA-LINGUISTIC
---------------
PL

WORD IMPOSSIBLE TO TRANSCRIBE
-----------------------------
Pimp

FRAGMENTED WORD OR FILLED PAUSE
-------------------------------
FRAG

SEQUENCE IMPOSSIBLE TO TRANSCRIBE
---------------------------------
Simp

DISCOURSE MARKER
----------------
MD

SUB-TAGS
--------
:       Ambiguous form
+       Contracted forms
-       Hyphenated forms (excepting compounds)
3.7.4.4 Spanish tagset
VERBS
-----
V       Verb

Verb features (finite forms)
----------------------------
number:   s  singular;  p  plural
person:   1  first;  2  second;  3  third
mood:     ind  indicative;  sub  subjunctive;  cond  conditional;  imp  imperative
tense:    p  present;  s  past simple;  f  future simple;  indef  past simple (indefinido);  i  imperfect

Verb features (non-finite forms)
--------------------------------
gender (participles):  M  masculine;  F  feminine
number (participles):  S  singular;  P  plural
mood:     inf  infinitive;  ger  gerund;  par  participle
tense:    P  past (participles)

AUXILIARY VERBS
---------------
AUX     auxiliary verb
(the finite and non-finite auxiliary features follow the same codes as the verb features above: person, mood, tense, gender, number)

NOUNS
-----
N       Noun
Features:
gender:   m  masculine;  f  feminine;  ig  invariable in gender
number:   s  singular;  p  plural;  in  invariable in number
other:    P  Proper;  i  invariable (in general)

ADJECTIVES
----------
ADJ     Adjective
Features:
gender:   m  masculine;  f  feminine;  ig  invariable (in gender)
number:   s  singular;  p  plural;  in  invariable (in number)
other:    i  invariable (in general)

ADVERBS
-------
ADV     Adverb

PREPOSITIONS
------------
PREP    Preposition

CONJUNCTIONS
------------
C       Conjunction

DETERMINERS
-----------
DET     Determiner
Features:
type:     d  article;  dem  demonstrative (adjective);  poss  possessive (adjective)
gender:   m  masculine;  f  feminine;  ig  invariable in gender
number:   s  singular;  p  plural

INTERJECTIONS
-------------
INT     Interjection

DISCOURSE MARKERS
-----------------
MD      discourse marker

PRONOUNS
--------
P       Pronoun
Features:
PER     Personal (Pronoun)
person:   1  first;  2  second;  3  third
gender:   M  masculine;  F  feminine;  IG  invariable in gender
number:   s  singular;  p  plural
R       relative pronoun

QUANTIFIERS
-----------
Q       Quantifier
3.7.5 Automatic PoS tagging: tool and evaluation
3.7.5.1 Italian: tool and evaluation
The automatic lemmatization and morpho-syntactic annotation of the Italian C-ORAL-ROM corpus is based on the PiSystem set of tools, created and developed by Eugenio Picchi at the ILC (Pisa).
PiSystem (http://www.ilc.cnr.it/pisystem/) is an integrated procedure for textual and lexical analysis, which consists of the following main components:
1. DBT text encoding and analysis modules;
2. a morpho-syntactic analyzer (PiMorpho);
3. a Part of Speech tagger and lemmatizer (PiTagger).
The DBT encoding provides a first parsing of the text (tokenization) and represents the pre-analysis level of the morphological analyzer. The encoded text is then passed to PiMorpho, which assigns all the possible alternatives of Morpho-Syntactic Description (MSD) to each lexical item. For the PoS disambiguation, PiTagger uses two other input resources: an electronic dictionary and a training corpus. In detail, these resources consist of:
a. the DMI, a morphological dictionary of the Italian language, developed at the ILC-CNR, Pisa; it collects 106,090 lemmas encoded with PoS specifications and inflectional tags (Zampolli and Ferrari 1979; Calzolari, Ceccotti and Roventini 1983);
b. a Training Corpus of 50,000 words, manually tagged;
c. a statistical database extracted from the Training Corpus (BDR).
These resources are built on a consistent tag set for PoS and morpho-syntactic annotation which strictly adheres to the EAGLES tag set. The disambiguation phase is based on statistical measurements (on trigrams) extracted from the Training Corpus and stored in the BDR. The main program of this procedure estimates the maximum-likelihood pattern among the possible alternatives given by the morphological component, with a transitional probabilistic method (Picchi 1994).
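As an illustration of the transitional probabilistic method (a sketch only, not Picchi's actual implementation): given the alternative tags proposed by the morphological component, the tagger scores each candidate tag sequence with trigram statistics from the training corpus and keeps the maximum-likelihood one. Real systems use dynamic programming rather than the brute-force enumeration shown here, and the toy probabilities below are invented for the example:

import itertools, math

def best_tagging(alternatives, trigram_logprob):
    """alternatives: list of candidate-tag lists, one per token.
    trigram_logprob: function (t1, t2, t3) -> log P(t3 | t1, t2)."""
    best, best_score = None, -math.inf
    for seq in itertools.product(*alternatives):
        padded = ("#", "#") + seq            # "#" marks utterance start
        score = sum(trigram_logprob(padded[i], padded[i + 1], padded[i + 2])
                    for i in range(len(seq)))
        if score > best_score:
            best, best_score = seq, score
    return best

# toy model: "la" may be article (R) or pronoun (PER); nouns follow articles
def toy_model(t1, t2, t3):
    table = {("#", "#", "R"): -0.5, ("#", "#", "PER"): -1.5,
             ("#", "R", "S"): -0.3, ("#", "PER", "S"): -2.0}
    return table.get((t1, t2, t3), -3.0)

print(best_tagging([["R", "PER"], ["S"]], toy_model))   # ('R', 'S')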
In this environment, the level and the precision of the analysis depend on the information archived in the BDR. The Training Corpus (manually tagged with lemma, PoS and MSD) is the source of the statistics on word-form associations.
Statistics must be defined on a specific level of linguistic information (lexical patterns or PoS sequences). The BDR uses a hybrid set of specifications called Disambiguation Tags, which defines the threshold of the analysis relevant for statistical measurements, i.e.:
a. PoS codes;
b. the lemmas ESSERE and AVERE;
c. the MSD tags on non-finite moods of verbs.
This is the same information used by PiTagger in the disambiguation phase.
Tag Set
The PoS tag set used for the Italian C-ORAL-ROM is, for the greater part, in agreement with the EAGLES recommendations for the morpho-syntactic annotation of the Italian language.
The C-ORAL-ROM tag set features some adjustments in relation to specific category spaces:
1. with regard to verbs, in order to distinguish between main verbal instances and non-main ones (auxiliaries and copulas);
2. with regard to pronouns and determiners, in order to achieve a semantics-oriented classification of such lexical objects (not depending on their functional value).
Main choices in the PoS tag set
a. Main and non-main verbs.
The morpho-syntactic description of verbal forms represents a crucial problem for automatic PoS tagging. At the state of the art, it is not easy to achieve a good level of automatic recognition for both auxiliaries and copulas. The Italian C-ORAL-ROM adopted a way out of this problem, encoding both auxiliaries (ESSERE and AVERE) and copulas (ESSERE) with the [non-main] feature. An automatic post-editing procedure provides a distinction between the different uses of these verbs: MAIN ones (fully semantic, predicative verbs) and NON-MAIN ones (support verbs: auxiliaries and copulas). The tag dedicated to NON-MAIN verbs is a capital W following V, the main tag.
b. Pronouns and pronominal adjectives.
With regard to the categorical space outlined by the traditional categories of "pronouns" and "pronominal adjectives", the Italian C-ORAL-ROM tag set disagrees with the EAGLES standards. Specifically, there is no distinction between pronominal and adjectival uses of the possessives, indefinites and determinatives, which all merge into the same category. This choice follows the strategy used in the tag set for the Portuguese C-ORAL-ROM.
The EAGLES tag set is structured on a sub-categorization system within Pronouns and Determiners (macro-categories which represent the general PoS tags); on the contrary, in the Italian C-ORAL-ROM tag set, following an extensive criterion, each category is treated as a proper PoS, merging the pronominal and the adjectival uses of the same lemmas.
Extended tag set for spoken language
The spoken language transcripts of C-ORAL-ROM corpora contain elements of different types.
More specifically, besides the linguistic elements which belong to the dictionary, there is also a
wide variety of non-standard linguistic forms in the corpora. The following cases have been
distinguished:
a. foreign words; (PoS) + [K]; (they\PERK);
b. new formations; (PoS) + [Z]; (bruttozzi\AZ, torniante\SZ);
c. onomatopoeia; [ONO]; (fffs\ONO, zun\ONO);
d. language acquisition forms; [ACQ]; (aua\ACQ, cutta\ACQ).
Moreover, a wide series of non-linguistic phenomena may also be involved in the speech flow.
Such phenomena are identified in the transcripts following the C-ORAL-ROM format and are
encoded in the tagged resource as special elements:
a. para-linguistic elements; [PLG]:
- word fragments (&pa\PLG, &costrui\PLG);
- phonetic support elements and pause fillers (&he\PLG, &mh\PLG);
b. extra-linguistic elements (laughs and coughs); [XLG]; (hhh\XLG);
c. non-understandable words; [X]; (xxx\X).
EVALUATION
The C-ORAL-ROM Italian tagged resource comprises 306,638 tokens. Since the non-standard and regional forms were inserted in a special pre-dictionary (1,985 word forms), the PiTagger system reached a 100% recall on the number of tokens:

total number of tokens:  306,638
tagged tokens:           306,638
total recall = 100%
The evaluation of the precision of the automatic PoS-tagging procedure is based on a random sampling of 1/100 of the tokens, picked out of the whole C-ORAL-ROM Italian resource. Each token is extracted from a different (non-contiguous) utterance, also randomly selected. The evaluators had to express a judgment on the correctness of the tagging with respect to a selected word in its utterance context.
The random sample obtained sufficiently represents the whole: its size has been considered statistically sufficient, as it ensures a 95% confidence interval lower than 1%.
The manual revision of the tagged samples evaluates the automatic procedure with respect to different degrees of accuracy: errors in sub-categorization, errors in the morpho-syntactic description of verbs, errors in main tag category and in lemma assignment. The statistical precision at these levels of annotation is shown in the following table:
Table 1. Precision of the automatic tagging procedure

Total Sampling                              3100
Non Decidable [39]                            31     1.00%
Total Evaluated                             3069
All Correct                                 2726    88.82%
Correct POS tag                             2773    90.36%
1) PoS tag errors                            296     9.64%
2) Lemma errors                                2     0.07%
3) Sub-categorization errors                  38     1.24%
4) Morpho-syntactic description errors         7     0.23%
Total Error                                  343    11.18%
The percentage of accuracy of the annotation system shows that the more relevant errors (PoS tag and lemma errors) affect around 10% of tokens. [40]
The results of the evaluation (restricted to the PoS level) are shown in a confusion matrix, listing the errors in the tag assignment with respect to each word class: in each line, the errors are recorded by category, while the columns report the correct PoS to which the token belongs. [41] The last column records the number of cases in which the PoS is over-extended, while the bottom line represents cases of under-extension.
[39] The 31 non-decidable cases (second row in the table) are constituted by words which are under-specified with regard to their PoS value in the given context (i.e. it is impossible to express a disambiguation judgment). These cases (1% of the total) have not been counted as part of the evaluation samples: (e.g.) che\CHE\Conj?Relative? mi\MI\PER piace\PIACERE\Vs3ip tanto\TANTO\B // [which\that I like so much]
[40] Tested on a corpus of official documents of the EU Commission (500,000 tokens, reviewed by Enrica Calchini), PiTagger reached a 97% degree of accuracy. The same recognition rate was reached on the LABLITA literary sampling corpus (60,000 tokens).
[41] For example, the number 30 (second line, third column) corresponds to 30 tokens wrongly tagged as nouns (S), which should instead be tagged as adjectives (A).
The collected data show that most mistakes occur in the Nouns category (103 out of 296 total errors). Considering the frequency of the category in the evaluation corpus, the projected over-extension of this word class would be around 15%. This probably depends on the statistical normalization applied by the PiTagger system, which, very roughly, assigns to Nouns the highest probability of occurrence (see Picchi 1994).
Another interesting result is that the Adverbs and Interjections categories are consistently under-extended with respect to their actual weight (over 10% under-extension for Adverbs and 8.5% under-extension for Interjections). Verbs show a lower incidence of errors (3.95% over-extension and 3.44% under-extension), with a roughly correct projection of the frequency of the category on the total.
From the confusion matrix, it is possible to obtain data on precision, recall and f-measure for each category; the following table details these measurements, which give an overall estimate of the automatic tagging procedure. [42]
Table 2. Confusion matrix
[For each row (assigned PoS: V, S, A, B, R, E, C, DIM, IND, PER, POS, REL, N, NA, I) the errors are recorded by category, while the columns report the correct PoS. The row totals (over-extension of each assigned category) correspond to the fp column of Table 3, and the column totals (under-extension of each correct category) to the fn column; the overall total is 296 errors.]
[42] The lines in the table are sorted by the f-measure value (last column), a standard overall measure of the general accuracy of automatic procedures. The PoS tags featuring a higher f-measure are the ones assigned with higher precision. For the PoS marked with an asterisk in the first column, the number of occurrences in the sampling corpus is too low to ensure an adequate evaluation.
Table 3. Precision, recall and f-measure for each PoS

PoS     tp     fp     fn    precision    recall     f-measure
DIM      65      0      0    100.00%     100.00%     1.0000
E       253      7     11     97.31%      95.83%     0.9656
V       559     23     20     96.05%      96.55%     0.9630
POS*     10      1      0     90.91%     100.00%     0.9524
R       181     11     14     94.27%      92.82%     0.9354
I       195      6     23     97.01%      89.45%     0.9308
B       502     25     85     95.26%      85.52%     0.9013
PER     170     27     11     86.29%      93.92%     0.8995
S       440    103     19     81.03%      95.86%     0.8782
N        40      1     11     97.56%      78.43%     0.8696
C       200     32     39     86.21%      83.68%     0.8493
IND      50     21      2     70.42%      96.15%     0.8130
A        90     27     33     76.92%      73.17%     0.7500
REL*     24     12     23     66.67%      51.06%     0.5783
NA*       2      0      5    100.00%      28.57%     0.4444
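For reference, the measures in Table 3 follow the standard definitions, where tp, fp and fn are the true positives, false positives (over-extensions) and false negatives (under-extensions) of each PoS:

\mathrm{precision} = \frac{tp}{tp+fp}, \qquad
\mathrm{recall} = \frac{tp}{tp+fn}, \qquad
F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

For instance, for the E row: precision = 253/260 = 97.31%, recall = 253/264 = 95.83%, and F = 0.9656, matching the table.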
3.7.5.2 French: tool and evaluation
Tagset
The most notable features of the tagset used for the French corpus are (in alphabetical order):
o The distinction between two categories of adjectives: qualifying and ordinal. Other words that
are traditionally called adjectives are listed under determiners or numerals.
o The grouping of interjections and “discourse particles”
o The grouping of cardinals in a “numeral” category
o Fusions of preposition + determiner (e.g. du = de + le) simply received the tag corresponding to the determiner (e.g. du/DET:DEF).
o The fact that relative and interrogative pronouns were not distinguished (qui, quoi, etc.), as this
remains unachievable using the current automatic processing technology.
o The traditional distinction between the conjunction que and the relative pronoun que has been maintained, in spite of strong arguments found in modern linguistics in favor of a complementizer (conjunction) analysis of the "relative" que (either in relative clauses or in cleft sentences). This is because many possible future syntactic analyses could be based on the distinction between ordinary and movement clauses. [43]
o A residual XXX category containing unclassifiable cases such as foreign words, titles or
“euphonic particles” that had no precise linguistic function (-t-, l' before on).
o Multi-word expressions were detected in every grammatical category.
Tool
The French tagging strategy is a rather complex one. It is based on Cordial Analyzer, which at the moment is probably the best morpho-syntactic tagger for French. Cordial Analyzer was developed by the Synapse Development company [44] with considerable input from our team. It uses the technology and modules developed by Synapse, which have been incorporated in Microsoft Word's French spelling corrector.
The characteristics of this tool are:
• a very large dictionary (about 142,000 lemmas and 900,000 forms)
• a combination of statistics and detailed linguistic rules
• shallow parsing capabilities, which enable taking into account relations other than strictly local ones
• a remarkable robustness with respect to errors and non-standard language.
This last feature (which comes from the sophisticated error-correction modules) is particularly important for processing spoken corpora, and explains to a fair degree the very high results obtained on the French C-ORAL-ROM tagging (see below). For example, the tagger is capable of detecting repetitions, such as le le euh le chien, and is not fooled by occurrences that are not "normal" bigrams in the language, such as le le. Such repetitions very often lead to tagging errors for most taggers (usually trained on written corpora).
[43] Our tagging allows automatic retrieval of both structures. This decision could be considered somewhat contradictory with the conjunction analysis of the comparative que: it has indeed been argued that the comparative clause que has properties of movement. In that case as well, the decision is a practical one: comparative structures can easily be retrieved using the quantifier that is always correlated with this type of que clause.
[44] http://www.synapse-fr.com
Cordial Analyzer has been supplemented by our own dictionary and various pre- and post-processing modules developed by our team that adapt it to spoken language and correct a number of residual errors. These modules also change some of Cordial's tagging decisions. The most noticeable example concerns "discourse particles" such as bon or quoi, which are not at all treated by Cordial, given its orientation towards written language. One of our post-processing modules changes the original tagging when appropriate, using linguistic rules based on the local context. Examples:
le chocolat est bon\ADJ => no change
et alors bon\ADV je lui ai dis => bon\INT
Since Cordial can flag spelling errors and/or unknown words, the tagging process enabled
us to detect spelling errors that remained in the transcribed corpus. The errors were manually
corrected before the final tagging.
After correcting the spelling errors in the transcription, only 311 tokens remained unknown
to the tagger (200 different types). All these tokens have been checked manually and added to an ad
hoc dictionary used by a final tool that tagged them appropriately. Among these tokens, we found
neologisms (C-plus-plussien), rare or specialized words (carbichounette), familiar abbreviations
(ophtalmo), alphanumeric acronyms and codes (Z14), and foreign words (tribulum, brownies).
Evaluation
In the case of the unknown words described above, the system still attributed a tag like NOM:COM (which in many cases is right: ambiguity was mostly with XXX:ETR). In that sense, system recall was 100%. However, since an ad hoc dictionary was created, it is probably fair to exclude the 311 tokens concerned from the recall. Even this did not change the results much, since it applied to a corpus of about 300,000 tokens. Recall measured that way is still 99.9%!
The precision figure is more interesting. It was evaluated by drawing a 20-token sample
from each of the 164 texts composing the corpus, i.e. a sub-corpus of 3280 tokens or about 1/100th
of the entire corpus (by token we mean either a single word or a multiword unit). Elementary
statistics show that this size is enough to ensure a 95% confidence interval no larger than 1%, and
therefore perform a very precise evaluation, as will be shown below.
Tagging was checked and corrected manually, and the errors were categorized according to several criteria:
• error on main category
• main category correct, but error on subtype
• error on lemma
• error on multiword grouping
The tagger's behavior was excellent, since only 58 tokens presented an error of one type or the other (or occasionally two errors combined). This amounts to a 1.77% error rate, i.e. a precision of 98.23%. This is a very high figure by current standards, especially for spoken corpora. Table 6 lists the errors by type. The rightmost column shows a 95% confidence interval (computed using the binomial law). This can be used to evaluate the impact of possible variations in sampling, as well as the sample size. We can see that in all cases the confidence interval is smaller than 1%, which is more than enough for this type of evaluation, given that the disagreement among linguists on the correctness of tags is probably of the same order, if not greater.
Table 6. Distribution of error types

Type of error          nb of errors    % error    precision    95% CI
Cat                              41      1.25%      98.75%     98.3% - 99.1%
SubCat                            8      0.24%      99.76%     99.5% - 99.9%
Tag (Cat or SubCat)              49      1.49%      98.51%     98.0% - 98.9%
Lemma                             4      0.12%      99.88%     99.7% - 100.0%
Multiword                         8      0.24%      99.76%     99.5% - 99.9%
Any                              58      1.77%      98.23%     97.7% - 98.7%
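The exact method behind the intervals is not specified beyond "the binomial law"; a normal approximation to the binomial reproduces the figures closely, as this illustrative Python sketch shows:

import math

def binomial_ci(errors, n, z=1.96):
    """Approximate 95% CI for a precision estimate (normal approximation)."""
    p = 1 - errors / n                 # precision point estimate
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = binomial_ci(58, 3280)         # the "Any" row of Table 6
print(f"{lo:.1%} - {hi:.1%}")          # -> 97.8% - 98.7%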
The main category was correctly allocated in 98.75% of cases. It is worth noting that the tagging of verbs was 100% correct on the sample. This confirms previous studies by our team, which reported over 99% correctness for this category (e.g. Valli & Véronis, 1999). At the opposite end, the worst categories were adverbs and conjunctions, but this is hardly unexpected in French.
The matrix of confusion between categories is given in Table 7. One can see that most of the confusions occurred between (difficult) grammatical categories, for example:
• adverb vs. conjunction (si)
• adverb vs. preposition (avec)
• preposition vs. interjection/particle (voilà)
• pronouns vs. determiners (un, tout)
• relative pronoun vs. conjunction (que).
In a few cases, grammatical words were incorrectly tagged as a major category, mainly as a noun (e.g. pendant). Mistagging rarely occurred across major categories (ADJ, NOM, VER). The only errors concerned the difficult distinction, in French, between adjectives and nouns, since many words can be both:
quelqu'un qu'on sent de passionné/NOM:COM => should be ADJ:QUA
Table 7. Matrix of confusion between main categories
[Rows give the erroneously assigned category and columns the correct one (ADV, CON, DET, INT, NUM, PRE, PRO, XXX, ADJ, NOM, VER); the overall total is 41 main-category errors.]
In a few cases, the main category was right, but the sub-category was wrong (Table 8). Half of these cases involved the determiner des, which can be either a preposition fused with a definite article, equivalent to de + les (therefore coded DET:DEF, see above), or an indefinite article (DET:IND). Respective examples:
il sort des grandes écoles
il mange des pommes
Three other errors involved a confusion between proper and common noun when a given
form could be both, e.g. côtes/Côtes (as in Côtes de Beaune), salon/Salon (a city in Provence). As
opposed to the case of the determiner, this error will be easy to rectify in future versions, since the
capital letter, in spoken corpora, is a non-ambiguous clue marking proper names. The last case
involved confusion between present indicative and imperative (attends).
Table 8. Confusion matrix between subcategories
[Rows give the erroneously assigned subcategory (DET:DEF, NOM:COM, NOM:PRO, VER:IMP) and columns the correct one (DET:IND, NOM:COM, NOM:PRO, VER:IND); the overall total is 8 subcategory errors.]
Overall, an error in the tag occurred 49 times, either in the main category or in the sub-category, i.e. in 1.49% of cases. This corresponds to a precision of 98.51% on tags, a result that is consistent with previous evaluations conducted by our team: Valli & Véronis (1999) found a 97.9% precision in their experiment. The improvement is due to better dictionaries, better treatment of multi-word expressions and the taking into account of discourse particles (which were previously ignored).
While most multiword expressions from the sample were processed correctly, a total of 8 cases were not. These involved either multiword units that were not recognized (e.g. d'après [quelqu'un], un petit peu, which inexplicably were not included in the dictionary), or words which should not have been pasted together (the latter being more frequent, with a total of 6 cases). For instance, discuter de quoi was tagged discuter/VER:INF + de_quoi/INT instead of being tagged as three separate words.
3.7.5.3. Portuguese: tool and evaluation
In order to re-use the existing tools, the morphosyntactic annotation and the lemmatization of the corpus were performed as two different tasks.
The Morphosyntactic Annotation
The Portuguese team used Eric Brill's tagger (Brill 1993) [45], trained over a written Portuguese corpus of 250.000 words, morphosyntactically annotated and manually revised.
The initial tagset for written corpus morphosyntactic annotation covered the main POS categories
(Noun, Verb, Adjective, etc.) and the secondary ones (tense, conjunction type, proper noun and
common noun, variable vs. invariable pronouns, auxiliary vs. main verbs, etc.), but person, gender
and number categories were not included.
Specific Tags for a Spoken Corpus
Due to some of the spoken language characteristic phenomena and to the specific transcription
guidelines used in the C-ORAL-ROM project, it was necessary to adapt the tagset. We implemented
a post-tagger automatic process to account for the following cases:
(a) extra-linguistic elements; transcription: hhh; tag: EL;
(b) fragmented words or filled pauses; transcription: &(form); tag: FRAG;
(c) words and sequences impossible to transcribe; transcription: xxx, yyyy; tag: Pimp, Simp;
(d) paralinguistic elements, such as hum, hã and onomatopoeias; tag: PL.
In the cases described in (a), (b) and (c), the adopted specific transcription allowed for automatic tag
identification and replacement, through a post-tagger process. The same process was applied in the
cases described in (d), since there is a predictable finite list of symbols representing paralinguistic
elements. Onomatopoeias however needed manual revision.
Three other categories had to be added, but they did not allow for automatic post-tagging
replacement, since they correspond to forms that also belong to classic categories:
(e) discourse markers, such as pá, portanto, pronto; tag: MD
(f) discursive locutions, such as sei lá, estás a ver, quer dizer, quer-se dizer; tag: LD
(g) non classifiable forms, for words whose context does not allow an accurate classification; tag: SC
For the cases (e) and (f), forms like pronto and não sei, for instance, are automatically tagged as pronto\ADJ and não\ADV sei\Vpi, and there is no automatic post-tagging procedure that can decide whether or not they form a discursive locution. These cases required a manual revision (and frequent listening to the sequence).
For the cases in (g), it would be even more difficult (if not impossible) to tag or post-tag them automatically. Because it also works on statistical rules, the tagger will always try to classify these words according to those rules (note that it was chosen to tag all the forms whenever possible, trying to avoid the use of the SC tag, which rarely occurred).
Lemmatization of the Spoken Corpus
In order to accomplish this task, the Léxico Multifuncional Computorizado do Português Contemporâneo (henceforth LMCPC) [46] was used as the source for a lemmatization tool.
The LMCPC is a 26.443 lemma frequency lexicon with 140.315 wordforms, with a
minimum lemma frequency of 6, extracted from a 16.210.438 word corpus of contemporary
Portuguese. The lemma and its correspondent forms (including inflected forms and compounds) are
followed by morphosyntactic and frequency information.
The lemmas and wordforms are classified according to main POS categories, such as N (noun), V (verb), A (adjective), or others, namely F (foreign word), G (acronym/sigla), X (abbreviation).
[45] http://www.cs.jhu.edu/~brill
[46] The LMCPC (in English, Multifunctional Computational Lexicon of Contemporary Portuguese) is available via the internet at http://clul.ul.pt/english/sectores/projecto_lmcpc.html/
The lemmatization of the C-ORAL-ROM spoken corpus comprised two major tasks: the
formatting of the LMCPC data and the construction of a tool to extract the lemma from the lexicon.
Unfortunately, at the beginning of this process, we were not able to use the POS information present
in the LMCPC to improve the lemma selection process.
The lemmatization tool developed turned out to be very simple. It consisted of a Perl script that extracts the lemma for each token of the corpus from the LMCPC data file: each form of the corpus is looked up in the lexicon, and the corresponding lemma(s) is (are) found and placed next to the form.
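A minimal sketch of this lookup step (in Python here, while the actual tool was a Perl script; the lexicon layout is simplified). The "-" placeholder corresponds to the \-\ convention used below for non-lemmatized words:

def load_lexicon(pairs):
    """pairs: iterable of (wordform, lemma); a form may have several lemmas."""
    lexicon = {}
    for form, lemma in pairs:
        lexicon.setdefault(form, []).append(lemma)
    return lexicon

def lemmatize(tokens, lexicon):
    """Attach to each form all candidate lemmas, '-' when none is found."""
    return [(tok, lexicon.get(tok.lower(), ["-"])) for tok in tokens]

lex = load_lexicon([("processo", "PROCESSO"), ("processo", "PROCESSAR")])
print(lemmatize(["com", "este", "processo"], lex))
# [('com', ['-']), ('este', ['-']), ('processo', ['PROCESSO', 'PROCESSAR'])]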
In the case of multiword expressions, since the lemma is the entire set of elements, there was no correspondence between the desired result and the LMCPC data. Therefore, it was necessary to develop a tool to automatically compose the desired lemma format from a given list of locutions. The final format of the lemmatization of a locution is given below:
(1) o\O_QUAL\LPRON qual\O_QUAL\LPRON
Since it is possible for a wordform to be attributed several lemmas (the CORLEX corpus, for instance, has a percentage of homographic words of 34%) and due to the problems concerning locutions (overlapping of words that can pertain to different kinds of locutions (em cima\LADV vs. em cima de\LPREP) and the distinction between locutions and independent word groupings), a manual lemma revision was strictly required, together with the tagging one.
Specific lemmatization choices
It was decided that some categories would be left without a lemma, namely Proper Nouns, Para-linguistic elements and Extra-linguistic elements.
Lemmas are given in the masculine gender, as is common practice. However, some cases
received a masculine and a feminine lemma: Articles; Indefinites; Demonstratives; Possessives;
Personal Pronouns; Clitics; Cardinal Numerals (os\O\ARTd, as\A\ARTd).
For some verbs, whenever their reflexive use implies a change in the semantic functions of their
arguments, it was considered that the lemma includes the reflexive pronoun, which had to be added
manually:
(2) A Ana lembrou\LEMBRAR\Vppi a mãe da sua consulta médica.
(3) A Ana lembrou-se\LEMBRAR_SE-SE\Vppi-CL de telefonar à mãe.
(4) a Ana não se\SE\CL lembrou\LEMBRAR_SE\Vppi de telefonar à mãe.
In Portuguese, whenever adverbs derived with the suffix -mente are coordinated, the first adverb loses the suffix, surfacing in its adjectival form. In these cases, despite this adjectival form, our option was to lemmatize and tag it as an adverb:
(5) pura\PURAMENTE\ADV e simplesmente\SIMPLESMENTE\ADV.
Effectiveness of the tagging and error rate
Considering the introduction of new categories, and despite the tagset length and the type of training corpus (written), the tagger achieved a success rate of 91,5%, excluding the tagging of MD and of any kind of locution (due to the problems explained above).
Afterwards, the Portuguese team performed a manual revision of 231.540 words and decided to attempt the training of the tagger over a subset of this subcorpus (with 184.153 words). However, this training proved to be ineffective since, from an empirical appreciation, the errors tended to increase enormously (we did not establish a precise error rate for this task).
Given the ineffectiveness of the training with the tagged spoken subcorpus, it was decided to proceed with the annotation of the remaining 87.052 words using the tool trained on the written corpus, the same one used before, while improving the post-tagger skills.
In the final calculation of the recognition rate of the tagger and the lemmatizer together, all kinds of locutions and discourse markers (MD) were considered. This implied a decrease of the previous recognition rate from 91,5% (where those tags were not considered) to 88%.
Given this unexpected and undesirably low rate of success of the tagger, it became imperative to carefully observe the most typical errors performed by the tool.
Considering the distinction between tag errors and lemma errors, it was observed that the errors regarding POS tagging occurred with a higher percentage (74,5% of the errors) than the errors regarding lemmatisation (64,8%). It is worth mentioning that 40% of the errors concerned both tag and lemma.
Taking into account, firstly, the errors regarding the POS tag, a high percentage of the errors occurred, as expected (since the tagger did not include this tag), in the annotation of discourse markers (18% of the total errors; 25% of tag errors; 2% of the corpus). The annotation of locutions also increased the error rate, since they represent 29% of the total errors (39% of tag errors; 4% of the corpus). It is important to underline once again that most of the locutions consist of discursive locutions, which are extremely frequent in oral discourse and particularly difficult to predict and, therefore, to tag automatically. Finally, a significant percentage of errors regarding subcategory tagging was also observed (tense and mood for verbs; common and proper for nouns; subordinative and coordinative for conjunctions): 14% of total errors; 10% of tag errors; 1% of the corpus. These 3 types of errors constitute 77,5% of the tagging errors.
Observing now the errors that occurred in the lemmatization process, we can conclude that the majority of the errors concerns the lemmatization of locutions (38% of total errors; 58% of lemmatization errors; 5% of the corpus). A relevant percentage of errors regarding the gender of the lemma was also observed (11% of total errors; 16% of lemmatisation errors; 1% of the corpus). These 2 types of errors constitute 74,5% of the lemmatisation errors.
Taking into account all these facts, and bearing in mind that the errors regarding the tagging and lemmatization of discourse markers and locutions were the ones which constituted a particular problem for the tool (note that in the first training corpus, where these elements were not considered, we achieved a success rate of 91,5%), if we excluded these errors from the error rate we would have a success rate of 96,8% for the lemmatizer tool and of 96,7% for the tagger. It is still worth mentioning that, as is normal whenever there is human intervention, and since it is not possible to perform manual revisions indefinitely, some errors may have been introduced at this level, not only mistyping errors but also classification ones.
The remaining 87.052 word subcorpus was not exhaustively revised, but some manual revision was
performed:
– All the multiword expressions (Locutions);
– All the past participles – since we distinguish the form of compound tenses from the other uses of participles
(which can be ambiguous with an adjective form);
– All the que forms – since it can have several uses and a description of its use was foreseen;
– The most common discourse marker forms were looked for – since the tagger could not identify them as a MD:
assim, bem, bom, claro, digamos, enfim, então, exacto, olha, olhe, pá, percebe, percebes, pois, portanto, pronto, sabe, sabes, sim
– Non-lemmatized words – since, as will be described below, it was highly probable that most words that were not lemmatized (\-\) would have received a wrong tag;
– All the clitic forms – forms like o, a, os, as can be a clitic, a definite article or (in the case of a) a preposition. Also the form se can be a clitic or a conjunction;
– The forms a(s) and o(s) – for the reasons given above;
– The form como – it can be a verb form, an adverb, a conjunction or an interrogative;
– The form até – it can be an adverb or a preposition.
As far as lemmatization is concerned, for the remaining 87.052 words, we were able to improve the
lemmatizer in order to cross the POS information of the annotated corpus with the POS information
of LMCPC. This means that, for ambiguous forms, this tool was already able to select the
corresponding lemma for a given tag.
For instance, for an example like (6), the preceding tool would provide two lemmas for the word processo, which is homographic between a noun and the first person singular of the present indicative of the verb processar (7).
(6) com este processo
(7) com este processo\PROCESSAR,PROCESSO\Nc
However, since the improved tool correctly tags the form as a noun, it is also able to choose the right lemma, as we can see in (8).
(8) com este processo\PROCESSO\Nc
In the final manual revision, this allowed us to check, with a high level of accuracy, both lemma and POS tagging: if a form did not receive a lemma, it would necessarily have been mistagged, as, for example, the word evoluem in (9), which received the tag of an adjective instead of the verb tag:
(9) e\E\CONJc também\TAMBÉM\ADV aqui\AQUI\ADV / as\A\ARTd coisas\COISA\Nc evoluem\-\ADJ //
Finally, the wordforms of the lemmas SER and IR were verified too, since they have homographic forms and, although the tagger was able to propose a lemma for them, it could have been wrong.
(10) O João foi\IR\Vppi ao cinema.
(11) O João foi\SER\Vppi bombeiro.
Some options of the Portuguese team
It is worth mentioning some options that were considered concerning the non-assignment of lemmas or tags relative to some phenomena of spoken language:
a) wordforms without lemma but with POS tag:
Some non-existing words that result from a lapsus linguae still preserve a clear POS function (usually they result from the crossing of several words, and are corrected by the speaker):
(12) ou o lugal\-\Nc / um lugar\LUGAR\Nc no palco da vida //$ (pfammn02)
(13) o &feni / o &feni / o femininismo\-\Nc / o feminismo\FEMINISMO\Nc (pfamdl22)
(14) tivesse pordido\-\VPP / &eh / podido\PODER\VPP / &eh / pronunciar-se (pnatpd03)
Even when the speaker does not self-correct, it is possible to assign a tag to the wordform:
(15) ainda que toda a carga negativa desabate\-\Vpc sobre vocês //$ (pnatpr01) [47]
Speakers may also produce wordforms that do not obey normative use and, therefore, are not registered in Portuguese dictionaries.
(16) alguma função catársica\-\ADJ [48] (pnatla02)
However, when speakers produce wordforms that were deliberately created by them, these non-registered words receive a lemma.
(17) estendeu / este princípio abandonatório\ABANDONATÓRIO\ADJ / a algumas outras regras da reforma fiscal (pnatps01)
There are cases where the context does not provide enough information to lemmatize the forms that are ambiguous between the verbs ser and ir. In these cases, the forms receive a tag (it is possible to decide the tense and mood of the verb), but not a lemma.
(18) o Napoleão de / &tai foi\-\Vppi / teve / tentou / o grande império dele (pfamcv09)
(19) os sócios / &eh / pronto / foram\-\Vppi / fizemos o quarteto // (pmedin03)
[47] Note that the word desabate does not exist in the Portuguese language. It seems to result from a crossing of desabar with abater (semantically related verbs).
[48] Note that the correct form is catártica.
b) wordforms with lemma but without POS tag (\SC):
It may also happen that ambiguous wordforms clearly have a lemma but do not receive a POS tag, because the context does not allow us to classify them: the wordform a always features the lemma A, whether it is an article, a preposition or a clitic, and the wordform que always has the lemma QUE, whether it is a conjunction or a relative element.
(20) e até ficou a\A\SC / com a / com a / com a ponta do sapato (pfammn02)
3.7.5.4. Spanish: tool and evaluation
For the morphological analysis we have used GRAMPAL (Moreno 1991; Moreno and Goñi 1995), which is based on a rich morpheme lexicon of over 40.000 lexical units and on morphological rules. This system has been successfully used in language engineering applications such as ARIES (Goñi, González and Moreno 1997) and also in linguistic description (Moreno and Goñi 2002). Originally, GRAMPAL was developed for analysing written texts. The present tagging has been the most useful test of the ability of GRAMPAL to deal with a wide-coverage corpus of Spanish. We used this application as an opportunity to enhance GRAMPAL with new modules: a POS tagger and an unknown-words recognizer, both specifically developed for spoken Spanish.
GRAMPAL is theoretically based on feature unification grammars and was originally implemented in Prolog. The system is reversible: the same set of rules and the same lexicon are used for both analysis and generation of inflected wordforms. It is designed to allow only grammatical forms. In other words, the most salient feature of this model is its linguistic rigour, which avoids both over-acceptance and over-generation.
The analysis provides a full set of morphosyntactic information in terms of features: lemma, POS, gender, number, tense, mood, etc. In order to be suitable for tagging the C-ORAL-ROM corpus, a number of developments have been introduced in GRAMPAL, reported in Moreno & Guirao (2003):
1. a new tokenization for the spoken corpus;
2. a set of rules for derivative morphology.
Tokenization in spoken corpora is slightly different from the same task in written corpora. Neither sentence nor paragraph boundaries make sense in spontaneous speech. Instead, dialog turns and prosodic tags are used for identifying utterance boundaries.
For disambiguation, specific features of spoken corpora directly affect the tagger: repetition and retracting produce agrammatical sequences; subsentential fragments are frequent; there is a more relaxed word order. Finally, there are no punctuation marks. All these characteristics force us to adapt a POS tagger typically trained on written texts.
Fortunately, Proper Name recognition is not a problem for C-ORAL-ROM, since only names are transcribed with a capital letter. As a consequence, analyzing them is a trivial task.
On the lexical side, we detected two specific features with respect to written corpora: a low presence of new terms (i.e. the lexicon used by speakers in spontaneous conversations is mostly common and basic) and a high frequency of derivative prefixes and suffixes that do not change the syntactic category, because most of them are appreciative morphemes.
In order to handle the recognition of derivatives, GRAMPAL has been extended with derivation rules. The Prefix rule is: take any prefix and any (inflected) word and form another word with the same features.
This rule is effective for POS tagging since in Spanish prefixes never change the syntactic
category of the base. The rule assigns the category feature to the unknown word. 239 prefixes have
been added to the GRAMPAL lexicon.
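A sketch of the prefix rule follows (the prefix list and lexicon entries below are toy examples, not the 239-prefix GRAMPAL resource):

PREFIXES = ["super", "re", "anti"]
LEXICON = {"bueno": {"cat": "ADJ", "gender": "m", "number": "s"}}

def analyze_unknown(word):
    """Try to analyze an unknown word by stripping a known prefix."""
    for prefix in PREFIXES:
        if word.startswith(prefix):
            base = word[len(prefix):]
            if base in LEXICON:
                return dict(LEXICON[base])   # inherit the base's features
    return None

print(analyze_unknown("superbueno"))         # {'cat': 'ADJ', ...}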
POS disambiguation has been solved using a rule-based model: specifically, an extension of a Constraint Grammar using features in context-sensitive PS rules. The output of the tagger is a feature structure written in XML.
The formalism allows several types of context-sensitive rules. First in application are the
lexical ones: those for a particular ambiguous word.
"word" Æ <cat="X"> / _ <cat ="Y">
"word" Æ <cat="Z"> / <cat ="W"> _
69
where a given ambiguous word is assigned an X category before a word with a Y category, or the Z
category after a W category.
Any kind of feature may be taken into account, not only the category. For instance, we can face
the problem of two ambiguous verbs belonging to different lemmas:
"word" Æ <lemma = "L"> / _ <cat="X">
"word" Æ <lemma = "M"> / _ <cat="Y">
In addition to features, strings and punctuation marks can be specified in the RHS of the context-sensitive rule:
"word" → <cat="X"> / string _
"word" → <cat="Y"> / # _
where string is any token, and # is the symbol for the start or end of an utterance.
If no lexical rule exists for a given ambiguity, then more general, syntactic rules are applied:
<cat="N">,<cat="V"> → <cat="N"> / <cat="V"> _
where, if a given word is analysed with two different tags, one as a noun (N) and the other as a verb (V), then the analysis with the N category is chosen if the word appears after a word with a V category. In short, these syntactic rules apply when there is no specific rule for the case (either because it is a new ambiguity not covered by the grammar, or because the grammar writer did not find a proper way to describe the ambiguity).
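The following Python sketch illustrates the two rule types and their ordering (lexical rules first, then the general syntactic ones); the encoding of rules as tuples is an assumption made for the illustration:

def disambiguate(tokens, analyses, lexical_rules, syntactic_rules):
    """analyses: per-token list of candidate feature dicts; returns one each."""
    chosen = []
    for i, (word, cands) in enumerate(zip(tokens, analyses)):
        if len(cands) == 1:
            chosen.append(cands[0]); continue
        # simplification: right context looks at the next token's first candidate
        nxt = analyses[i + 1][0]["cat"] if i + 1 < len(tokens) else "#"
        prv = chosen[i - 1]["cat"] if i > 0 else "#"
        # 1. lexical rules:  word -> cat / left _ right  ("*" = any context)
        for w, cat, left, right in lexical_rules:
            if w == word and left in (prv, "*") and right in (nxt, "*"):
                cands = [c for c in cands if c["cat"] == cat] or cands
                break
        else:
            # 2. syntactic rules: prefer category X after a word of category Y
            for x, y in syntactic_rules:
                if prv == y and any(c["cat"] == x for c in cands):
                    cands = [c for c in cands if c["cat"] == x]
                    break
        chosen.append(cands[0])
    return chosen

toks = ["yo", "canto"]
cands = [[{"cat": "P"}], [{"cat": "V"}, {"cat": "N"}]]
print(disambiguate(toks, cands, [("canto", "V", "P", "*")], [("N", "V")]))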
This method benefits from the fact that the most frequent ambiguities for a given language are well known after training. As a consequence, many context-sensitive lexical rules can be written by hand or extracted automatically from the data. Figures on the tagger's performance are given in the Evaluation section below.
Finally, those words which did not undergo disambiguation are treated with TnT (Brants 2000), which assigns the POS following a statistical model obtained from a 50000-word training corpus. It has been proven that TnT is the most precise statistical tagger (Bigert, Knutsson, Sjöberg 2003).
Electronic vocabulary. The GRAMPAL lexicon is a collection of allomorphs for both stems and endings. New additions can be easily incorporated, since every possibility in Spanish inflection has been classified into a particular class.
In an experiment reported by Moreno & Guirao (2003), 8% of the corpus consisted of words unknown to the system, of which:
1. foreign words: walkman, parking;
2. missing words in the lexicon, typically from the spoken language: caramba, hijoputa;
3. errors in transcription;
4. neologisms, mostly derivatives.
Rules for handling derivative morphology have been shown in the previous paragraph. For the remaining three classes of unknown words, a simple approach is adopted:
a. foreign words are included in a list, updated regularly;
b. any word in the corpus but not in the lexicon is added, expanding the base resource;
c. errors in the source texts are corrected, and then analysed by the tool.
As a summary, the tagger procedure, consisting of seven parts, is described below:
1. Unknown word detection: once the tokenizer has segmented the transcription into tokens, a quick look-up for unknown words is run. The new words detected are added to the lexicon.
2. Lexical pre-processing: the program splits portmanteau words ("al", "del" → "a" "el", "de" "el") and verbs with clitics ("damelo" → "da" "me" "lo").
3. Multi-word recognition: the text is scanned for multi-word candidates. A lexicon, compiled from printed dictionaries and corpora, is used for this task.
4. Single-word recognition: every single token is scanned for every possible analysis according to the morphological rules and lexicon entries. Approximately 30% of the tokens are given more than one analysis, and some of them are assigned up to 5 different analyses.
5. Unknown word recognition: the remaining tokens that are not considered new words pass through the derivative morphology rules. If some tokens still remain without any analysis (because they were neither included in the lexicon nor recognised by the derivative rules), they wait until the statistical processing, where the most probable tag, according to the surrounding context, is given.
6. Disambiguation phase 1: a feature-based Constraint Grammar solves some of the ambiguities.
7. Disambiguation phase 2: a statistical tagger (the TnT tagger) solves the remaining ambiguous analyses.
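As an illustration of step 2, a sketch of the splitting of portmanteau words and enclitics (toy tables; GRAMPAL's own resources are far larger):

PORTMANTEAU = {"al": ["a", "el"], "del": ["de", "el"]}
CLITICS = ["me", "te", "se", "lo", "la", "le", "nos", "los", "las", "les"]

def split_token(token, verb_stems):
    """Split a portmanteau word, or peel enclitics off a verb form."""
    if token in PORTMANTEAU:
        return PORTMANTEAU[token]
    parts, rest = [], token
    while rest not in verb_stems:
        for cl in CLITICS:
            if rest.endswith(cl) and len(rest) > len(cl):
                parts.insert(0, cl)
                rest = rest[:-len(cl)]
                break
        else:
            return [token]               # no split found
    return [rest] + parts

print(split_token("del", {}))            # ['de', 'el']
print(split_token("damelo", {"da"}))     # ['da', 'me', 'lo']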
Guirao & Moreno-Sandoval (2004) describe a tool developed for helping human annotators to
revise and correct the tagged corpus.
Evaluation
The total number of units (assuming that single words and multiwords each count as one unit) in the test corpus (hand-annotated) [49] is 44144.
The test corpus has been developed using a combined procedure of automatic and human
tagging:
1. A fragment of approximately 50.000 words (15% of the corpus) was selected, taken from the
different sections and intended to be a representative sampling of the whole. Each word in the
27 texts was tagged with all possible analyses.
2. Each file has been revised by a linguist, who selected the correct tag for every case, discarding
the wrong ones.
3. From the revised corpus, a set of disambiguation rules were written for handling the most
frequent cases.
4. A new run of the tagger, augmented with the disambiguation grammar, provided an
automatically tagged corpus, with only one tag per unit.
5. The automatically and the manually tagged corpora have been compared, and the differences were noted one by one, assuming that agreement on the same tag implied a correct analysis. In most cases the wrong tag was assigned by the tagger, but in several cases the linguist wrote an incorrect tag; such mistakes are probably due to lapses of attention during a repetitive task. After assigning the proper tag in all the disagreements, a final version of the test corpus was delivered.
Both the disambiguation grammar and the statistical tagger have been trained against the test corpus. Finally, the rest of the corpus, over 250.000 words, was tagged as described above.
[49] The files from the C-ORAL-ROM corpus annotated by hand are the following: efamcv03; efamcv06; efamcv07; efamdl03; efamdl04; efamdl05; efamdl10; efammn05; emedin01; emdmt01; emednw01; emedrp01_1; emedsc04; emedsp01; emedts01; emedts05; emedts07; emedts09; enatco03; enatpd01; enatte01; epubcv01; epubcv02; epubdl07; epubdl13; epubmn01; etelef01
In order to evaluate the tagger's performance (including disambiguation), a new run of GRAMPAL against the test corpus was conducted. The mismatches between the GRAMPAL-tagged corpus and the test corpus, working as a gold standard for the evaluation, were counted (the figures are shown in Table 4). The precision rate is calculated as the number of correct tags assigned by the tagger divided by the total number of tagged units in the test corpus. In other words, 42206 tags out of 44144 were assigned correctly. No evaluation of the precision has been performed for the rest of the corpus, but a rate similar to the one obtained against the test corpus (95,61%) can be assumed.
With respect to the recall, understood as the ratio between the number of units tagged by the program and the total number of units, the figure for the whole corpus is 99,96%. Only 117 tokens were not given a tag by the program.
Table 4. Test corpus and whole corpus evaluation

                Number of units    Number of tagged units    Recall     Precision
Test Corpus               44144                     44144    100%         95,61%
Whole Corpus             313504                    313387    99,96%     ≅ 95,61%
With respect to the evaluation of the POS tagging, it is important to stress that only a subset of around 50.000 transcribed words was revised by hand, resulting in over 44.000 tagged words. The rest of the tagged corpus has not been revised by human annotators. This fact has some consequences for the lists of forms and lemmas. When two or more tags are available, the tagger always assigns the tag with the information shared between the candidates. For instance, many verb forms are ambiguous between the first and third person singular: (yo) cante / (él, ella) cante (I sing vs. he/she sings). The tags for the two readings are Vp1s and Vp3s, respectively. When the context cannot solve the ambiguity (by means of the pronoun), the tag assigned is V, compatible with both. The human annotators, however, can normally resolve the ambiguity while they are revising the tagging. In that case, the appropriate full tag is provided. As a result, different tags for the same word can be found in the lists of lemmas and forms.
4. XML Format of the textual resource
4.1. Macro for the translation of C-ORAL-ROM .txt files to .XML
The C-ORAL-ROM Macro is a Perl program which has two functions:
- first, that of validating the C-ORAL-ROM format, i.e. checking that all the texts in the corpus have the same format and that no typing errors have been made;
- second, that of generating XML files for the texts in the corpus. [50]
4.1.1. Checking the C-ORAL-ROM format
The C-ORAL-ROM format consists of a set of textual conventions. These conventions help to introduce the information which surrounds the recording, that is to say, everything which is not transcribed but is relevant: information on the participants, on the situation and on non-linguistic features that appear in the recording. In what has been called the "header" there are fifteen different fields which contain information about the participants, the date, the place, the situation, the kind of text, the topic, the transcribers or the length of the transcription. These fields all have the same format: an "@" followed by the name of the field and a ":". Some fields have a closed format, which means they can only include a set number of values, as occurs for example in the Participants field: this field must include three capital letters to define the speaker, the name of the speaker and then, between brackets and in this order, the sex (man/woman), age (A, B, C or D), education (1, 2, 3), profession and origin of the speaker. The age section cannot include anything which is not a capital A, B, C or D: the C-ORAL-ROM Macro enables the checking of this in all texts. The same applies to the signs representing different features inside the transcription. For example, the sign for overlapping is [<] <overlapped text>. The C-ORAL-ROM Macro will check, for example, that there is no space between the < and the overlapped text.
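As an illustration of one such check (an assumption about the Macro's internals, which are not published here), a regular expression can validate the closed part of a @Participants line, including the age letter:

import re

# @Participants: XYZ, Name, (sex, age, education, profession, origin, ...)
PARTICIPANT = re.compile(
    r"@Participants:\s*([A-Z]{3}),\s*([^,]+),\s*\((man|woman),\s*([A-D]),\s*([123]),")

def check_participants(line):
    return "ok" if PARTICIPANT.match(line) else "format error in @Participants"

print(check_participants(
    "@Participants: PAT, Patricia, (woman, B, 2, hairdresser, participant, Madrid)"))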
4.1.2. Generating XML files
Once we have checked that all the texts are compiled according to the C-ORAL-ROM format, the
program generates an XML file for each text. A typical XML file of a C-ORAL-ROM transcription
consists of an initial and an ending tag which includes the whole text (called <transcription>) and
two sub-sections: the header (marked by the tag <header>) and the transcribed text (marked by
<text>). In the header, each field will have its corresponding tag, in replacement of the name of the
field following the @, and each piece of information will have its own tag. Let's look at an example:
@Participants: PAT, Patricia, (woman, B, 2, hairdresser, participant, Madrid) will become, in XML:
<Participants>
<Speaker>
<Name>PAT, Patricia</Name>
<Sex Type="woman"/>
<Age Type="B"/>
<Education Type="2"/>
<Occupation>hairdresser</Occupation>
<Role>participant</Role>
<Origin>Madrid</Origin>
</Speaker>
</Participants>
[50] This program has been written by the Computational Linguistics Laboratory at the Universidad Autónoma de Madrid.
As for the symbols included in the text, an XML tag will replace them, as in the example:
*ROS: con los [/] con el walkman de / Chechu // # will become
<Turn>
<Name>ROS</Name>
<Says>con los <Tone_Unit Type="partial_restart"/> con el walkman de <Tone_Unit Type="standard"/> Chechu
<Utterance Type="enunciation"/>
<Pause/>
</Says>
</Turn>
where *ROS: means that the speaker whose name is ROS has started a new turn, and says what follows;
[/] signals a tone unit boundary where the speaker partially retracts; / signals a standard
tone unit boundary; // signals the end of an utterance; and # a pause. All this information is expressed in XML
by the different tags.
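The kind of substitution involved can be sketched as follows. This is an illustrative Perl fragment written for these specifications, not an excerpt of the actual Macro; it treats each prosodic symbol as a whitespace-delimited token so that // is never misread as two occurrences of /:

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: map a few C-ORAL-ROM prosodic symbols to XML tags,
# in the spirit of the example above (the real Macro is more elaborate).
my %map = (
    '[/]' => '<Tone_Unit Type="partial_restart"/>',
    '//'  => '<Utterance Type="enunciation"/>',
    '/'   => '<Tone_Unit Type="standard"/>',
    '#'   => '<Pause/>',
);

sub tag_symbols {
    my ($text) = @_;
    my @out;
    # work token by token, so symbols inside the generated tags are untouched
    for my $tok (split /\s+/, $text) {
        push @out, exists $map{$tok} ? $map{$tok} : $tok;
    }
    return join ' ', @out;
}

print tag_symbols("con los [/] con el walkman de / Chechu // #"), "\n";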
4.2 Running the program
4.2.1. Requirements and procedure
The C-ORAL-ROM Macro requires a Perl distribution, and runs both under Unix and Windows.
To run the program, all the texts to be checked and converted must be in the same directory
as the program itself; we then type

perl xml-dtd_coralrom.pl <file_name>

where <file_name> is the text to be processed; efamdl01.txt is only an example (in its place we
can give any text we want to check and convert). The program will then generate a file with the
same name but with an .xml extension.
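To process a whole directory in one go, a loop of the following kind can be used; this is a hypothetical invocation written for these specifications (it assumes a Unix-style shell, and quoting differs under Windows):

perl -e 'system("perl", "xml-dtd_coralrom.pl", $_) for glob "*.txt";'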
4.2.2. Rectifying errors
While running the program, the following lines will appear every time we have a well-formed text:
Preprocessing Document...
Processing Document Main Structure...
Processing Header Data Fields...
Processing Text Data Fields...
Turn Preprocessing Done...
Processing Turn Structure...
Processing InTurn Data Structure...
File correct: efamdl01.txt --> Output file: efamdl01.xml
If the program finds some kind of mistake in the text, it shows an error message with
information on where the mistake is. For example, if we write "lenght" instead of "length",
the program will report a mistake in the Length field; or, if we write F in the age field,
the program will report a mistake in the Age section, as only A, B, C and D are allowed. In
the case of mistakes inside the transcribed text, the program shows the text surrounding the
mistake.
To rectify the mistake, we go to the transcribed text (the .txt file), correct it, and run
the program again to see whether it detects further mistakes or generates the XML file (the program will not
generate an XML file until the text is free of mistakes).
4.3. C-ORAL-ROM DTD
<!-- Version 3.0 DATE: 01 Sept 2004 -->
<!-- Declaration XML type document -->
<!-- Declaration of external DTD document. Address contains associated XML file -->
<!ELEMENT Transcription (Header, Text)>
<!-- Start of Header relevant tags declaration -->
<!ELEMENT Header (Title, File, Participants, Date, Place, Situation, Topic, Source, Class+,
Length, Words, Acoustic_quality, Transcribers, Revisors, Comments)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT File (#PCDATA)>
<!ELEMENT Participants (Speaker+)>
<!ELEMENT Speaker (ShortName, FullName,Sex, Age, Education, Occupation, Role, Origin)>
<!ELEMENT ShortName (#PCDATA)>
<!ELEMENT FullName (#PCDATA)>
<!ELEMENT Sex (#PCDATA)>
<!ELEMENT Age (#PCDATA)>
<!ELEMENT Education (#PCDATA)>
<!ELEMENT Occupation (#PCDATA)>
<!ELEMENT Role (#PCDATA)>
<!ELEMENT Origin (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Place (#PCDATA)>
<!ELEMENT Situation (#PCDATA)>
<!ELEMENT Topic (#PCDATA)>
<!ELEMENT Source (#PCDATA)>
<!ELEMENT Length (#PCDATA)>
<!ELEMENT Words (#PCDATA)>
<!ELEMENT Transcribers (Transcriber+)>
<!ELEMENT Transcriber (#PCDATA)>
<!ELEMENT Revisors (Revisor+)>
<!ELEMENT Revisor (#PCDATA)>
<!ELEMENT Acoustic_quality EMPTY>
<!-- Other, more meaningful names can be used for each SubClass -->
<!ELEMENT Class (#PCDATA)>
<!ELEMENT Comments (#PCDATA | Notes)*>
<!-- Text is defined as one or more turns of DIFFERENT persons' speaking -->
<!ELEMENT Text (Turn+)>
<!ELEMENT Turn (Name, Says, Notes*)>
<!-- End of Header Definitions
Start of Text Tags Definitions -->
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Says (#PCDATA | Tone_Unit | Continues | Utterance | Support | Non_Linguistic |
Unintelligible | Fragment | Pause | Overlap | Backwards | Forwards | Interjection | Notes)*>
<!-- Tonal Units Tags-->
<!ELEMENT Tone_Unit (#PCDATA)>
<!-- Limits each tonal unit -->
<!-- If turn continues after overlapping -->
<!ELEMENT Continues (#PCDATA)>
<!-- Limits each utterance -->
<!ELEMENT Utterance (#PCDATA)>
<!-- Interjection Tags-->
<!ELEMENT Support (#PCDATA)>
<!-- Syllabic support possible -->
<!ELEMENT Non_Linguistic (#PCDATA)>
<!-- Non-linguistic sounds -->
<!-- Reconstruction Tags-->
<!ELEMENT Unintelligible (#PCDATA)>
<!-- When it is unknown what has been said -->
<!ELEMENT Fragment (#PCDATA)>
<!-- Literally written-->
<!-- Pause Tags-->
<!ELEMENT Pause (#PCDATA)>
<!-- Means a noticeable pause, not a suspension nor a restart -->
<!-- Overlapping Tags-->
<!ELEMENT Overlap (#PCDATA | Tone_Unit | Continues | Utterance | Support | Non_Linguistic |
Unintelligible | Fragment | Pause | Interjection | Notes)*>
<!-- Overlapping direction -->
<!ELEMENT Backwards (#PCDATA)>
<!ELEMENT Forwards (#PCDATA)>
<!-- Interjections -->
<!ELEMENT Interjection (#PCDATA)>
<!-- Syllabic support possible -->
<!-- Comments -->
<!ELEMENT Notes (#PCDATA | Tone_Unit | Continues | Utterance | Support | Non_Linguistic |
Unintelligible | Fragment | Pause | Interjection)*>
<!-- Attributes Definitions -->
<!-- Notes -->
<!ATTLIST Notes
Type CDATA #REQUIRED
>
<!-- Tone Unit -->
<!ATTLIST Tone_Unit
Type (standard | partial_restart | total_restart) #REQUIRED
>
<!-- Utterance -->
<!ATTLIST Utterance
Type (enunciation | interrogation | suspension | interruption) #REQUIRED
>
<!ATTLIST Sex
Type (man | woman | x) #REQUIRED
>
<!ATTLIST Age
Type (A | B | C | D | x | X) #REQUIRED
>
<!ATTLIST Education
Type (1 | 2 | 3 | x | X) #REQUIRED
>
<!ATTLIST Class
Type1 (informal | formal) #REQUIRED
Type2 (family_private | public | formal_in_natural_context | media | telephone)
#REQUIRED
Type3 (conversation | monologue | dialogue | political_speech | political_debate | preaching |
teaching | professional_explanation | conference | business | law | news | sport | interviews | meteo |
scientific_press | reportage | talk_show | private_conversation | human-machine_interactions)
#REQUIRED
Type4 (political_debate | thematic_discussions | culture | science | man_interaction |
machine_interaction | conversation | monologue | dialogue) #IMPLIED
Type5 (turism | health | meteo | traffic | train | restaurants) #IMPLIED
>
<!ATTLIST Acoustic_quality
Type (A | B | C) #REQUIRED
>
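For orientation, a minimal document conforming to this DTD might look as follows. All values are invented, the external DTD file name coralrom.dtd is hypothetical, and the free-text fields are kept as short as possible:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Transcription SYSTEM "coralrom.dtd">
<!-- Minimal illustrative instance; all values are invented -->
<Transcription>
  <Header>
    <Title>Example transcription</Title>
    <File>efamdl01</File>
    <Participants>
      <Speaker>
        <ShortName>PAT</ShortName>
        <FullName>Patricia</FullName>
        <Sex Type="woman"/>
        <Age Type="B"/>
        <Education Type="2"/>
        <Occupation>hairdresser</Occupation>
        <Role>participant</Role>
        <Origin>Madrid</Origin>
      </Speaker>
    </Participants>
    <Date>not dated</Date>
    <Place>Madrid</Place>
    <Situation>conversation at home</Situation>
    <Topic>everyday life</Topic>
    <Source>C-ORAL-ROM</Source>
    <Class Type1="informal" Type2="family_private" Type3="dialogue"/>
    <Length>10'</Length>
    <Words>1500</Words>
    <Acoustic_quality Type="A"/>
    <Transcribers><Transcriber>XXX</Transcriber></Transcribers>
    <Revisors><Revisor>YYY</Revisor></Revisors>
    <Comments/>
  </Header>
  <Text>
    <Turn>
      <Name>PAT</Name>
      <Says>con los <Tone_Unit Type="partial_restart"/> con el walkman de
        <Tone_Unit Type="standard"/> Chechu <Utterance Type="enunciation"/> <Pause/></Says>
    </Turn>
  </Text>
</Transcription>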
5. Bibliographical references
Austin, J. L. 1962. How to do things with words. Oxford: Oxford University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. 1999. The Longman grammar of spoken
and written English. London: Longman.
Bigert, J., Knutsson, O. and Sjöbergh, J. 2003. "Automatic Evaluation of Robustness and Degradation
in Tagging and Parsing". In Proceedings of the International Conference on Recent Advances in Natural
Language Processing (RANLP 2003) – September 2003, Borovets, Bulgaria, 51-57.
Brants, T. 2000. "TnT: A statistical part-of-speech tagger". In Proceedings of the 6th Applied NLP
Conference (ANLP-2000) – April 29-May 3, 2000, Seattle, WA.
Brill, E. 1994. "Some advances in transformation-based part-of-speech tagging". In Proceedings of the
Third International Workshop on Parsing Technologies, Tilburg, The Netherlands.
Calzolari, N., Ceccotti, M. L. and Roventini, A. 1983. "Documentazione sui tre nastri contenenti il
DMI". Technical Report. Pisa: ILC-CNR.
CHAT: http://childes.psy.cmu.edu/manuals/CHAT.pdf
Cresti, E. 1994. "Information and intonational patterning in Italian". In Accent, intonation, et modèles
phonologiques, B. Ferguson, H. Gezundhajt and Ph. Martin (eds.), 99-140. Toronto: Editions Mélodie.
Cresti, E. 2000. Corpus di italiano parlato, voll. I-II, CD-Rom. Firenze: Accademia della Crusca.
Crystal, D. 1975. The English tone of voice. London: Edward Arnold.
De Mauro, T. et alii. 1993. Lessico di frequenza dell'italiano parlato. Milano: ETAS libri.
Goñi, J. M., González, J. C. and Moreno, A. 1997. "ARIES: A lexical platform for engineering Spanish
processing tools". Natural Language Engineering 3(4): 317-345.
Guirao, J. M. and Moreno-Sandoval, A. 2004. "A 'toolbox' for tagging the Spanish C-ORAL-ROM
corpus". In Proceedings of the Workshop "Compiling and Processing Spoken Language Corpora",
LREC-2004, Lisbon.
IMDI: http://www.mpi.nl/IMDI/
Karcevsky, S. 1931. "Sur la phonologie de la phrase". In Travaux du Cercle linguistique de Prague IV,
188-228.
MacWhinney, B. 1994. The CHILDES project: tools for analyzing talk. Hillsdale, New Jersey:
Lawrence Erlbaum Associates.
Miller, J. and Weinert, R. 1998. Spontaneous Spoken Language. Oxford: Clarendon Press.
Moreno, A. 1991. Un modelo computacional basado en la unificación para el análisis y generación de
la morfología del español. PhD thesis, Universidad Autónoma de Madrid.
Moreno, A. and Goñi, J. M. 1995. "A morphological model and processor for Spanish implemented in
Prolog". In Proceedings of the Joint Conference on Declarative Programming (GULP-PRODE 95) –
Marina di Vietri, Italy, September 1995, M. Alpuente and M. I. Sessa (eds.), 321-331.
Moreno, A. and Goñi, J. M. 2002. "Spanish Inflectional Morphology in DATR". Journal of Logic,
Language and Information 11: 79-105.
Moreno, A. and Guirao, J. M. 2003. "Tagging a spontaneous speech corpus of Spanish". In Proceedings of the
International Conference on Recent Advances in Natural Language Processing (RANLP 2003) –
September 2003, Borovets, Bulgaria, 292-296.
Picchi, E. 1994. "Statistical Tools for Corpus Analysis: A Tagger and Lemmatizer of Italian". In
Proceedings of EURALEX 1994, W. Martin et alii (eds.), 501-510. Amsterdam.
PiSystem: http://www.ilc.cnr.it/pisystem/
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. 1985. A comprehensive grammar of the English
language. London: Longman.
't Hart, J., Collier, R. and Cohen, A. 1990. A perceptual study of intonation. An experimental-phonetic
approach to speech melody. Cambridge: Cambridge University Press.
Valli, A. and Véronis, J. 1999. "Etiquetage grammatical de corpus oraux: problèmes et perspectives".
Revue Française de Linguistique Appliquée IV(2): 113-133.
Zampolli, A. and Ferrari, G. 1979. "Il dizionario di macchina dell'italiano". In Linguaggi e
formalizzazioni. Atti del convegno internazionale di studi della SLI, D. Gambarara and F. Lo Piparo
(eds.), 683-707. Roma: Bulzoni.
APPENDIXES
APPENDIX 1 - TYPICAL EXAMPLES OF PROSODIC BREAK TYPES IN ITALIAN, FRENCH, PORTUGUESE
AND SPANISH51

Examples of strings with typical non-terminal breaks, generic terminal breaks and
interrogative breaks (highlighted in the examples).

51 The multimedia files corresponding to the examples are available in the directory Utilities of the DVD9 CORALROM_AP.
Italian
*ALE: [1] eh / io invece / mi sono preso una chitarra //$
*IDA: [2] ah / la chitarra //$
*ALE: [3] acustica //$ [4] bella //$
*IDA: [5] e l' obiettivo ?$
(ifamdl18_double slash2)
Portuguese
*JOS: esteve na baixa / recentemente ?$
*NOE: não //$ há muito tempo que não vou //$ acho que está muito gira //$ não é //$
*JOS: está a ficar muito bonita //$
*NOE: estive no Chiado / há pouco tempo //$
*JOS: hum <hum> //$
(pfamdl14_doubleslash3)
French
*EMA: [1] ça c' est clair // #$ [2] de plus en plus // #$
*JUL: [3] quels sont vos rapports avec les clients en général ? #$
*EMA: [4] très bons // #$ [5] sur la base / très bons // #$
(fpubdl03_double&single slash)
Spanish
*MIG: [<] < [1]¡joder!>$ // [2] te quejarás de tu hijo / <eh> ?$
*EST: [<] [3]<nada> / nada //$ [4] muy bien todo //$ [5]<fenomenal> //$
*MIG: < [6]¡joder!>$
(efamdl39_double slash1)
Examples of strings with typical intentional suspension
Italian
*MAR: [1]<allora è &propr > +$
*ROS: / [2] come un fuso //$ [3] icché vo' fare?$ [4] quando uno l' è ignorante ...$
*MAR: [5] eh / sie / <&ignoran> +$
(ifamdl07_suspension1)
Portuguese
*AMA: / [1] com aquela sensação de que eles vão ficar / naquela / &eh / fechados naquele / naquela cerca //$
*RAQ: [2] pois //$
*AMA: / [3] que é ...$
*RAQ: [4] eléctrica //$ [5] depois de apanharem <os choque todos> //$
*AMA: [6]<e que é horrível> //$
(Pfamcv10_suspension_2)
French
*SYL: [1] avant de quoi ? #$
*CHR: [2] ben de [/] de [/] de sortir ensemble //$ [3] enfin du temps / #$ [4] non // pas tant que ça // mais finalement
...$
*SYL: [5] j' ai commencé la formation le vingt-six novembre //$ [6] je m' en souviens // #$
(ffamdl03_ suspension1)
Spanish
*FER: [1] claro //$
*EVA: [2]¡ah! //$ [3] claro / digo si es tu primo / y no tienes / pueblo ...$ [4] claro //$ [5] hhh //$
*FER: [6] y tú qué hiciste / Jose ?$
(epubcv02_suspension 3)
Examples of strings with typical interruption
Italian
*SIM: # [1]&he / però # sarai controllato //$ [2] e dovrai spiegare / &he / qualsiasi natura dei tuoi +$
*ROD: [3] esperimenti //$
*SIM: [4] dei tuoi esperimenti //$
(ifamcv07_listener interruption)
*MAR: [1] ma no / perché lei / vuole una storia seria //$ [2] è perché forse / non c' è stata +$ [3] dico / tu / che perdi il
capo / che +$ [4] perché / a lui / gli piacciono i soldi / Ida //$ [5] questo / io lo so //$ [6] e / anche lui / dice / questo è
vero //$
*IDA: [7] lo ammette //$
(ifamdl20_interruption3)
Portuguese
*ACL: [1] também //$
*JOS: [2] olhe / como é que se chama aquela +$
*ACD: / [3] ou um albardão //$
(pnatpe01_listener interruption3)
*LUC: [1]<claro> //$
*GRA: / [2] agora / para mim +$ [3] preciso ver que eu tenho setenta e um <anos> //$
*LUC: [4] pois //$
(pfamcv03_interruption)
French
*VAL: [1] la même chose que l' amour passionné / #$
*CHA: [2]<mais>$
*VAL: [3]<enfin> //$
*CHA: [4] si // c' est le même à la base // #$ [5] mais il y a comme une +$
*ALE: [6] il a mûri // #$
(ffamcv01_listener interruption 1)
*SAN: [1] et donc enfin toi tu es de Caen / je crois #$
*EDO: [2] ah oui oui oui / de + ben mon frère aussi // #$
*SAN: [3] vous êtes$ [5]<de Caen> ?$
*EDO: [4]<on est nés>$
(ffamdl02_interruption1)
Spanish
*CRI: [1] me dijo / &eh / el Ministerio de Sanidad en Madrid / fue lo que +$
*LUI: [2] lo que os dijo //$
*CRI: [3] lo <que>+$
*ALB: [<] [4]<eso> era una / excursión organizada ?$
(efamcv15_listener interruption1)
*CRI: [<] < [1]¡ah! / claro> //$ [2] me llamaste desde el +$ [3] es que me [/] se fue a la sierra / y me llamó desde la
sierra //$
(efamdl08_interruption3)
Examples of strings with typical retracting
Italian
*MAX: [1] una tristezza //$ [2] ma proprio una cosa / <&f> +$
*CLA: [3]<ah> //$ [4] io l' ho trovato uno dei posti più belli / che si [/] che [/] che [///] in Italia //$
*GAB: [5]<Matera> //$
(ifamcv17_simple retracting)
French
*PER: [2]<ouais après>$
*JEA: [1]<tout est> +$
*PER: [3] la [/] la petite [/] le petit tour / c' était à Casablanca //$ [4] tu sais / la petite visite guidée / qui se termine à la
boutique des souvenirs //$ < [6]&euh>$
*STE: [5]<ah>$
(ffamcv10_simple retracting1)
Spanish
*PIE: / [1] que está perdiendo audiencia ya [/] audiencia ya / <el programa> //$
*PAZ: [<] [2]<no creas / tía> //$ [3] era [/] era el líder de audiencia / en [/] en Canarias / tía / y por ahí //$
*PIE: [4] sí //$
(efamdl33_simple retracting)
Portuguese
*ANT: [1] é evidente //$
*FER: / [2] o / o touro / realmente / estava a / estava a dar cabo do cavalo //$ [3] agora / <não é uma pessoa &iso> +$
(pfamdl07_simple retracting)
Examples of retracting/interruption ambiguity
Italian
*GUI: [1] fanno una cooperativa / per poter lavorare / perché / un camion / quando parte / deve essere [///] abbia una
buona assicurazione / prima di tutto //$
(ifammn22_retracting-interruption3)
French
*ALE: [1]<cinq ans toi> ?$
*CHA: [2]<moi deux ans> //$ [3] et une fois trois ans //$ [4] enfin là /$
*ALE: [5]<bon donc> /$
*CHA: [6]<trois ans>$
*ALE: [7] tout le monde a vécu [/] a eu l' expérience quoi // #$ [8] et$
*CHA: [9] yes // #$
*ALE: [10]<derrière> ?$
(ffamcv01_retracting-interruption_1)
Spanish
*NAN: < [1]¡joder! //$ [2] y qué pasa en el &tea> [/] y qué pasa en el teatro romano / que veía / cómo las [/] <los
animales / se comían a los cristianos> //$
*BEC: [3]<hhh //$ [4] afortunadamente +$ [5] pero por eso la &evo> [/] la [///] hemos evolucionado //$
*NAN: [6] pero tú crees que la [/] que / a lo largo de la historia / <la gente evoluciona> ?$
*BEC: < [7]&es [/] pues espero que sí> //$
(efamdl18_retracting-interruption2)
APPENDIX 2 - TAGSETS USED FOR POS TAGGING IN THE FOUR LANGUAGE
COLLECTIONS. DETAILED TABLES AND COMPARISON TABLE
French tagset
Table 1. French PoS tagset.

POS                  Sub-type                  TAG            Examples
Verbs                Conditional, present      VER:CON:PRE    aurait, serait, dirais
                     Imperative, present       VER:IMP:PRE    attends, allez, écoutez
                     Indicative, future        VER:IND:FUT    sera, aura, fera, seront, pourra, faudra
                     Indicative, imperfect     VER:IND:IMP    était, avait, faisait, fallait, disait, allait
                     Indicative, past          VER:IND:PAS    fut, vint
                     Indicative, present       VER:IND:PRE    est, a, ai, sont, va, ont, peut, fait, sais, suis
                     Infinitive                VER:INF        -
                     Participle, past          VER:PAR:PAS    fait, dit, été, eu, vu, pris, pu, mis
                     Participle, present       VER:PAR:PRE    étant, disant, maintenant, faisant, ayant
                     Subjunctive, imperfect    VER:SUB:IMP    fût, vînt
                     Subjunctive, present      VER:SUB:PRE    soit, ait, puisse, fasse, aille
Nouns                Common                    NOM:COM        heure, temps, travail, langue
                     Proper                    NOM:PRO        France, Marseille, Freud, Roosevelt
Adjectives           Ordinal                   ADJ:ORD        premier, deuxième, troisième
                     Qualifying                ADJ:QUA        petit, grand, vrai
Adverbs                                        ADV            ne, pas, oui, alors, très, pratiquement
Prepositions                                   PRE            de, à, pour, dans, sur
Conjunctions         Coordination              CON:COO        et, ou, mais
                     Subordination             CON:SUB        que, parce que, comme, quand, si
Determiners          Definite                  DET:DEF        le, la, les
                     Demonstrative             DET:DEM        ce, cette, ces
                     Indefinite                DET:IND        un, une, tout, quelques, plusieurs
                     Interrogative             DET:INT        quel
                     Possessive                DET:POS        mon, ma, ton, ta
Pronouns             Demonstrative             PRO:DEM        ce, ça, celui, cela, ceci
                     Indefinite                PRO:IND        un, une, tout, rien, quelqu'un
                     Personal                  PRO:PER        je, tu, il, elle, y, en, se
                     Possessive                PRO:POS        mien
                     Relative/interrogative    PRO:RIN        qui, que, où, quoi, dont, laquelle
Numerals                                       NUM            deux, trois, mille, cent
Interjections and
discourse particles                            INT            ben, bon, hein, mh, ah
Uncategorisable      Foreign word              XXX:ETR        check up, Eine Sache
                     Euphonic particle         XXX:EUP        -t-, l'
                     Title                     XXX:TIT        "Tapas_Café", "With_Full_Force",
                                                              "Retour_des_Vampires", "Fables_de_La_Fontaine"
Italian Tag set
Table 2.1 General structure of the Italian tag set.

ROOT classification           Secondary classification    Elements classified
linguistic elements
  standard (PoS tagset)       compositional               PoS
                              non-compositional           interjections (according to tradition, within PoS)
  non-standard (NS tagset)    compositional               foreign and new forms
                              non-compositional           onomatopoeia, language acquisition forms
non-linguistic elements
  (NL tagset)                 para-linguistic             fragmented words, phonetic supports, pause fillings
                              extra-linguistic            coughs and laughs
Table 2.2 Italian PoS tagset.

POS             Sub-Type      TAG    Definition                                                   Examples
Verb            main          V      semantic reference to a class of events; main                mangio\Vs1ip
                                     morphological features: mood, tense, (person,
                                     number, gender)
                non-main      VW     grammatical verbs (auxiliaries and copula)*                  era\VWs3ii
                + enclitic    V_E    verb form which includes a clitic                            vederlo\Vfp_E
Noun            common        S      semantic reference to a class of objects;                    albero\S
                                     morphological features: gender and number
                proper        SP     direct reference, absence of determiners                     Massimo\SP
Adjective                     A      semantic reference to a quality or property;                 grande\A
                                     morphological features: gender and number
Adverb                        B      modifier of the predicative elements (verbs,                 non\B
                                     adjectives); no inflection
Preposition     simple        E      adds semantic and syntactic features to a NP;                di\E
                                     no inflection
                fused         E_R    preposition fused with an article                            del\E_R
Conjunction     coord.        CC     establishes syntactic relationships among phrasal            e\CC
                                     constituents; no inflection
                subord.       CS     establishes syntactic relationships among sentences;         perché\CS
                                     no inflection
Article                       R      adds the feature [±definite] to a NP; morphological          il\R
                                     features: gender and number
Demonstrative                 DEM    see below*                                                   questo\DEM
Indefinite                    IND    see below*                                                   molti\IND
Personal                      PER    see below*                                                   io\PER
Possessive                    POS    see below*                                                   mio\POS
Relative                      REL    see below*                                                   cui\REL
Numeral         cardinal      N      number                                                       quattro\N
                ordinal       NA     numeral adjective                                            quarto\NA
Interjection                  I      non-compositional element; illocutive value                  ehi\INT
Table 2.3. Morphological information for Verbs: finite forms.

Verb: V (main), VW (non-main). The tag is composed as Verb + Number + Person + Mood + Tense
(+ _E for enclitics); e.g. mangio\Vs1ip = main verb, singular, first person, indicative, present.

Number        Person       Mood              Tense           (encl.)
singular  s   first    1   indicative    i   present     p   _E
plural    p   second   2   subjunctive   c   past        r
              third    3   conditional   d   imperfect   i
                           imperative    m   future      f

Table 2.4. Morphological information for Verbs: non-finite forms.

Verb: V (main), VW (non-main).

Mood            Tense          Gender*          Number*         (encl.)
infinite    f   present    p   masculine    m   singular    s   _E
participle  p   past       r   feminine     f   plural      p
gerund      g                  common       n

* only for participles (e.g. vederlo\Vfp_E = main verb, infinite, present, with enclitic)
Table 2.5 Tags for non-standard elements.

Non-standard element                     TAG        Examples
Compositional        Foreign forms       (PoS+)K    they\PERK
                     New formations      (PoS+)Z    torniante\SZ
Non-compositional    Acquisition forms   ACQ        cutta\ACQ
                     Onomatopoeia        ONO        zun\ONO
Table 2.6. Non-linguistic (NL) tag set.

Non-linguistic element       TAG    Examples
Paralinguistic               PLG    &he|PLG
Extralinguistic              XLG    hhh\XLG
Non-understandable words     X      xxx\X
Portuguese tagset
Table 3. Portuguese PoS tagset.
POS                                  Sub-type                              Tag       Examples
VERB                                                                       V
AUXILIARY VERB                                                             VAUX
                                     Present Indicative                    pi        sou, tenho, chamo
                                     Past Indicative                       ppi       fui, tive, quis, dormi
                                     Imperfect Indicative                  ii        estava, era, dizia, dormia
                                     Pluperfect Indicative                 mpi       fora, estivera, comera
                                     Future Indicative                     fi        serei, terei, direi
                                     Conditional                           c         seria, teria, dormiria
                                     Present Subjunctive                   pc        seja, tenha, chame, durma
                                     Imperfect Subjunctive                 ic        estivesse, dormisse
                                     Future Subjunctive                    fc        estiver, dormir
                                     Infinitive                            B         estar, ser, dormir
                                     Inflected Infinitive                  Bf        estares, seres, dormires
                                     Gerundive                             G         estando, chamando, sendo
                                     Imperative                            imp       come, dorme, trabalha
                                     Past Participle in Compound Tenses    VPP       comido, entregado
                                     Adjectival Past Participle            PPA       comido, entregue
NOUN                                 Proper Noun                           Np        Lisboa, Amália
                                     Common Noun                           Nc        casa, trabalho
ADJECTIVE                                                                  ADJ       feliz, rico, giro
ADVERB                                                                     ADV       só, felizmente, como
PREPOSITION                                                                PREP      a, de, com, para
CONJUNCTION                          Coordinative                          CONJc     e, mas, porque, ou
                                     Subordinative                         CONJs     que, porque, quando, como
ARTICLE                              Indefinite                            ARTi      uns, umas
                                     Definite                              ARTd      a, o, as, os
DEMONSTRATIVE                        Invariable                            DEMi      isso, isto
                                     Variable                              DEMv      essa, aquela, dito
INDEFINITE                           Invariable                            INDi      alguém, nada, algo
                                     Variable                              INDv      outro, todo
POSSESSIVE                                                                 POS       meu, teu
RELATIVE/INTERROGATIVE/              Invariable                            RELi      que, como, quando, quem
EXCLAMATIVE                          Variable                              RELv      cujo, quanto
PERSONAL PRONOUN                                                           PES       eu, tu, ele, nós
CLITIC                                                                     CL        se, a, o, me, lhe
NUMERAL                              Cardinal                              NUMc      dois, três
                                     Ordinal                               NUMo      primeiro, segundo, terceiro
INTERJECTION                                                               INT       ah, adeus, olá
ADVERBIAL LOCUTION                                                         LADV      em cima
PREPOSITIONAL LOCUTION                                                     LPREP     em cima de
CONJUNCTIONAL LOCUTION                                                     LCONJ     só que
PRONOMINAL LOCUTION                                                        LPRON     o que, o qual
EMPHATIC                                                                   ENF       lá, cá, agora
FOREIGN WORD                                                               ESTR      okay, mail
ACRONYM                                                                    SIGL      PSD, ACAPO
EXTRA-LINGUISTIC                                                           EL        hhh
PARA-LINGUISTIC                                                            PL        hum, hã, nanana
FRAGMENTED WORD OR FILLED PAUSE                                            FRAG      &dis, &eh
DISCOURSE MARKER                                                           MD        bom, pronto, pá, digamos
DISCURSIVE LOCUTION                                                        LD        digamos assim, quer dizer
WITHOUT CLASSIFICATION                                                     SC
WORD IMPOSSIBLE TO TRANSCRIBE                                              Pimp      xxx
SEQUENCE IMPOSSIBLE TO TRANSCRIBE                                          Simp      yyyy

(The verb sub-type codes attach to the V or VAUX tag, e.g. Vppi = verb, past indicative, as in
the hyphenation example below.)

Sub-Tags                                      TAG    Examples
Ambiguous forms                               :      um\ARTi:NUMc
Contracted forms                              +      da\PREP+ARTd
Hyphenated forms (excepting compounds)        -      viu-se\Vppi-CL
Spanish tagset
Table 4.1. Spanish PoS tagset.
POS                 Sub-type            TAG     Examples
Verb                                    V       cantar
                    Auxiliary           AUX     habrá cantado
Noun                                    N       mesa
                    Proper noun         NP      María
Adjective                               ADJ     azul
Adverb                                  ADV     así, aquí, allí
Preposition                             PREP    ante, bajo, con
Conjunction                             C       y, pero, ni
Determiner          Article             ART
                    Possessive          POSS    mi, tu, su
                    Demonstrative       DEM     ese, este
Pronoun                                 P       yo, tú, él
                    Relative            PR      que, lo_que
Quantifier                              Q       uno, dos, tres; primer, segundo; muchos, pocos
Interjection                            INTJ    madre mía, yuju
Discourse marker                        MD      oye, o sea, es decir
Table 4.3. Semantic, morpho-syntactic, and syntactic features of each POS.
N:      Semantics: denotes an element or object classification. Morphology: number, gender.
        Syntax: subject, direct object, complement. Example: actitudes\ACTITUD\NCfp
N (PR): Semantics: mono-referential. Morphology: fixed inflection.
        Syntax: absence of determiners. Example: Luisa\LUISA\Npi
ADJ:    Semantics: denotes qualities or properties. Morphology: number, gender.
        Syntax: noun complement, predicative complement. Example: extranjeros\EXTRANJERO\ADJmp
ART:    Semantics: restricts/defines the referent of a noun phrase. Morphology: number, gender.
        Syntax: prenominal position, no syntactic function. Example: los\EL\DETdmp
POSS:   Semantics: expresses a relation of possession or ownership. Morphology: number, gender.
        Syntax: prenominal (1st series) and postnominal (2nd series) positions. Example: mío\MÍO\DETposs
DEM:    Semantics: expresses location of the referent in space and time. Morphology: number, gender.
        Syntax: pre- and postnominal positions. Example: este\ESTE\DETdem
Q:      Semantics: expresses the number of individuals or objects. Morphology: varies depending on
        the kind of entity being quantified. Example: una\UN\Q
REL:    Semantics: retrieves the referent of the noun it modifies. Morphology: inherits its
        morphological features from the noun working as the referent.
        Syntax: different syntactic functions inside the clause it introduces. Example: que\QUE\PR
P:      Semantics: refers to a noun phrase. Morphology: person, number.
        Syntax: maximum expansion of the noun phrase. Example: yo\YO\PPER1s
Verb:   Semantics: expresses events. Morphology: person, number, tense, mood.
        Syntax: central element of the sentence; determines the different syntactic functions.
        Example: es\SER\Vindp3s
INTJ:   Semantics: expresses a mental state. Morphology: invariable.
        Syntax: not assigned. Example: ah\AH\INT
MD:     Semantics: guides the inferences which take place in conversation. Morphology: invariable.
        Syntax: not assigned. Example: es decir\ES DECIR\MD
C:      Semantics: establishes logical or discourse bounds. Morphology: invariable.
        Syntax: relates sentences or elements in a sentence. Example: pero\PERO\C
PREP:   Semantics: establishes semantic relationships associated with spatial concepts.
        Morphology: invariable. Syntax: establishes relationships between two elements.
        Example: de\DE\PREP
ADV:    Semantics: sets the meaning of the verb. Morphology: invariable.
        Syntax: does not introduce a second term. Example: también\TAMBIÉN\ADV
Synopsis tag sets
Table 5a. Synopsis of the PoS tag sets.

Tag-Set Projection     French     Italian    Portuguese    Spanish
nouns                  NOM        S          N             N
verbs                  VER        V          V             V
adjectives             ADJ:QUA    A          ADJ           ADJ
adverbs                ADV        B          ADV           ADV
prepositions           PRE        E          PREP          PREP
conjunctions           CON        C          CONJ          C
interjections          INT        I          INT           INT
discourse markers      -          -          MD            MD
emphatic               -          -          ENF           DIM

Table 5b. Synopsis of the PoS tag sets.

Tag-Set Projection                      French               Italian    Portuguese    Spanish
articles (definite determiners)         DET:DEF              R          ART           DETd
demonstrative determiners/pronouns      DET:DEM / PRO:DEM    DEM        DEM           DETdem
possessive determiners/pronouns         DET:POS / PRO:POS    POS        POS           DETposs
personal pronouns                       PRO:PER              PER        PES           PPER
clitics                                 -                    -          CL            -
rel-int-excl determiners/pronouns       DET:INT / PRO:RIN    REL        REL           PR
indefinite determiners/pronouns         DET:IND / PRO:IND    IND        IND           Q (quantifiers)
numbers (cardinals)                     NUM                  N          NUMc          Q
numerals (ordinals)                     ADJ:ORD              NA         NUMo          Q
Table 6. Synopsis of the morpho-syntactic encodings for verbs.

              French         Italian        Portuguese                    Spanish
MOOD          indicative     indicative     indicative                    indicative
              subjunctive    subjunctive    subjunctive                   subjunctive
              conditional    conditional    conditional                   conditional
              imperative     imperative     imperative                    imperative
              infinitive     infinitive     infinitive                    infinitive
              participle     participle     participle                    participle
                             gerund         gerundive                     gerund
                                            adjectival past participle
TENSE         present        present        present                       present
              past           past           past                          past
              imperfect      imperfect      imperfect                     imperfect
              future         future         pluperfect                    future
                                            future
PERSON        first, second, third (all languages)
NUMBER        singular, plural (all languages)
GENDER        -              masculine      masculine                     masculine
(only for                    feminine       feminine                      feminine
participles)                 common
VERB TYPE     -              main           main                          main
                             non-main       auxiliary                     auxiliary
Table 7a. Synopsis of the non-standard tag sets.

                                      French    Italian    Portuguese    Spanish
extralinguistic                       -         XLG        EL            <nl>
support & fillers / paralinguistic    -         PLG        PL            <sup>
fragments                             -         -          FRAG          -

Table 7b. Synopsis of the non-standard tag sets.

                            French     Italian    Portuguese    Spanish
foreign words               XXX:ETR    (PoS)+K    ESTR          -
new formations              -          (PoS)+Z    -             -
acquisition forms           -          ACQ        -             -
onomatopoeia                -          ONO        -             -
meaningless forms           -          -          -             -
euphonic particle           XXX:EUP    -          -             -
non-understandable words    -          X          Pimp          <nc>
APPENDIX 3 ORTHOGRAPHIC
TRANSCRIPTION CONVENTIONS IN THE FOUR
LANGUAGE CORPORA
During the construction phase of the project, it was agreed that a number of transcription conventions
would be common to all languages (speaker notation, hesitations, prosody annotation, etc.), but that each
team, within its national orthographic tradition, would keep its own conventions for all other features
regarding the orthographic transcription of spontaneous speech. The following are the main choices adopted
by each team.
Italian transcription conventions
In the transcription of the texts included in the C-ORAL-ROM Italian corpus, the use of capital letters has
been reserved to:
a. Christian names (Calvino; Paola),
b. Toponyms (Parigi, Prato),
c. Odonyms (via Burchiello, Porta Romana, Via Pian de' Giullari, Ponte alla Vittoria),
d. TV-programme names (Fantastico),
e. Film titles (Piccolo grande uomo, Il secondo tragico Fantozzi),
f. Book titles (Memoriale, Vangelo, Divina Commedia), 52
g. Band names (Depeche Mode; Neganeura).
Abbreviations are written entirely in block capitals, without full stops between the letters: BTP, MPS.
Capitals are used in accordance with their role in traditional grammar, and with the latitude that grammar
allows: e.g., in expressions which contain toponyms and odonyms preceded by a common noun, only
the proper name always requires a capital letter, whereas the common noun can be written with either a
capital or a small letter; as for titles of works, a capital is compulsory only for the first
letter (see Serianni 1997:46).
Italian Glossary of non-standard and regional expressions

52 For titles, citations and discourse, inverted commas were not used.
a' = ai
a i' = al
'a= la
abbi= abbia
'abbè = vabbè
ae' = avere
aere = avere
aerlo= averlo
ahia = interjection
ahm = onomatopoeia
ahò = interjection
aimmeno, aimmen = almeno
allo'= allora
ammeno= almeno
analizza'= analizzare
anda' = andare
andaa = andava
andao = andavo
anderà = andrà
andesti = andasti
antro, antra = altro, altra
'apito = capito
apri'= aprire
arà = avrà
ara' = avrai
'arda= guarda
arebbe = avrebbe
arria = arriva
arriò = arrivò
aspe'= aspetta (2nd singular
person)
ata= altra
attro, attri = altro, altri
ave' = avere
aveo, avei, avea, aveano, avean =
avevo, avevi, aveva, avevano,
avevano
avessin = avessero
avra' = avrai
avre' = avrei
'azzo = cazzo
barre = bar
beh= interjection
bell' e = bello e = già
bigherino = lace
bòna, bòni, bòno = buona, buoni,
buono
bongiorno = buongiorno
bra'= bravo
briaca, briache = ubriaca,
ubriache
brioscia = brioche
brum = onomatopoeia
brururum = onomatopoeia
bum= onomatopoeia
burum burum = onomatopoeia
bz = onomatopoeia
capi' = capire
capi' = capito
ché = perché
chello, chelli = quello, quelli
cherce = querce
chesto, chesta = questo, questa
chi = qui
chie ? = chi ?
chiù = più (variant from
Campania)
ciaccioni= impiccioni
'cidenti = accidenti
co' = con, coi
còcere = cuocere
compra' = comprare
credeo = credevo
da' = dai
da i'= dal
da'= dare
dagnene = 'daglielo/a/i/e', 'dagliene', 'darglielo/a/e/i', 'dargliene'
dalle, dalli = dagli (interjection)
dao = davo
de' = del, dei
dèo, dèi, dèe, dèano, dèan = devo,
devi, deve, devono, devono
devan = devano, devono
di' = del
di' = di' (to say imperative)
diceo, dicea, diceano = dicevo,
diceva, dicevano
dignene = 'diglielo/a/i/e', 'digliene', 'dirglielo/a/e/i', 'dirgliene'
dimórto = di molto, di molto
dio bò' = dio buono
dividano = dividano, dividono
do'= dove
dòle = duole
domall' altro = domani l'altro
doveo, dovea, dovean = dovevo,
doveva, dovevano
doventò= diventò
drento = dentro
du' = due
dugentomila = duecentomila
dum = onomatopoeia
dumila = duemila
e' = il, i
e'= lui/loro (subject) and io
'e = le (article, southern variant)
ecce' = eccetera
embè = interjection
esse' = essere
fa' = fai, fare
facci = faccia
faccino = facciano
faceo, facei, facea, facevano,
facean = facevo, facevi, faceva,
facevano, facevano
'fatti = see 'nfatti
fizzaa = see infizzaa
fo = faccio
fòco = fuoco
fòri = fuori
frazio = puzza
gioane = giovane
'giorno = see 'ngiorno
gli, gl' = subject pronoun 3rd
singular
'gna= bisogna
gnam = onomatopoeia
gnamo = andiamo
governavi = governavi, governavate
gozza' = gozzare = ingozzarsi
gua' = guarda
guarda' = guardare
guardaan = guardavano
ha' = hai
ha' voglia = hai voglia =
certamente
he' = hai
i' = il
icché= che cosa, quello che
(interrogative/relative pronoun)
ieh = interjection
ih= interjection
indo' = dove
indoe = dove
infizzaa = infilzava
intendano = intendono
'io = Dio
'isto, 'isti = visto, visti
ito= andato
laóra = lavora
le'= lei
learsi = levarsi
leato, leata = levato, levata
leggile = leggerle
lu' = lui
ma' = mai
ma'= mamma
macellaro = macellaio
mandaa = mandava
mane mano = a mano a mano
mangia' = mangiare
mangiari = pietanze
mangitoie= mangiatoie
mannaja = mannaggia (interjection)
mésse = mise
méssi = misi
mettan = mettano, mettono
metteo, mettea = mettevo,
metteva
mettignene = mettiglielo/a/i/e, mettigliene, metterglielo/a/i/e, mettergliene
mi' = mia, mio
mo' = adesso, ora
mòre = muore
mórto= molto
mòve = muove
mòvere = muovere
muah = onomatopoeia
'n = in
'n', 'na, = un' , una (variant from
Campania)
'ndo'= dove
'ndove= dove
ne'= nei
nesci = gnorri
'nfatti = infatti
'ngiorno = buongiorno
'nsomma = insomma
ni' = nel
'nnaggia = mannaggia
nòvo, nòva = nuovo, nuova
òmini= uomini
òmo = uomo
'orte = volte
ostia= interjection
pa' = padre
paga' = pagare
pah = onomatopoeia
parean = parevano
passai= passati
pe' = per
pem = interjection
perdie = perdio (interjection)
'petta = aspetta
pi pi pi pi = onomatopoeia
piacea = piaceva
piantaa = piantava
piglia' = pigliare
'pito = capito
pò = puoi, può
po' = poi
poera, poeri, poerino, poerina,
poerini, poeretti = povera, poveri,
poverino, poverina, poverini,
poveretti
poho= poco
pòi = puoi
pòle = può
polea= poteva
poleva= poteva
pomarola = pummarola
portagnene = 'portaglielo/a/i/e', 'portagliene', 'portarglielo/a/e/i', 'portargliene'
portao = portavo
possano = possano, possono
poteo, potea = potevo, poteva
potette = potè
prendilo = prenderlo
presempio = per esempio
proemi = problemi
proi = provi
prr = onomatopoeia
pulìo = pulivo
quarcuna = qualcuna
quande = quando
quante = quanto
que' = quei
QUELL' ALTRI = QUEGLI ALTRI
qui' = quel
rendeo = rendevo
rendano = rendano, rendono
resta' = restare
richiedegnene = 'richiediglielo/a/i/e', 'richiedigliene', 'richiederglielo/a/e/i', 'richiedergliene'
rideo = ridevo
rifa' = rifare
rifò = rifaccio
rinvortata = rinvoltata
riprendila = riprenderla
riscòte' = riscuotere
riscòto = riscuoto
ritonfo = mi tocca ritornare
ritorna' = ritornare
'rivederci = arrivederci
rivòle = rivuole
rompé = ruppe
rompiede= ruppe
rompò = ruppe
ròta, ròte = ruota, ruote
sa' = sai
sape'= sapere
sapeo, sapean = sapevo, sapevano
sare' = sarei
scarza = scalza (scalzare verb)
se' = sei (numeral and verb)
sede' = sedere
segnaa = segnava
seguro = sicuro
sembraa, sembraan = sembrava,
sembravano
sentìo = sentito
'sicologia = psicologia
'sicologo = psicologo
sie = interjection or 'sì' (meaning
"no")
smette' = smettere
so' = sono
sòcera = suocera
'somma = see 'nsomma
sòr = signore
sòrdi = soldi
sòrdo = soldo
sorte = esce (from sortire)
SPENDE' = SPENDERE
'spetta = aspetta
'st', 'sta, 'ste = quest', questa,
queste
sta' = stai
sta'= stare
sta'= stare
staa = stata
staa = stava
'ste robe che qui = queste cose qui
stendea = stendeva
'sto, 'sti = questo, questi
studia' = studiare
su' = suo, sua
su i'= sul
su' = sui
sua = sua, suoi
t' = tu
telefonera' = telefonerai
tenea = teneva
tenevi = tenevi, tenevate
tennicamente = tecnicamente
tin = onomatopoeia
to' = tuo
toccaa = toccava
toh = interjection
troa = trova
tu' = tuo-a-e , tuoi
tu tu tu tu = onomatopoeia
tum= onomatopoeia
'u = il
uh= interjection
'un= non
va' = vai
vabbè = va bene
vabbò= va bene (buono)
ve' = vedi
vedano, vedan = vedano, vedono
vede' = vedere
vedea = vedeva
vèn = vien
vengan = vengono
veni' = venire
venìa, venìan = veniva, venivano
vie' = vieni (imperative)
vo = vado
vò'= vuoi
vò'= vuole
voglian = vogliano, vogliono
vòl = vuole
vòle = vuole
voleo, volea = volevo, voleva
vorre' = vorrei
vorte = volte
vota = volta
vòto = vuoto
vu' = voi
vuum = onomatopoeia
www = onomatopoeia
za = onomatopoeia
zzz = onomatopoeia
High frequency interjections and discourse particles
1  EH
2  MH
3  AH
4  VABBE’
5  OH
6  MAH
7  BEH
8  CIAO
9  BOH
10 MAGARI
11 BUONASERA
12 BAH
13 HE
14 MAMMA_MIA
15 OHI
16 VIA
17 GRAZIE
18 UH
19 HAN
20 EEE’
21 MAMMA
22 ODDIO
23 ACCIDENTI
24 BASTA
25 MADONNA
26 PER_CARITA’
27 UEH
28 EHM
29 IEH
30 PERDIO
31 AHI
32 BU
33 IH
34 UHM
35 ARRIVEDERCI
36 BUONGIORNO
37 CHI_SE_NE_FREGA
38 HEI
39 MACCHE'
40 MANNAGGIA
41 SH
42 VAFFANCULO
43 AHO'
44 BUONANOTTE
45 CHE_PALLE
46 CRISTO
47 DAGLI
48 EHI
49 MALE
50 MENO_MALE
51 OOH
52 AHM
53 AMEN
54 CASPITA
55 GRAZIE_AL_CIELO
56 HI
57 HUM
58 IA
59 IN_BOCCA_AL_LUPO
60 MA_VA'
61 MARAMAO
62 MHM
63 MOH
64 OHE'
65 OHIOHI
66 PERDINCI
67 PREGO
68 TOH
69 UFFA
70 UHI
71 UMH
72 VABBUO’
French transcription conventions
1. Acronyms were transcribed using the common practice in French texts: sometimes with dots (e.g.
C.N.R.S), sometimes without (e.g. ATALA).
2. Some proper names were anonymized (and the corresponding fragment was replaced by a beep in the
sound file) using the following convention:
a. _P1, _P2 , etc. for names of persons;
b. _S1, _S2, etc. for names of companies ("sociétés");
c. _T1, _T2, etc. for names of places;
d. _C1, _C2, etc. for numbers ("chiffres"), such as telephone or credit card numbers.
3. Undecidable spellings were put within parentheses. E.g. il(s) voulai(en)t pas, j'en (n')ai pas voulu.
4. Titles (works, radio broadcasts, etc.) were enclosed within quotes. E.g. "Fables de La Fontaine".
5. Phonetic transcriptions (particular or deviant pronunciations) were given when needed (on the separate
%pho tier) using the SAMPA alphabet.
The SAMPA alphabet for French53.
a. Consonants

            Symbol    Example    Transcription
Plosives    p         pont       po~
            b         bon        bo~
            t         temps      ta~
            d         dans       da~
            k         quand      ka~
            g         gant       ga~
Fricatives  f         femme      fam
            v         vent       va~
            s         sans       sa~
            z         zone       zon
            S         champ      Sa~
            Z         gens       Za~
Nasals      m         mont       mo~
            n         nom        no~
            J         oignon     oJo~
            N         camping    ka~piN
Liquids     l         long       lo~
            R         rond       Ro~
Glides      w         coin       kwe~
            H         juin       ZHe~
            j         pierre     pjER
b. Vowels

               Symbol    Example      Transcription
Oral           i         si           si
               e         ses          se
               E         seize        sEz
               a         patte        pat
               A         pâte         pAt
               O         comme        kOm
               o         gros         gRo
               u         doux         du
               y         du           dy
               2         deux         d2
               9         neuf         n9f
               @         justement    Zyst@ma~
Nasal          e~        vin          ve~
               a~        vent         va~
               o~        bon          bo~
               9~        brun         bR9~
Indeterminate  E/        = e or E
               A/        = a or A
               &/        = 2 or 9
               O/        = o or O
               U~/       = e~ or 9~

53 http://www.phon.ucl.ac.uk/home/sampa/home.htm

High frequency interjections and discourse particles

1  ben
2  bon
3  hein
4  mh
5  ah
6  quoi
7  hum
8  eh
9  oh
10 bé
11 pff
12 bah
13 allô
14 tiens
15 pardon
16 merci
17 hé
18 tth
19 na
20 là
21 comment
22 attention
23 ciao
Portuguese transcription conventions
The general orthographical norms
As a general rule, the team transcribed the entire corpus according to the official orthography.
a. Free variation
Variants of a word were transcribed whenever they are registered in the reference Portuguese dictionaries.
Such is the case of phonetic phenomena like aphaeresis: ainda > inda; prothesis: mostrar > amostrar;
metathesis: cacatua > catatua; or alternations: louça / loiça.
b. Proper names
It is extremely difficult in spoken texts to accurately establish the difference between proper and common
names. For this reason, the team's tradition is to transcribe with an initial capital letter only anthroponyms and
toponyms. These names correspond, effectively, to the functions most consensually assigned to proper
names: designating and not signifying (cf. Segura da Cruz, 1987: 344-353). Some examples are, for
toponyms: África, Bruxelas, Casablanca, China, Lisboa, Magrebe, Pequim; and for anthroponyms: Boris
Vian, Camões, Einstein, Madonna, Gabriel Garcia Marques, Oronte.
c. Titles
Titles are in quotes, such as books: “crónica dos bons malandros”, movies: “pixote”, pieces of music:
“requiem” (de Verdi), newspapers: “expresso”, radio programmes: “feira franca” or television broadcasts:
“big brother”.
d. Foreign words
Foreign words were transcribed in the original orthography whenever they were pronounced closely to the
original pronunciation: cachet, check-in, feeling, free-lancer, emmerdeur, ski, stress, voyeurs, workshop.
When the foreign words were pronounced in a “Portuguese way” they were transcribed according to the
entries of the Portuguese reference dictionaries or according to the orthography adopted in those dictionaries
for similar cases. In this last case are included all the hybrid forms like pochettezinha, shortezinho, stressado,
videozinho.
e. Wrong pronunciations
When the speaker mispronounced a word and immediately corrected it, the two spellings were maintained in
the transcription of the text: lugal/lugar, sau/céu. If the speaker misspelled a word and went on in his speech
without any correction, the standard spelling of the word was kept in the transcription and a note regarding
the wrong pronunciation was added in the header comments (for instance, in the text pfamdl07 the informant
produces “banrdarilheiro” instead of bandarilheiro).
f. Paralinguistics and Onomatopoeia
Paralinguistic forms and onomatopoeia not registered in the reference dictionaries were transcribed to
represent, as closely as possible, the sound produced: pac-pac-pac, pffff, tanana, tatata, uuu.
g. Acronyms
The acronyms were transcribed in capital letters, without dots: TAP and not T.A.P. However, if the acronym
already has an entry in Portuguese dictionaries as a common name, it was transcribed in lower case, like sida
or radar.
h. Shortened forms
Shortened forms were kept in the transcription: prof.
i. Numbers
Dates and numbers were always written out in full: mil novecentos e trinta e cinco; mil escudos; mil
paus.
l. Letters
The names of the letters were always written out in full: pê for P, erre for R, xis for X.
Interjections, Discourse Markers, Emphatics

Position
1  POIS
2  PORTANTO
3  SIM
4  AH
5  PRONTO
6  ENTÃO
7  PÁ
8  CLARO
9  LÁ
10 AI
11 OLHA
12 ASSIM
13 EXACTO
14 DIGAMOS
15 OLHE
Spanish transcription conventions
As far as the transcription of the sound files is concerned, the Spanish team have respected the rules set by
the Real Academia Española.
1. Acronyms and symbols.
Acronyms are shown in block capitals and without dots, as in IVA, ONG, NIF... We used small letters for the
plural forms of these acronyms, as in ONGs. In the case of symbols related to science (chemistry and so on)
we have followed the usual conventions. The letter x was used in certain contexts to refer to a non-specific
quantity. No abbreviations were used in the transcription of the sound files.
2. New words.
We included words which, although not included in the Real Academia Española dictionaries, have a high
frequency of occurrence in spoken Spanish, such as porfa, finde or pafeto.
3. Numbers.
Numbers were transcribed in letters, except in the cases of numbers which are part of proper nouns, as in La
2, and numbers included in mathematical formulas. Roman numbers were used when referring to popes or
kings, as in Juan XXIII, as well as in names in which they are included, as N - III.
4. Capital letter.
Capital letters were used when transcribing proper nouns which made reference to people (including
nicknames), as Inma or el Bibi, as well as cities, countries, towns, regions, districts, squares, streets and so
on, as in Segovia, Carabanchel or Madrid. The same was applied to names of institutions, entities,
organisations, political parties, etc., as in the case of Comunidad de Madrid, Ministerio de Hacienda, la
Politécnica.
Capital letters were also used when naming scientific disciplines, as well as entities which are
considered absolute concepts and religious concepts, such as la Sociología, Internet, la Humanidad, tu Reino
y el Universo, while in the case of Señor salvador the second word, being an adjective, will not be written in
capital letters.
Names of sports competitions were transcribed (all words) with capital letters, as in Copa del Rey or
Champions League.
In the case of books and song titles, as well as all kinds of works of art, even television programs, only
the first word was written in capital letters, except in cases like Las Meninas (as is conventional). Both nouns
and adjectives included in the names of newspapers, magazines and such were written with capital letters, as
in El Mundo or El País. The names of stores and commercial brands as El Corte Inglés were transcribed
following the registered name.
5. Italics.
No italics were used in the transcription of the sound files.
6. Foreign words.
Foreign words were written following the original. Words of foreign origin were written with the original
orthography when pronounced in that language, whether they're included in the academic dictionaries or not.
When adapted to Spanish, these words were transcribed following the rules set in these dictionaries.
Orthography for non-standard words.
These non-orthographic productions have been labelled in C-ORAL-ROM as %alt and are listed below,
with the standard form on the left and the form actually produced on the right:
standard                       produced (%alt)
abrís                          abréis
además                         amás
adictiva                       aditiva
adonde                         aonde
además                         aemás
adelante                       alante
armónica                       amónica
Amsterdam                      Amsterdar
adonde                         aonde
arcadas                        arcás
ortodoncias                    artodoncias
audiencia                      audencia
azulejos                       alzulejo
básico                         básimoco
a El                           al
be cero                        berceo
Durán i Lleida                 Durán Lleida
Josep Andoni Durán i Lleida    Josep Andoni Durán Lleida
bungaló                        bílgaro
bronconeumonía                 bronconeunomia
casa                           ca
claro                          cao
claro                          caro
cassette                       casé
Champions                      Champion
chiquitita                     chiquetita
chisme                         chirme
alcoba                         colba
cuélgalo                       cólgalo
compacts                       compas
consciente                     conciente
conscientes                    concientes
corazón                        corasón
cuadraditos                    cuadraitos
cuñada                         cuñá
decorado                       decorao
delegado                       delegao
demasiado                      demasiao
desilusión                     desilución
desplazarte                    desplatazarte
divorciaron                    devorciaron
dijéramos                      diciéramos
dijiste                        diiste
diagnosticado                  dinosticado
disminuirá                     disminurá
luego                          dogo
durmió                         dormió
engaño                         encaño
enseguida                      enseguía
entonces                       entoes
entonces                       entoces
entonces                       entones
entonces                       tonces
entregar                       entrebar
escuela                        escuella
especificidad                  especifidad
estaros                        estaos
estructura                     estrutura
estuvisteis                    estuvistesis
extranjeros                    extranyeros
farolero                       falorero
frase                          flase
friegaplatos                   fregaplatos
friolera                       friorera
habéis                         habís
habla                          hambla
hecho                          hezo
hubieses                       hubiecese
importante                     importanto
importa                        impota
infracción                     infranción
instituto                      istituto
instrumento                    instumentro
instrumento                    trumento
interfono                      intérfono
entierro                       intierrro
Jesús                          Jesu
joder                          jode
joder                          joe
joder                          joé
joder                          joer
joder                          oe
lado                           lao
lados                          laos
luego                          logo
me ha                          ma
maternidad                     marternidad
mecanismos                     mecaneismos
mediados                       mediaos
bueno                          meno
mercado                        mercao
mental                         mertal
meteorología                   metereología
armónica                       mónica
te ha                          ta
muy                            mu
nada                           na
nada                           nara
laten                          naten
buenos                         nos
obtusas                        obstrusas
oftalmólogo                    ofmólogo
a donde                        onde
padrino                        paíno
para adelante                  palante
para                           pa
pasado                         pasao
patada                         patá
espera                         pera
pesado                         pesau
pesados                        pesaos
pescado                        pescao
pues                           po
pues                           pos
comprado                       pompao
pero                           poro
positivo                       posetivo
precipitarnos                  precipietarnos
prénsiles                      prensiles
proporcionaba                  preporcionaba
pringado                       pringao
propia                         pripia
pues                           pu
puede                          pue
podamos                        puedamos
puedes                         pues
pues                           pue
puñado                         puñao
diálogos                       reálogos
relativamente                  relamente
resultado                      resultao
resultados                     resultaos
se ha                          sa
se había                       sabía
ser                            saer
siguiendo                      seguiendo
sexo                           seso
situado                        sintuado
socialmente                    socialremente
nos                            son
superponiendo                  superponiendos
sostenible                     sustenible
está                           ta
también                        tamién
están                          tan
esté                           te
tendrán                        terán
tienes                         tie
tienen                         tien
tienes                         ties
tinglado                       tinglao
todavía                        toavía
tubería                        tobería
entonces                       toces
toda                           toa
todas                          toas
todo                           to
todos                          tos
tobillo                        torbillo
tutiplén                       tutiplé
estuvo                         tuvo
o                              u
utilizó                        utilició
verdad                         verdá
vaya                           yava
Interjections.
¡ah!
¡anda!
¡brum!
¡bueno!
¡chun!
¡Dios mío de mi alma!
¡ey!
¡hombre!
¡jobar!
¡jolines!
¡madre!
¡madre mía de mi vida y mi corazón!
¡mua!
¡ojo!
¡ostras!
¡oy!
¡pum!
¡ups!
¡ahí va!
¡bah!
¡buah!
¡cachis en la mar!
¡coño!
¡eh!
¡hala!
¡jo!
¡joder!
¡leche!
¡madre del amor hermoso!
¡madre mía!
¡oh!
¡ole!
¡ouh!
¡por Dios!
¡uh!
¡yeah!
APPENDIX 4 C-ORAL-ROM PROSODIC TAGGING EVALUATION REPORT
1 Introduction
This is the final report on the evaluation of the C-Oral-Rom prosodic tagging. Loquendo organized
the evaluation on behalf of the C-Oral-Rom Consortium. The evaluation for the four Project
languages took place during Fall 2003.
The evaluation had the main goal of assessing the perceptual relevance of the coding scheme
adopted in the Project for annotating prosodic breaks, as applied to the different languages. According to
the Consortium requirements (see the Technical Annex of the C-ORAL-ROM Project), two naïve
evaluators for each of the four languages were asked to evaluate the prosodic annotation of a subset
of the corpus in their language. Each evaluator, independently of the other, had to examine the
original annotation and possibly correct it by deleting, inserting or substituting prosodic-break tags.
The experimental setting is described in Chapter 2.
The evaluation data were then gathered and statistically analyzed in order to measure the
degree of consensus expressed by the evaluators towards the original annotation. The adopted
metrics are described in Chapter 3. The cases of disagreement, where one or both evaluators
corrected the original annotation, were compared with the total number of word boundaries and,
more perspicuously, with the number of positions which are reasonable candidates for a prosodic
break. The agreement between the evaluators themselves was also measured.
On the basis of the assumption that an annotation scheme is good if it can be applied with a high
degree of accordance by two or more independent annotators, the Consortium decided to apply
replicability statistics to compare the three annotations obtained by C-ORAL-ROM and by the two
evaluators. Following previous work on annotation coding for discourse and dialogue54, the Kappa
coefficient was calculated to show the replicability of the results. Such a statistic is useful to test not only
whether the annotators agreed on the majority of the coding, but also to what extent that agreement is
significantly different from chance.
As will be explained in paragraph 3.5.2, the Kappa coefficient measures pair-wise
agreement among a set of annotators, correcting for the expected chance agreement. When there
is total agreement, Kappa is 1; when the agreement is no better than chance, Kappa is 0. In the
literature there has been some discussion about what makes a 'good' level of agreement; this probably
depends on several different variables, including the presence of an expert annotator. In this work
the expert coder (i.e. the C-Oral-Rom portions of the corpora) was compared to the naïve coders'
choices.
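For reference, the standard formulation of the coefficient (Cohen's Kappa, the statistic used in the annotation literature cited above; the report's own exposition is given in paragraph 3.5.2) is:

    \kappa = \frac{P(A) - P(E)}{1 - P(E)}

where P(A) is the observed proportion of agreement between two annotators and P(E) is the proportion of agreement expected by chance, so that Kappa equals 1 under total agreement and 0 under chance-level agreement.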
2 Experimental Setting
The goal of the evaluation is to assess the reliability of the prosodic tagging of the C-ORAL-ROM
speech corpora. Such tagging consists of marking prosodic breaks in the orthographic
transcription of speech. Each word boundary in the corpus is a potential position for a break. Breaks
can be non-terminal or terminal. Terminal breaks are considered the main cue for the detection of
utterances, which are the reference units of spontaneous speech. The annotation is based only on
perceptual judgments. It is the result of a first labeling by a first annotator and two successive
revisions by two different annotators.
The evaluation performed by Loquendo aims to test the hypothesis that prosodic breaks, especially
the terminal ones, have strong perceptual prominence and can be detected with a high level of
inter-annotator agreement.
54 Isard, A. and Carletta, J. 1995. "Replicability of transaction and action coding in the Map Task corpus". In J. Moore et al. (eds.),
Empirical Methods in Discourse Interpretation and Generation, Working Notes of the AAAI Spring Symposium Series, Stanford
University, Stanford, CA, 60-66.
The C-ORAL-ROM Project provides four multimedia corpora of spontaneous speech for French,
Italian, Portuguese and Spanish, each amounting to about 300,000 words (roughly 35 hours of
speech). Given the size of the resource, the evaluation of prosodic tagging was necessarily
performed on a statistically significant portion of each corpus. According to the Consortium
requirements, a subset was extracted from each language corpus, amounting to roughly 1/30 of its
utterances (about 1,300 utterances and around 1:30 hours of speech). The speech sections to be
evaluated were automatically selected with a random procedure ensuring the same distribution of
speech types as in the corpus, which is organized according to the following tree structure:

Language Corpus
  Informal
    Private            dialogue, monologue
    Public             dialogue, monologue
  Formal
    Natural context    dialogue, monologue
    Media              dialogue, monologue
    Telephone          dialogue
The selection procedure (see Appendix A.3)55, based on the software tool Win Pitch Corpus,
extracts dialogues/monologues in the same proportion from each node of the tree and guarantees
the semantic and contextual coherence of the speech sections to be evaluated, by choosing continuous series
of utterances where possible, and by also providing the utterances surrounding the selected
sections. For each selected speech section, the procedure outputs an XML file ensuring text-audio
alignment and a text file where each tagged utterance is reported twice (validation copy).

55 The Appendixes to this evaluation report are not included in DESIGN.doc; they can be found on the DVD9 in the directory "Utility/specifications".
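The idea behind such a proportional selection can be pictured with the following sketch. This Perl fragment is illustrative only (the actual procedure is part of Win Pitch Corpus; the node names and utterance counts below are invented):

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Draw about 1/30 of the utterances from each node of the corpus tree,
# so that the sample keeps the corpus distribution of speech types.
my %tree = (
    'informal/private/dialogue' => [1 .. 600],   # utterance ids (invented)
    'informal/public/monologue' => [1 .. 300],
    'formal/media/dialogue'     => [1 .. 240],
);
for my $node (sort keys %tree) {
    my @utts = @{ $tree{$node} };
    my $n    = int(@utts / 30 + 0.5);
    my @pick = (shuffle @utts)[0 .. $n - 1];
    printf "%s: %d of %d utterances selected\n", $node, scalar @pick, scalar @utts;
}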
The generation of samples for the evaluation was accomplished at Loquendo premises with the
assistance of an engineer from the University of Florence acting as consultant.
For each language, two mother-tongue evaluators were chosen, with a medium educational level and no
specific expertise in phonetics or prosody (see Appendix A.1 and Appendix A.6). No constraint
on regional origin was imposed (see Appendix A.2), except for Portuguese, for
which both the European and Brazilian varieties were covered.
The evaluators received a two-day training (see Appendix A.4), in which the trainers illustrated the
goal of the evaluation, the C-ORAL-ROM corpus structure, the meaning and format of the prosodic
tagging, and the evaluation criteria and procedure. The notions of terminal and non-terminal break were
carefully explained by discussing specific examples extracted from the corpus. Written instructions
were delivered to the evaluators (see Appendix A.4). At the end of the training, a test was
performed in order to assess the acquired competence of the evaluators and to ensure consistency
between them in the evaluation.
The evaluation was carried out from mid September to mid October 2003, for the Italian and
Portuguese corpora, and from mid October to mid November, for French and Spanish. Each
evaluator worked independently of the others and spent around sixty hours to accomplish his task,
in four-hour daily sessions.
The task was performed on personal computers, with the help of the WPC tool, which allowed the
evaluators to view the annotated text under evaluation and to listen to the corresponding aligned audio
signal. The evaluation file, as output by the sample generation procedure (see Appendix A.3), reported each
annotated utterance twice. The second copy of the utterance was the validation copy that the
evaluator could modify when he did not agree with the original tagging. The evaluator listened to
the selected utterance (no more than three times) and considered the possible existence of prosodic
breaks at each word boundary position. If his perception did not match with the original tagging, he
could modify the validation copy by inserting, deleting or substituting break marks. In case the
evaluator did not understand part of the utterance or he was not able to evaluate it, he could exclude
that text portion by enclosing it between two asterisks. The detailed evaluation procedure is
attached in Appendix A.5.
None of the eight evaluators reported any difficulties in the evaluation and all of them could easily
accomplish their task.
All the evaluation files were carefully checked in order to detect possible mistakes (double spacing,
missing blank, incorrect tag, word deletion, etc.). Thereafter, the evaluation files were ready to be
analyzed in order to measure the degree of consensus expressed by the evaluators with respect to
the C-ORAL-ROM prosodic tagging.
3 Measures and Statistics
The measures and statistics adopted by Loquendo are based on the requirements expressed in the C-ORAL-ROM Technical Annex56. These requirements are:
"The evaluation will focus on the following cases:
1) whether or not the selected utterance actually ends up with a perceptually relevant prosodic break (// or at least
/) (generic confirmation)
2) whether the evaluator actually perceives that break as terminal (specific confirmation)
3) whether a perceived terminal break turns out non marked as a terminal break (lack of a //) (terminal missing)
4) whether to her/his perception any non terminal break marked within that utterance is on the contrary terminal
(added terminal evaluation)
5) whether a perceived non terminal break turns out not marked as a non terminal break (non terminal missing)
[...]
The measurements will ensure the evaluation of consensus at least in terms of:
1) number of breaks registered
2) Kappa statistics
3) percentage of consensus on terminal and non terminal breaks
[...]
The institution under subcontracting will provide statistical analysis that clearly distinguishes the following cases, that
will be explicitly reported in the description of the resource.
1) strong disagreement (disagreement by both evaluators)
- With respect to the terminal prosodic tags
- With respect to non terminal tags
2) partial consensus (disagreement by only one evaluator)"
3.1 Evaluation data
The evaluation data are obtained from the evaluation files through a number of steps, described in
the following paragraphs.
Word boundaries W (candidate positions for prosodic breaks) are classified for the purpose of the
evaluation into the following classes:
Tag   Semantics
O     no break
N     non-terminal break (/, [/], ///)
T     terminal break (//, ?, ..., +)

56 Technical Annex of the C-Oral-Rom Project, Annex 1
Each position in the evaluation file is classified with a tag expressing the agreement of the evaluator
with the original annotation. In the following table, the tags are ordered according to their increasing
degree of disagreement.
Tag   Semantics                                 Criticality
0O    agreement on non-break boundary           OK
0N    agreement on non-terminal break           OK
0T    agreement on terminal break               OK
1i    non-terminal insertion                    non-critical
1d    non-terminal deletion                     non-critical
2ns   non-terminal break substitution (N->T)    critical
2ts   terminal break substitution (T->N)        critical
3i    terminal insertion                        very critical
3d    terminal deletion                         very critical
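As a minimal sketch of this classification (not the parser actually used in the evaluation; the function name and tag handling are our own), the agreement tag can be derived from the pair of boundary classes as follows:

    def agreement_tag(original: str, evaluator: str) -> str:
        """Map the (original, evaluator) boundary classes O/N/T to the
        agreement tags of the table above."""
        if original == evaluator:                 # full agreement
            return {"O": "0O", "N": "0N", "T": "0T"}[original]
        if original == "O":                       # evaluator inserted a break
            return "1i" if evaluator == "N" else "3i"
        if evaluator == "O":                      # evaluator deleted a break
            return "1d" if original == "N" else "3d"
        # both marked a break, but of different types
        return "2ns" if original == "N" else "2ts"   # N->T, or T->N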
Computations on the evaluation files are performed according to the corpus tree-structure described
in Chapter 2, starting from each leaf of the tree, i.e. each dialogue/monologue file. Then, for each
node, its measures are obtained by summing up the measures performed on the files
(dialogues/monologues) subsumed by the node. Cumulative computations are also performed
horizontally (e.g. by summing up the measures of all the "dialogue" nodes, or all the "monologue"
nodes).
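A hedged sketch of this bottom-up aggregation (the node names and counts below are invented for illustration; the real per-file measures are those listed in 3.4):

    from collections import Counter

    # toy corpus tree: inner nodes are dicts, leaves are Counters holding
    # the measures of one dialogue/monologue file
    corpus = {
        "dialogues":  {"filea": Counter({"W": 310, "T": 25, "N": 60}),
                       "fileb": Counter({"W": 280, "T": 21, "N": 55})},
        "monologues": {"filec": Counter({"W": 400, "T": 30, "N": 90})},
    }

    def aggregate(node):
        """Measures of a node = sum of the measures of the leaves it subsumes."""
        if isinstance(node, Counter):          # leaf: one file
            return node
        total = Counter()
        for child in node.values():
            total += aggregate(child)
        return total

    print(aggregate(corpus))                 # whole-corpus totals
    print(aggregate(corpus["dialogues"]))    # one cumulative node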
In the following, we describe the measures performed on each leaf file.
3.2 First step: binary comparison file
Starting from an evaluation file, which reports both the original tagging in the C-ORAL-ROM
corpus (C) and the evaluator choice (E), a first parser generates a comparison file (B) where each
word boundary (candidate position for a break) is represented as a record, with the following
information:
b.1   utterance number (see note 57)
b.2   speaker name
b.3   position (sequential number) of the word boundary
b.4   character position in the original line
b.5   character position in the evaluator line
b.6   break found at the given position in the original text
b.7   class (O, T, N) of the original boundary
b.8   break found at the given position in the evaluator text
b.9   class (O, T, N) of the evaluator boundary
b.10  agreement tag (0O,0N,0T,1i,1d,2ns,2ts,3i,3d) between the original and the evaluator
Positions excluded from the evaluation do not contain information from b.4 to b.10 and are marked
explicitly with the following expressions:
"hhh"            non-linguistic phenomena
"Not evaluated"  text portions not understood by the evaluator

57 The notion of utterance should in principle correspond to a text portion delimited by terminal breaks. It turns out that in some cases a line in the evaluation file is not terminated with a break, so it is improperly counted as an "utterance". This is not a problem for the evaluation, as the end-of-line position will simply be marked with an O and evaluated as usual.
Example of comparison file B:

UTT  SPEAK  POS.  CHAR-R  CHAR-E  BREAK-R  VALUE-R  BREAK-E  VALUE-E  AGREE
1           0*    12      12      //       T        //       T        0T
2           0*    13      13      //       T        //       T        0T
3           0     5       5                O                 O        0O
3           1     19      19               O                 O        0O
...
3.3 Second step: ternary comparison file
Starting from the two comparison files B-E1 and B-E2 for the two evaluators E1 and E2, a new file
(T) is generated, where each word boundary is represented as a record, with the following
information:
t.1   utterance number
t.2   speaker name
t.3   position (sequential number) of the word boundary
t.4   class (O, T, N) of the boundary according to the original C
t.5   class (O, T, N) of the boundary according to evaluator E1
t.6   class (O, T, N) of the boundary according to evaluator E2
t.7   agreement tag (0O,0N,0T,1i,1d,2ns,2ts,3i,3d) between the original C and E1
t.8   agreement tag (0O,0N,0T,1i,1d,2ns,2ts,3i,3d) between the original C and E2
t.9   agreement tag (0O,0N,0T,1i,1d,2ns,2ts,3i,3d) between E1 and E2
Positions excluded from the evaluation are explicitly marked as follows:
- '?' in columns t.4 to t.9:     non-linguistic phenomena
- '*' in columns t.5, t.7, t.9:  text portions not understood by evaluator E1
- '*' in columns t.6, t.8, t.9:  text portions not understood by evaluator E2
- '*' in columns t.4 to t.9:     text portions not understood by both evaluators
Example of ternary comparison file T:

UTT  SPK  POS  REF  EV1  EV2  R-E1  R-E2  E1-E2
1    PAT  0    O    O    O    0O    0O    0O
1    PAT  1    O    O    O    0O    0O    0O
1    PAT  2    O    O    O    0O    0O    0O
1    PAT  3*   T    T    T    0T    0T    0T
2    ROS  0*   ?    ?    ?    ?     ?     ?
3    MIG  0    N    N    N    0N    0N    0N
3    MIG  1    O    O    O    0O    0O    0O
3    MIG  2*   T    T    T    0T    0T    0T
4    GUI  0*   ?    ?    ?    ?     ?     ?
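The ternary file is essentially a positional merge of the two binary files. A simplified sketch (our own field names; the real procedure also propagates the '?' and '*' exclusion markers), reusing the agreement_tag function sketched earlier:

    def ternary_record(b1: dict, b2: dict) -> dict:
        """Build one T record (t.1-t.9) from the two evaluators' B records
        for the same word boundary."""
        assert (b1["utt"], b1["pos"]) == (b2["utt"], b2["pos"])
        return {
            "utt":   b1["utt"],           # t.1 utterance number
            "spk":   b1["spk"],           # t.2 speaker name
            "pos":   b1["pos"],           # t.3 word boundary position
            "ref":   b1["orig"],          # t.4 class (O, T, N) in the original C
            "ev1":   b1["eval"],          # t.5 class according to E1
            "ev2":   b2["eval"],          # t.6 class according to E2
            "r_e1":  b1["agree"],         # t.7 agreement tag C vs E1
            "r_e2":  b2["agree"],         # t.8 agreement tag C vs E2
            "e1_e2": agreement_tag(b1["eval"], b2["eval"]),  # t.9 E1 vs E2
        }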
3.4 Third step: measures
Given a ternary comparison file T, a new file M is generated reporting the measures obtained by
counting the following events:
General data
m.1  "utterances"
m.2  word boundaries
m.3  '?' word boundaries
m.4  T breaks by original annotation C
m.5  N breaks by original annotation C
Binary comparison E1-C
(all measures, including mb1.2 and mb1.3, refer to positions evaluated by E1)
mb1.1  word boundaries evaluated by E1
mb1.2  T breaks by original annotation C
mb1.3  N breaks by original annotation C
mb1.4  T breaks by the evaluator (t.5=T)
mb1.5  N breaks by the evaluator (t.5=N)
mb1.6  agreement t.7 = 0O
mb1.7  agreement t.7 = 0N
mb1.8  agreement t.7 = 0T
mb1.9  disagreement t.7 = 1i
mb1.10 disagreement t.7 = 1d
mb1.11 disagreement t.7 = 2ns
mb1.12 disagreement t.7 = 2ts
mb1.13 disagreement t.7 = 3i
mb1.14 disagreement t.7 = 3d
mb1.15 disagreement t.7 = 3i at "utterance" end (end-of-line)
mb1.16 N break misplacements: occurrences of <1i, 1d> or <1d, 1i> on two consecutive word boundaries with identical speaker name
mb1.17 T break misplacements: occurrences of <3i, 3d> or <3d, 3i> on two consecutive word boundaries with identical speaker name
Binary comparison E2-C
(all measures, including mb2.2 and mb2.3, refer to positions evaluated by E2)
mb2.1  word boundaries evaluated by E2
mb2.2  T breaks by original annotation C
mb2.3  N breaks by original annotation C
mb2.4  T breaks by the evaluator (t.6=T)
mb2.5  N breaks by the evaluator (t.6=N)
mb2.6  agreement t.8 = 0O
mb2.7  agreement t.8 = 0N
mb2.8  agreement t.8 = 0T
mb2.9  disagreement t.8 = 1i
mb2.10 disagreement t.8 = 1d
mb2.11 disagreement t.8 = 2ns
mb2.12 disagreement t.8 = 2ts
mb2.13 disagreement t.8 = 3i
mb2.14 disagreement t.8 = 3d
mb2.15 disagreement t.8 = 3i at "utterance" end (end-of-line)
mb2.16 N break misplacements: occurrences of <1i, 1d> or <1d, 1i> on two consecutive word boundaries with identical speaker name
mb2.17 T break misplacements: occurrences of <3i, 3d> or <3d, 3i> on two consecutive word boundaries with identical speaker name
Ternary comparison E1-E2-C
(all measures, including mt.2 to mt.7, refer to positions evaluated by both E1 and E2)
mt.1  word boundaries evaluated by both E1 and E2
mt.2  T breaks by original annotation C
mt.3  N breaks by original annotation C
mt.4  T breaks by the evaluator E1
mt.5  N breaks by the evaluator E1
mt.6  T breaks by the evaluator E2
mt.7  N breaks by the evaluator E2
mt.8  total agreement (t.7 = t.8 = 0O) on O boundaries
mt.9  total agreement (t.7 = t.8 = 0T) on T breaks
mt.10 total agreement (t.7 = t.8 = 0N) on N breaks
mt.11 total disagreement (t.4 ≠ t.5 ≠ t.6)
mt.12 agreement between evaluators (t.5 = t.6, t.7 = t.8)
mt.13 agreement between evaluators on 1i (t.7 = t.8 = 1i)
mt.14 agreement between evaluators on 1d (t.7 = t.8 = 1d)
mt.15 agreement between evaluators on 2ns (t.7 = t.8 = 2ns)
mt.16 agreement between evaluators on 2ts (t.7 = t.8 = 2ts)
mt.17 agreement between evaluators on 3i (t.7 = t.8 = 3i)
mt.18 agreement between evaluators on 3d (t.7 = t.8 = 3d)
mt.19 partial agreement (t.7 = 0O or t.8 = 0O, t.7 ≠ t.8) on O boundaries
mt.20 partial agreement (t.7 = 0N or t.8 = 0N, t.7 ≠ t.8) on N breaks
mt.21 partial agreement (t.7 = 0T or t.8 = 0T, t.7 ≠ t.8) on T breaks
mt.22 partial agreement 1i: (t.7 = 1i and t.8 = 0O) or (t.8 = 1i and t.7 = 0O)
mt.23 partial agreement 1d: (t.7 = 1d and t.8 = 0N) or (t.8 = 1d and t.7 = 0N)
mt.24 partial agreement 2ns: (t.7 = 2ns and t.8 = 0N) or (t.8 = 2ns and t.7 = 0N)
mt.25 partial agreement 2ts: (t.7 = 2ts and t.8 = 0T) or (t.8 = 2ts and t.7 = 0T)
mt.26 partial agreement 3i: (t.7 = 3i and t.8 = 0O) or (t.8 = 3i and t.7 = 0O)
mt.27 partial agreement 3d: (t.7 = 3d and t.8 = 0T) or (t.8 = 3d and t.7 = 0T)
All the measures described above have been applied to each leaf node of the corpus structure, i.e. to
each evaluated dialogue/monologue file.
3.5 Statistics
Starting from the above computations, a number of statistics can be obtained. Here we describe
those that we have implemented, taking as a starting point the Annex 1 requirements reported above
(3.1), which recommended the computation of Kappa statistics and percentages of consensus on
prosodic breaks.
A further measure initially considered was the pair of Precision and Recall indexes. This measure
was then discarded because it applies to cases where a sequence of values (tags) is compared with a
reference sequence, taken as the correct one. This was not the case of the C-ORAL-ROM
evaluation, where neither the original annotation nor the evaluators' choices could be taken as a
correct reference.
All the statistics described below were applied to the evaluation data. The results are presented in
Chapter 4.
3.5.1 Percentages
The following percentages have been calculated.
For each evaluator (binary comparison):
• Generic confirmation: percentage of T breaks evaluated as 0T or 2ts, 100*(mb.8+mb.12)/mb.2
• Specific confirmation: percentage of T breaks evaluated as 0T, 100*mb.8/mb.2
• Terminal missing: percentage of W's evaluated as 3i, 100*mb.13/mb.1
• Added terminal: percentage of N breaks evaluated as 2ns, 100*mb.11/mb.3
• Non terminal missing: percentage of W's evaluated as 1i, 100*mb.9/mb.1
• Activity rate: percentage of W's actually modified (insertions, deletions, substitutions), 100*(mb.9+mb.10+mb.11+mb.12+mb.13+mb.14)/mb.1
• N break misplacement rate: percentage of N insertions and deletions that may be considered as N misplacements, 100*2*mb.16/(mb.9+mb.10)
• T break misplacement rate: percentage of T insertions and deletions that may be considered as T misplacements, 100*2*mb.17/(mb.13+mb.14)
Percentage of consensus on terminal and non-terminal breaks (ternary comparison):
• Strong disagreement with respect to terminal tags
  o percentage of T breaks evaluated as 3d, 100*mt.18/mt.2
  o percentage of T breaks evaluated as 2ts, 100*mt.16/mt.2
  o (percentage of W's evaluated as 2ns, 100*mt.15/mt.1)
  o (percentage of W's evaluated as 3i, 100*mt.17/mt.1)
• Strong disagreement with respect to non-terminal tags
  o percentage of N breaks evaluated as 1d, 100*mt.14/mt.3
  o percentage of N breaks evaluated as 2ns, 100*mt.15/mt.3
  o (percentage of W's evaluated as 2ts, 100*mt.16/mt.1)
  o (percentage of W's evaluated as 1i, 100*mt.13/mt.1)
• Partial consensus
  o percentage of partially agreed T breaks, 0T vs. 3d or 2ts, 100*mt.21/mt.2
  o percentage of partially agreed T breaks, 0T vs. 3d, 100*mt.27/mt.2
  o percentage of partially agreed W's, 100*(mt.19+mt.20+mt.21)/mt.1
• Total agreement
  o percentage of T breaks evaluated 0T, 100*mt.9/mt.2
  o percentage of N breaks evaluated 0N, 100*mt.10/mt.3
  o percentage of O boundaries evaluated 0O, 100*mt.8/(mt.1-mt.2-mt.3)
  o percentage of totally agreed W's, 100*(mt.8+mt.9+mt.10)/mt.1
• Global disagreement
  o percentage of W's disconfirmed by at least one evaluator, 100*(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18+mt.19+mt.20+mt.21)/mt.1
• Consensus in the disagreement
  o percentage of globally disagreed W's that are actually cases of strong disagreement, 100*(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18)/(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18+mt.19+mt.20+mt.21)
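As a minimal sketch of how the binary percentages listed above can be computed (assuming the counts of 3.4 are stored in a dictionary keyed by their number, e.g. mb[8] for mb.8; zero-denominator guards are omitted):

    def binary_percentages(mb: dict) -> dict:
        """Binary-comparison percentages of 3.5.1, from the counts mb.1-mb.17."""
        return {
            "specific_confirmation": 100 * mb[8] / mb[2],
            "generic_confirmation":  100 * (mb[8] + mb[12]) / mb[2],
            "terminal_missing":      100 * mb[13] / mb[1],
            "added_terminal":        100 * mb[11] / mb[3],
            "non_terminal_missing":  100 * mb[9] / mb[1],
            "activity_rate":         100 * sum(mb[k] for k in range(9, 15)) / mb[1],
            "n_misplacement_rate":   100 * 2 * mb[16] / (mb[9] + mb[10]),
            "t_misplacement_rate":   100 * 2 * mb[17] / (mb[13] + mb[14]),
        }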
Note that a general measure of agreement like the percentage of totally agreed W's, i.e. the ratio
between the number of 0O, 0T, 0N agreement tags and the number of word boundaries W,
(n0O+n0T+n0N)/nW, may sound too optimistic. It is useful to compare it with a baseline
corresponding to the worst possible "realistic" result: if, for example, all N's and T's were deleted
and a comparable number of N's and T's were inserted in different positions, we would still have a
number of agreed positions n0O = nW - 2nN - 2nT. Our percentage of agreement should be
significantly higher than (nW - 2nN - 2nT)/nW in order to provide a positive evaluation of the
original corpus annotation.
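For instance, with the Italian figures reported in 4.1 (nW = 10925, nT = 1372, nN = 2403), this baseline would be (10925 - 2*1372 - 2*2403)/10925 ≈ 31%, far below the total agreement rates actually measured, which are all above 95% (see 4.5).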
3.5.2 Kappa coefficient
The Kappa coefficient measures the agreement among annotators in their coding of data with a
finite set of categories. Kappa is defined as:

    k = (P(A) - P(E)) / (1 - P(E))

where:
- P(A) is the proportion of times that the annotators actually agree
- P(E) is the probability that the annotators agree by chance
Given the hypothesis that the annotators may have different category distributions, P(E) is given by:

    P(E) = Σ(i=1..M) Π(j=1..N) Fr(Aj, ci)

where Fr(Aj, ci) is the frequency with which annotator Aj chooses category ci (M = 3, N = 3).
For our purposes we have considered 3 annotators: C-ORAL-ROM (C), evaluator 1 (E1) and
evaluator 2 (E2), and we have defined two different Kappa coefficients:
• K1: 3 graders (C, E1, E2), 3 categories (O, T, N)
• K2: 3 graders (C, E1, E2), 2 categories (T, N)
For K1, we have considered as categories the three boundary tags: no break (O), non-terminal break
(N) and terminal break (T). P(A) has been calculated as the ratio between the number of total
agreements among the evaluators, and the number of word boundaries evaluated by both E1 and E2.
For K2 (a more realistic coefficient), we have restricted the set of categories to T and N, and
consequently the set of evaluated boundaries to those annotated with T or N by at least one
annotator. In this case, P(A) has been calculated as the ratio between the number of total agreements
on T or N (mt.9 + mt.10) and the sum of N and T breaks as resulting from C, plus the number of 1i
and 3i (N or T insertions) by E1 and E2, minus the sum of the agreements between evaluators on 1i
and 3i (mt.2+mt.3+mb1.9+mb1.13+mb2.9+mb2.13-mt.13-mt.17).
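A sketch of the K1 computation under these definitions (our own naming; toy data, with the frequencies Fr estimated from the observed annotations):

    from math import prod

    def kappa(columns, categories=("O", "N", "T")):
        """K1-style Kappa for parallel annotations: `columns` holds one
        tag sequence per annotator, all of equal length."""
        n = len(columns[0])
        # P(A): proportion of positions where all annotators agree
        p_a = sum(len(set(tags)) == 1 for tags in zip(*columns)) / n
        # P(E): chance agreement, summing over categories the product of
        # each annotator's frequency for that category
        p_e = sum(prod(col.count(c) / n for col in columns) for c in categories)
        return (p_a - p_e) / (1 - p_e)

    c  = ["O", "O", "N", "T", "O", "N"]   # toy original annotation C
    e1 = ["O", "O", "N", "T", "O", "O"]   # toy evaluator E1
    e2 = ["O", "N", "N", "T", "O", "N"]   # toy evaluator E2
    print(round(kappa([c, e1, e2]), 3))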
4 Results
The present chapter reports the results of the statistics performed on the evaluation data, according
to the specifications described in Chapter 3.
The results are given separately for each language evaluation sub-corpus and for its relevant
subsets, i.e. for its main bipartitions into Formal and Informal speech and for its two subsets of
Dialogues and Monologues. Note that the Dialogues vs. Monologues distinction does not cover the
entire sub-corpus, as the Media and Telephone subsets of the Formal speech section are not further
split into dialogues and monologues (see the corpus tree structure described in Chapter 2).
Paragraph 4.5 reports a summarizing table, where the main statistics on the different corpora can be
compared.
Paragraph 4.6 discusses the evaluation results in some detail.
4.1 General Data
The following tables report the total number of word boundaries in each evaluation sub-corpus,
compared with the number of word boundaries actually evaluated by each evaluator. In addition,
they report the number of T-breaks and N-breaks inserted by the original C-ORAL-ROM annotation
and by the two evaluators, respectively.
1- FRENCH CORPUS
Word Boundaries and Breaks
                      Orig. Annotation   Evaluator 1   Evaluator 2
Total W. Bound.       12893              12776         12831
Terminal Breaks       969                960           1064
  • Formal            456                454           495
  • Informal          513                506           569
  • Dialogues         570                563           601
  • Monologues        239                242           289
Non Terminal Br.      1462               1606          1355
  • Formal            630                684           580
  • Informal          832                922           775
  • Dialogues         587                658           550
  • Monologues        641                706           585
O Boundaries          10462              10210         10412

2- ITALIAN CORPUS
Word Boundaries and Breaks
                      Orig. Annotation   Evaluator 1   Evaluator 2
Total W. Bound.       10925              10900         10892
Terminal Breaks       1372               1359          1346
  • Formal            520                516           508
  • Informal          852                843           838
  • Dialogues         876                864           859
  • Monologues        269                271           266
Non Terminal Br.      2403               2436          2605
  • Formal            1171               1195          1253
  • Informal          1232               1241          1352
  • Dialogues         1136               1136          1224
  • Monologues        813                832           903
O Boundaries          7150               7105          6941

3- PORTUGUESE CORPUS
Word Boundaries and Breaks
                      Orig. Annotation   Evaluator 1   Evaluator 2
Total W. Bound.       12958              12933         12534
Terminal Breaks       1483               1484          1388
  • Formal            668                674           637
  • Informal          815                810           751
  • Dialogues         871                873           797
  • Monologues        309                305           302
Non Terminal Br.      2604               2647          2556
  • Formal            1169               1184          1152
  • Informal          1435               1463          1404
  • Dialogues         1185               1204          1148
  • Monologues        894                908           886
O Boundaries          8871               8802          8590

4- SPANISH CORPUS
Word Boundaries and Breaks
                      Orig. Annotation   Evaluator 1   Evaluator 2
Total W. Bound.       11512              11474         11512
Terminal Breaks       1107               1074          1112
  • Formal            398                384           402
  • Informal          709                690           710
  • Dialogues         747                727           748
  • Monologues        188                185           189
Non Terminal Br.      2463               2561          2432
  • Formal            1304               1368          1297
  • Informal          1159               1193          1135
  • Dialogues         1084               1137          1075
  • Monologues        847                866           822
O Boundaries          7942               7839          7968
4.2 Binary Comparison
The following tables report the statistics on the evaluation data by each single evaluator. For each
evaluator, the data are obtained via a binary comparison of his annotation with the original
C-ORAL-ROM annotation. All the figures are computed on the positions actually evaluated by the
given evaluator, i.e. excluding those marked with asterisks.
4.2.1 Binary Comparison: Terminal Break Confirmation
The following percentages, relative to the main subsets of each language sub-corpus, show to what
extent each evaluator confirmed the original Terminal Break tags. Given the number of original
T-breaks (among the positions not excluded by the evaluator), the first value is the percentage
(100*mb.8/mb.2) of specifically confirmed T-breaks, i.e. those evaluated as 0T, while the second one is
the percentage (100*(mb.8+mb.12)/mb.2) of non-deleted T-breaks, i.e. those evaluated as 0T or 2ts.
1- FRENCH CORPUS
Specific Confirmation (0T)          Evaluator 1   Evaluator 2
  • Formal                          94,74%        100%
  • Informal                        96,25%        100%
  • Dialogues                       96,43%        100%
  • Monologues                      92,75%        100%
Generic Confirmation (0T or 2ts)    Evaluator 1   Evaluator 2
  • Formal                          100%          100%
  • Informal                        100%          100%
  • Dialogues                       100%          100%
  • Monologues                      100%          100%

2- ITALIAN CORPUS
Specific Confirmation (0T)          Evaluator 1   Evaluator 2
  • Formal                          98,95%        95,57%
  • Informal                        98,98%        97,36%
  • Dialogues                       98,50%        97,74%
  • Monologues                      100%          97,52%
Generic Confirmation (0T or 2ts)    Evaluator 1   Evaluator 2
  • Formal                          100%          100%
  • Informal                        99,89%        100%
  • Dialogues                       99,89%        100%
  • Monologues                      100%          100%

3- PORTUGUESE CORPUS
Specific Confirmation (0T)          Evaluator 1   Evaluator 2
  • Formal                          98,54%        99,82%
  • Informal                        98,13%        99,36%
  • Dialogues                       98,73%        99,54%
  • Monologues                      97,51%        99,68%
Generic Confirmation (0T or 2ts)    Evaluator 1   Evaluator 2
  • Formal                          100%          100%
  • Informal                        100%          100%
  • Dialogues                       100%          100%
  • Monologues                      100%          100%

4- SPANISH CORPUS
Specific Confirmation (0T)          Evaluator 1   Evaluator 2
  • Formal                          90,33%        98,94%
  • Informal                        95,29%        100%
  • Dialogues                       95,85%        100%
  • Monologues                      95,34%        100%
Generic Confirmation (0T or 2ts)    Evaluator 1   Evaluator 2
  • Formal                          97,87%        98,94%
  • Informal                        100%          100%
  • Dialogues                       100%          100%
  • Monologues                      100%          100%
4.2.2 Binary Comparison: Terminal Missing
For each evaluator and each corpus section, the following figures give the percentage of evaluated
word boundaries where the evaluator inserted a Terminal Break (100*mb.13/mb.1), i.e. positions
evaluated as 3i.
1- FRENCH CORPUS
Terminal Missing (3i)    Evaluator 1   Evaluator 2
  • Formal               0,01%         0,02%
  • Informal             0%            0,03%
  • Dialogues            0%            0,01%
  • Monologues           0,01%         0,05%

2- ITALIAN CORPUS
Terminal Missing (3i)    Evaluator 1   Evaluator 2
  • Formal               0%            0%
  • Informal             0%            0%
  • Dialogues            0%            0%
  • Monologues           0%            0%

3- PORTUGUESE CORPUS
Terminal Missing (3i)    Evaluator 1   Evaluator 2
  • Formal               0%            0%
  • Informal             0,05%         0,06%
  • Dialogues            0,05%         0,04%
  • Monologues           0%            0,04%

4- SPANISH CORPUS
Terminal Missing (3i)    Evaluator 1   Evaluator 2
  • Formal               0%            0,02%
  • Informal             0%            0,02%
  • Dialogues            0%            0,02%
  • Monologues           0%            0%
4.2.3 Binary Comparison: Non Terminal Missing
For each evaluator and each corpus section, the following figures give the percentage of evaluated
word boundaries where the evaluator inserted a Non Terminal Break (100*mb.9/mb.1), i.e. positions
evaluated as 1i.
1- FRENCH CORPUS
Non Terminal Missing (1i)    Evaluator 1   Evaluator 2
  • Formal                   1,50%         0,42%
  • Informal                 1,25%         0,45%
  • Dialogues                1,60%         0,36%
  • Monologues               1,26%         0,62%

2- ITALIAN CORPUS
Non Terminal Missing (1i)    Evaluator 1   Evaluator 2
  • Formal                   1,10%         2,58%
  • Informal                 0,83%         2,50%
  • Dialogues                0,52%         2,20%
  • Monologues               1,47%         3,34%

3- PORTUGUESE CORPUS
Non Terminal Missing (1i)    Evaluator 1   Evaluator 2
  • Formal                   0,73%         0,25%
  • Informal                 0,44%         0,18%
  • Dialogues                0,54%         0,24%
  • Monologues               0,35%         0,05%

4- SPANISH CORPUS
Non Terminal Missing (1i)    Evaluator 1   Evaluator 2
  • Formal                   1,19%         0,96%
  • Informal                 0,46%         0,39%
  • Dialogues                0,71%         0,56%
  • Monologues               1,16%         0,74%
4.2.3’ Binary Comparison: Non Terminal Deletion
For each evaluator and each corpus section, the following figures give the percentage of Non Terminal
Breaks deleted by the evaluators (100*mb.10/mb.3), i.e. positions evaluated as 1d.
1- FRENCH CORPUS
Non Terminal Deletion (1d)    Evaluator 1   Evaluator 2
  • Formal                    0,48%         6,84%
  • Informal                  0,72%         4,93%
  • Dialogues                 0,34%         4,77%
  • Monologues                0,94%         7,18%

2- ITALIAN CORPUS
Non Terminal Deletion (1d)    Evaluator 1   Evaluator 2
  • Formal                    3,76%         4,87%
  • Informal                  3,57%         3,42%
  • Dialogues                 3,88%         4,15%
  • Monologues                3,20%         2,71%

3- PORTUGUESE CORPUS
Non Terminal Deletion (1d)    Evaluator 1   Evaluator 2
  • Formal                    0,60%         0,17%
  • Informal                  0,20%         0,50%
  • Dialogues                 0,50%         0,44%
  • Monologues                0,11%         0,45%

4- SPANISH CORPUS
Non Terminal Deletion (1d)    Evaluator 1   Evaluator 2
  • Formal                    0,84%         3,53%
  • Informal                  0,95%         3,80%
  • Dialogues                 0,83%         3,61%
  • Monologues                0,94%         4,25%
4.2.4 Binary Comparison: Added Terminal
For each evaluator and each corpus section, the following figures give the percentage of (evaluated)
original N-Breaks that were substituted with a T-Break, i.e. positions evaluated as 2ns
(100*mb.11/mb.3).
1- FRENCH CORPUS
Added Terminal (2ns)    Evaluator 1   Evaluator 2
  • Formal              4,72%         4,85%
  • Informal            1,29%         5,20%
  • Dialogues           3,05%         4,15%
  • Monologues          3,11%         6,51%

2- ITALIAN CORPUS
Added Terminal (2ns)    Evaluator 1   Evaluator 2
  • Formal              0,22%         0,10%
  • Informal            0,35%         0,04%
  • Dialogues           0,05%         0,29%
  • Monologues          0,17%         0,18%

3- PORTUGUESE CORPUS
Added Terminal (2ns)    Evaluator 1   Evaluator 2
  • Formal              0,93%         0%
  • Informal            0,33%         0,41%
  • Dialogues           0,69%         0,40%
  • Monologues          0,17%         0%

4- SPANISH CORPUS
Added Terminal (2ns)    Evaluator 1   Evaluator 2
  • Formal              1,68%         0,15%
  • Informal            1,45%         0%
  • Dialogues           1,35%         0%
  • Monologues          0,48%         0,15%
4.2.5 Binary Comparison: Activity Rate
A measure of the evaluators' intervention rate is given by the following figures, showing the
percentage of evaluated word boundaries that were actually modified by each evaluator, i.e. evaluated
as 1i, 1d, 2ns, 2ts, 3i, 3d (100*(mb.9+mb.10+mb.11+mb.12+mb.13+mb.14)/mb.1).
1- FRENCH CORPUS
Activity Rate      Evaluator 1   Evaluator 2
  • Total          2,22%         1,66%
  • Formal         2,61%         1,45%
  • Informal       1,73%         1,64%
  • Dialogues      2,39%         1,22%
  • Monologues     2,16%         2,35%

2- ITALIAN CORPUS
Activity Rate      Evaluator 1   Evaluator 2
  • Total          2,07%         4,17%
  • Formal         2,32%         4,51%
  • Informal       1,82%         3,81%
  • Dialogues      1,89%         3,85%
  • Monologues     2,27%         4,51%

3- PORTUGUESE CORPUS
Activity Rate      Evaluator 1   Evaluator 2
  • Total          1,01%         0,41%
  • Formal         1,35%         0,29%
  • Informal       0,88%         0,56%
  • Dialogues      1,08%         0,49%
  • Monologues     0,76%         0,33%

4- SPANISH CORPUS
Activity Rate      Evaluator 1   Evaluator 2
  • Total          1,96%         1,42%
  • Formal         2,57%         1,78%
  • Informal       1,66%         1,02%
  • Dialogues      1,79%         1,15%
  • Monologues     1,82%         1,73%
4.2.6 Binary Comparison: Misplacement Rate
The consecutive occurrence of a break insertion and a break deletion (or vice versa) may suggest that
the evaluator judged the original tag as merely misplaced. In this case his two actions should actually
count as a single "break move". The following figures show, for each evaluator and each break type,
the percentage of break insertions and deletions that correspond to break moves. The percentages for
N breaks and T breaks are obtained respectively by the formulas
100*2*mb.16/(mb.9+mb.10) and 100*2*mb.17/(mb.13+mb.14).
1- FRENCH CORPUS
Misplacement Rate        Evaluator 1   Evaluator 2
  Terminal Breaks        0%            0%
  Non Terminal Breaks    0,38%         0,29%

2- ITALIAN CORPUS
Misplacement Rate        Evaluator 1   Evaluator 2
  Terminal Breaks        0%            0%
  Non Terminal Breaks    5,06%         1,97%

3- PORTUGUESE CORPUS
Misplacement Rate        Evaluator 1   Evaluator 2
  Terminal Breaks        0%            0%
  Non Terminal Breaks    0%            0%

4- SPANISH CORPUS
Misplacement Rate        Evaluator 1   Evaluator 2
  Terminal Breaks        0%            0%
  Non Terminal Breaks    2,38%         1,28%
4.3 Ternary Comparison
4.3.1 Ternary Comparison: Strong Disagreement on prosodic breaks
The following figures show the percentage of cases where both evaluators disagreed with the original
annotation and agreed in their evaluation tag. All the figures are relative to word boundaries
evaluated by both evaluators.
1- Percentage of the original T-breaks deleted by both evaluators, i.e. evaluated as 3d by both
(100*mt.18/mt.2)
1- FRENCH CORPUS
Strong Disagreement on T-Breaks, terminal deletion 3d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

2- ITALIAN CORPUS
Strong Disagreement on T-Breaks, terminal deletion 3d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

3- PORTUGUESE CORPUS
Strong Disagreement on T-Breaks, terminal deletion 3d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

4- SPANISH CORPUS
Strong Disagreement on T-Breaks, terminal deletion 3d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%
2- Percentage of the original T-breaks substituted with an N-break by both evaluators, i.e. evaluated
as 2ts (100*mt.16/mt.2)
1- FRENCH CORPUS
Strong Disagreement on T-Breaks, T->N substitution 2ts
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

2- ITALIAN CORPUS
Strong Disagreement on T-Breaks, T->N substitution 2ts
                 Percentage
  • Formal       0,56%
  • Informal     0,25%
  • Dialogues    0,64%
  • Monologues   0%

3- PORTUGUESE CORPUS
Strong Disagreement on T-Breaks, T->N substitution 2ts
                 Percentage
  • Formal       0%
  • Informal     0,47%
  • Dialogues    0,30%
  • Monologues   0,32%

4- SPANISH CORPUS
Strong Disagreement on T-Breaks, T->N substitution 2ts
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%
3- Percentage of the original word boundaries where an N-break was substituted with a T-break by
both evaluators, i.e. positions evaluated as 2ns by both (100*mt.15/mt.1)
1- FRENCH CORPUS
Strong Disagreement on T-Breaks, N->T substitution 2ns
                 Percentage
  • Formal       0,18%
  • Informal     0,03%
  • Dialogues    0,08%
  • Monologues   0,17%

2- ITALIAN CORPUS
Strong Disagreement on T-Breaks, N->T substitution 2ns
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

3- PORTUGUESE CORPUS
Strong Disagreement on T-Breaks, N->T substitution 2ns
                 Percentage
  • Formal       0%
  • Informal     0,01%
  • Dialogues    0,01%
  • Monologues   0%

4- SPANISH CORPUS
Strong Disagreement on T-Breaks, N->T substitution 2ns
                 Percentage
  • Formal       0,02%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%
4- Percentage of the original word boundaries where both evaluators inserted a T-break, i.e.
positions evaluated as 3i by both (100*mt.17/mt.1)
1- FRENCH CORPUS
Strong Disagreement on T-Breaks, terminal insertion 3i
                 Percentage
  • Formal       0,01%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0,01%

2- ITALIAN CORPUS
Strong Disagreement on T-Breaks, terminal insertion 3i
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

3- PORTUGUESE CORPUS
Strong Disagreement on T-Breaks, terminal insertion 3i
                 Percentage
  • Formal       0%
  • Informal     0,04%
  • Dialogues    0,04%
  • Monologues   0%

4- SPANISH CORPUS
Strong Disagreement on T-Breaks, terminal insertion 3i
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%
5- Percentage of the original N-breaks deleted by both evaluators, i.e. evaluated as 1d by both
(100*mt.14/mt.3)
1- FRENCH CORPUS
Strong Disagreement on N-Breaks, Non Terminal deletion 1d
                 Percentage
  • Formal       0,25%
  • Informal     0,21%
  • Dialogues    0,19%
  • Monologues   0,42%

2- ITALIAN CORPUS
Strong Disagreement on N-Breaks, Non Terminal deletion 1d
                 Percentage
  • Formal       3,09%
  • Informal     1,11%
  • Dialogues    2,34%
  • Monologues   1,38%

3- PORTUGUESE CORPUS
Strong Disagreement on N-Breaks, Non Terminal deletion 1d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

4- SPANISH CORPUS
Strong Disagreement on N-Breaks, Non Terminal deletion 1d
                 Percentage
  • Formal       0,24%
  • Informal     0,33%
  • Dialogues    0,28%
  • Monologues   0,18%
6- Percentage of the original N-breaks substituted with a T-break by both evaluators, i.e. evaluated
as 2ns by both (100*mt.15/mt.3)
1- FRENCH CORPUS
Strong Disagreement on N-breaks, N->T substitution 2ns
                 Percentage
  • Formal       2,39%
  • Informal     0,27%
  • Dialogues    1,28%
  • Monologues   1,53%

2- ITALIAN CORPUS
Strong Disagreement on N-breaks, N->T substitution 2ns
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

3- PORTUGUESE CORPUS
Strong Disagreement on N-breaks, N->T substitution 2ns
                 Percentage
  • Formal       0%
  • Informal     0,05%
  • Dialogues    0,05%
  • Monologues   0%

4- SPANISH CORPUS
Strong Disagreement on N-breaks, N->T substitution 2ns
                 Percentage
  • Formal       0,08%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%
7- Percentage of the original word boundaries where a T-break was substituted with an N-break by
both evaluators, i.e. positions evaluated as 2ts by both (100*mt.16/mt.1)
1- FRENCH CORPUS
Strong Disagreement on N-Breaks, T->N substitution 2ts
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

2- ITALIAN CORPUS
Strong Disagreement on N-Breaks, T->N substitution 2ts
                 Percentage
  • Formal       0,10%
  • Informal     0,03%
  • Dialogues    0,12%
  • Monologues   0%

3- PORTUGUESE CORPUS
Strong Disagreement on N-Breaks, T->N substitution 2ts
                 Percentage
  • Formal       0%
  • Informal     0,09%
  • Dialogues    0,06%
  • Monologues   0,04%

4- SPANISH CORPUS
Strong Disagreement on N-Breaks, T->N substitution 2ts
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%
8- Percentage of the original word boundaries where both evaluators inserted an N-break, i.e.
positions evaluated as 1i by both (100*mt.13/mt.1)
1- FRENCH CORPUS
Strong Disagreement on N-breaks, Non Terminal insertion 1i
                 Percentage
  • Formal       0,08%
  • Informal     0,10%
  • Dialogues    0,05%
  • Monologues   0,23%

2- ITALIAN CORPUS
Strong Disagreement on N-breaks, Non Terminal insertion 1i
                 Percentage
  • Formal       0,74%
  • Informal     0,66%
  • Dialogues    0,36%
  • Monologues   1,19%

3- PORTUGUESE CORPUS
Strong Disagreement on N-breaks, Non Terminal insertion 1i
                 Percentage
  • Formal       0,10%
  • Informal     0,10%
  • Dialogues    0,15%
  • Monologues   0%

4- SPANISH CORPUS
Strong Disagreement on N-breaks, Non Terminal insertion 1i
                 Percentage
  • Formal       0,38%
  • Informal     0,09%
  • Dialogues    0,19%
  • Monologues   0,35%
4.3.2 Ternary Comparison: Partial Consensus
The following figures concern the cases of disagreement between the evaluators, where the original
annotation was confirmed by one evaluator but modified by the other. Here again, the set of
relevant positions includes only those evaluated by both evaluators.
1- Percentage of the original T-breaks confirmed by one evaluator and modified (deleted or
substituted) by the other (100*mt.21/mt.2).
1- FRENCH CORPUS
Partial Consensus on T-breaks, 0T vs. 3d or 2ts
                 Percentage
  • Total        4,36%
  • Formal       5,26%
  • Informal     3,75%
  • Dialogues    3,57%
  • Monologues   7,25%

2- ITALIAN CORPUS
Partial Consensus on T-breaks, 0T vs. 3d or 2ts
                 Percentage
  • Total        2,42%
  • Formal       2,89%
  • Informal     2,96%
  • Dialogues    2,28%
  • Monologues   2,48%

3- PORTUGUESE CORPUS
Partial Consensus on T-breaks, 0T vs. 3d or 2ts
                 Percentage
  • Total        1,51%
  • Formal       1,45%
  • Informal     1,56%
  • Dialogues    1,13%
  • Monologues   2,20%

4- SPANISH CORPUS
Partial Consensus on T-breaks, 0T vs. 3d or 2ts
                 Percentage
  • Total        5,16%
  • Formal       7,54%
  • Informal     4,71%
  • Dialogues    4,15%
  • Monologues   4,66%
2- Percentage of the original T-breaks confirmed by one evaluator and deleted by the other
(100*mt.27/mt.2).
1- FRENCH CORPUS
Partial Consensus on T-breaks, 0T vs. 3d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

2- ITALIAN CORPUS
Partial Consensus on T-breaks, 0T vs. 3d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

3- PORTUGUESE CORPUS
Partial Consensus on T-breaks, 0T vs. 3d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%

4- SPANISH CORPUS
Partial Consensus on T-breaks, 0T vs. 3d
                 Percentage
  • Formal       0%
  • Informal     0%
  • Dialogues    0%
  • Monologues   0%
3- Percentage of word boundaries confirmed by one evaluator and modified (1d, 1i, 2ns, 2ts, 3d, 3i)
by the other (100*(mt.19+mt.20+mt.21)/mt.1).
1- FRENCH CORPUS
Partial Consensus on word boundaries
                 Percentage
  • Total        3,18%
  • Formal       3,46%
  • Informal     3,01%
  • Dialogues    3,28%
  • Monologues   3,54%

2- ITALIAN CORPUS
Partial Consensus on word boundaries
                 Percentage
  • Total        3,65%
  • Formal       3,82%
  • Informal     3,64%
  • Dialogues    3,61%
  • Monologues   3,68%

3- PORTUGUESE CORPUS
Partial Consensus on word boundaries
                 Percentage
  • Total        0,93%
  • Formal       1,44%
  • Informal     0,83%
  • Dialogues    0,97%
  • Monologues   0,96%

4- SPANISH CORPUS
Partial Consensus on word boundaries
                 Percentage
  • Total        2,56%
  • Formal       3,26%
  • Informal     2,36%
  • Dialogues    2,44%
  • Monologues   2,75%
4.3.3 Ternary Comparison: Total Agreement
The following figures concern the cases where both evaluators confirmed the original annotation. Here
again, the set of relevant positions includes only those evaluated by both evaluators.
1- Percentage of the original T-breaks confirmed by both evaluators, i.e. evaluated as 0T
(100*mt.9/mt.2)
1- FRENCH CORPUS
Total Agreement on T-breaks, 0T
                 Percentage
  • Total        95,63%
  • Formal       94,74%
  • Informal     96,25%
  • Dialogues    96,43%
  • Monologues   92,75%

2- ITALIAN CORPUS
Total Agreement on T-breaks, 0T
                 Percentage
  • Total        97,14%
  • Formal       96,07%
  • Informal     96,68%
  • Dialogues    96,96%
  • Monologues   97,52%

3- PORTUGUESE CORPUS
Total Agreement on T-breaks, 0T
                 Percentage
  • Total        98,12%
  • Formal       98,55%
  • Informal     97,97%
  • Dialogues    98,58%
  • Monologues   97,47%

4- SPANISH CORPUS
Total Agreement on T-breaks, 0T
                 Percentage
  • Total        94,84%
  • Formal       93,70%
  • Informal     95,29%
  • Dialogues    95,85%
  • Monologues   95,34%
2- Percentage of the original N-breaks confirmed by both evaluators, i.e. evaluated as 0N
(100*mt.10/mt.3)
1- FRENCH CORPUS
Total Agreement on N-breaks, 0N
                 Percentage
  • Total        86,56%
  • Formal       84,98%
  • Informal     87,75%
  • Dialogues    75,01%
  • Monologues   84,61%

2- ITALIAN CORPUS
Total Agreement on N-breaks, 0N
                 Percentage
  • Total        93,15%
  • Formal       90,50%
  • Informal     93,48%
  • Dialogues    92,93%
  • Monologues   94,71%

3- PORTUGUESE CORPUS
Total Agreement on N-breaks, 0N
                 Percentage
  • Total        98,38%
  • Formal       97,63%
  • Informal     97,30%
  • Dialogues    98,06%
  • Monologues   98,95%

4- SPANISH CORPUS
Total Agreement on N-breaks, 0N
                 Percentage
  • Total        94,62%
  • Formal       94,69%
  • Informal     91,78%
  • Dialogues    94,71%
  • Monologues   94,81%
3- Percentage of the original O-boundaries confirmed by both evaluators, i.e. evaluated as 0O
(100*mt.8/(mt.1-mt.2-mt.3))
1- FRENCH CORPUS
Total Agreement on O Boundaries, 0O
                 Percentage
  • Total        97,54%
  • Formal       97,18%
  • Informal     97,89%
  • Dialogues    97,13%
  • Monologues   97,97%

2- ITALIAN CORPUS
Total Agreement on O Boundaries, 0O
                 Percentage
  • Total        95,1%
  • Formal       94,39%
  • Informal     96,02%
  • Dialogues    95,10%
  • Monologues   94,92%

3- PORTUGUESE CORPUS
Total Agreement on O Boundaries, 0O
                 Percentage
  • Total        99,22%
  • Formal       98,78%
  • Informal     99,25%
  • Dialogues    99,10%
  • Monologues   99,41%

4- SPANISH CORPUS
Total Agreement on O Boundaries, 0O
                 Percentage
  • Total        98,28%
  • Formal       96,52%
  • Informal     98,88%
  • Dialogues    98,43%
  • Monologues   97,81%
4- Percentage of the evaluated word boundaries where the original annotation was confirmed by
both evaluators, i.e. evaluated as 0T, 0N or 0O (100*(mt.8+mt.9+mt.10)/mt.1)
1- FRENCH CORPUS
Total Agreement Rate
                 Percentage
  • Total        96,48%
  • Formal       96,23%
  • Informal     96,81%
  • Dialogues    96,56%
  • Monologues   95,97%

2- ITALIAN CORPUS
Total Agreement Rate
                 Percentage
  • Total        95,21%
  • Formal       94,67%
  • Informal     95,38%
  • Dialogues    95,33%
  • Monologues   94,79%

3- PORTUGUESE CORPUS
Total Agreement Rate
                 Percentage
  • Total        98,93%
  • Formal       98,46%
  • Informal     98,89%
  • Dialogues    98,75%
  • Monologues   98,96%

4- SPANISH CORPUS
Total Agreement Rate
                 Percentage
  • Total        97,17%
  • Formal       95,19%
  • Informal     97,49%
  • Dialogues    97,31%
  • Monologues   96,85%
4.3.4 Ternary Comparison: Global Disagreement on prosodic breaks
As a measure complementary to the Total Agreement Rate, the following Global Disagreement Rate is
calculated as the percentage of evaluated word boundaries disconfirmed by at least one evaluator:
100*(mt.1-(mt.8+mt.9+mt.10))/mt.1
or equivalently
100*(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18+mt.19+mt.20+mt.21)/mt.1
1- FRENCH CORPUS
Global Disagreement Rate
                 Percentage
  • Total        3,52%
  • Formal       3,77%
  • Informal     3,17%
  • Dialogues    3,44%
  • Monologues   4,02%

2- ITALIAN CORPUS
Global Disagreement Rate
                 Percentage
  • Total        4,78%
  • Formal       5,33%
  • Informal     4,60%
  • Dialogues    4,65%
  • Monologues   5,21%

3- PORTUGUESE CORPUS
Global Disagreement Rate
                 Percentage
  • Total        1,07%
  • Formal       1,54%
  • Informal     1,08%
  • Dialogues    1,23%
  • Monologues   1%

4- SPANISH CORPUS
Global Disagreement Rate
                 Percentage
  • Total        2,83%
  • Formal       3,3%
  • Informal     2,39%
  • Dialogues    2,63%
  • Monologues   2,66%
4.3.5 Ternary Comparison: Consensus in the Disagreement
The following figures show the percentage of globally disagreed boundaries (i.e. positions where at
least one evaluator disagreed with the original annotation) that were actually cases of strong
disagreement (i.e. cases where both evaluators disconfirmed the original annotation). The figures
were obtained by the following formula:
100*(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18)/(mt.13+mt.14+mt.15+mt.16+mt.17+mt.18+mt.19+mt.20+mt.21)
1- FRENCH CORPUS
Consensus in the Disagreement
                 Percentage
  • Total        8,97%
  • Formal       10,63%
  • Informal     7,53%
  • Dialogues    5,55%
  • Monologues   13,82%

2- ITALIAN CORPUS
Consensus in the Disagreement
                 Percentage
  • Total        23,50%
  • Formal       27,27%
  • Informal     19,92%
  • Dialogues    18,75%
  • Monologues   28,57%

3- PORTUGUESE CORPUS
Consensus in the Disagreement
                 Percentage
  • Total        12,12%
  • Formal       2,98%
  • Informal     21,54%
  • Dialogues    21,52%
  • Monologues   3,22%

4- SPANISH CORPUS
Consensus in the Disagreement
                 Percentage
  • Total        9,26%
  • Formal       10,87%
  • Informal     7,14%
  • Dialogues    9,74%
  • Monologues   6,82%
4.4 Kappa coefficients
The following table shows the two Kappa coefficients, as defined in Paragraph 3.5.2. The two
coefficients were calculated for each leaf of the corpus tree, and then averaged for the different tree
nodes.
1- FRENCH CORPUS
Kappa coefficients (General and Realistic)
                         Kappa General   Kappa Realistic
MEDIA node               0,953           0,853
NAT. CONTEXT node        0,923           0,760
TELEPHONE node           0,925           0,773
FORMAL node (total)      0,989           0,765
FAMILY PRIVATE node      0,933           0,775
PUBLIC node              0,924           0,765
INFORMAL node (total)    0,924           0,767
DIALOGUES                0,973           0,790
MONOLOGUES               0,907           0,675
TOTAL                    0,952           0,766

2- ITALIAN CORPUS
Kappa coefficients (General and Realistic)
                         Kappa General   Kappa Realistic
MEDIA node               0,917           0,768
NAT. CONTEXT node        0,927           0,786
TELEPHONE node           0,921           0,823
FORMAL node (total)      0,921           0,785
FAMILY PRIVATE node      0,934           0,828
PUBLIC node              0,938           0,823
INFORMAL node (total)    0,935           0,826
DIALOGUES                0,936           0,839
MONOLOGUES               0,922           0,779
TOTAL                    0,928           0,807

3- PORTUGUESE CORPUS
Kappa coefficients (General and Realistic)
                         Kappa General   Kappa Realistic
MEDIA node               0,969           0,890
NAT. CONTEXT node        0,979           0,893
TELEPHONE node           0,982           0,901
FORMAL node (total)      0,975           0,893
FAMILY PRIVATE node      0,985           0,950
PUBLIC node              0,978           0,931
INFORMAL node (total)    0,984           0,946
DIALOGUES                0,981           0,921
MONOLOGUES               0,985           0,944
TOTAL                    0,980           0,920

4- SPANISH CORPUS
Kappa coefficients (General and Realistic)
                         Kappa General   Kappa Realistic
MEDIA node               0,918           0,737
NAT. CONTEXT node        0,949           0,807
TELEPHONE node           0,945           0,865
FORMAL node (total)      0,930           0,772
FAMILY PRIVATE node      0,960           0,883
PUBLIC node              0,968           0,890
INFORMAL node (total)    0,962           0,885
DIALOGUES                0,959           0,880
MONOLOGUES               0,952           0,818
TOTAL                    0,946           0,827
4.5 Summarizing Table
Boundaries and Breaks                        French        Italian       Portuguese    Spanish
Total w. boundaries                          12893         10925         12958         11512
Evaluated w. boundaries
  Evaluator 1                                12776         10900         12933         11474
  Evaluator 2                                12831         10892         12534         11512
Not Evaluated Positions
  Evaluator 1                                117 (0,9 %)   25 (0,2 %)    25 (0,2 %)    38 (0,3 %)
  Evaluator 2                                62 (0,5 %)    33 (0,3 %)    424 (3,2 %)   0

Binary Comparisons                           French        Italian       Portuguese    Spanish
Specific Confirmation of T
  Evaluator 1                                96,12%        98,8%         98,7%         94,1%
  Evaluator 2                                100%          97,12%        99,4%         99%
Generic Confirmation of T
  Evaluator 1                                100%          99,9%         100%          99,8%
  Evaluator 2                                100%          100%          100%          99,7%
Terminal Missing
  Evaluator 1                                0,01%         0%            0,03%         0%
  Evaluator 2                                0,02%         0%            0,04%         0,02%
Non Terminal Missing
  Evaluator 1                                1,59%         1,05%         0,62%         1%
  Evaluator 2                                0,46%         2,75%         0,2%          0,7%
Added Terminal N->T
  Evaluator 1                                2,95%         0,2%          0,5%          1,57%
  Evaluator 2                                5,01%         0,16%         0,4%          0,12%
Activity Rate
  Evaluator 1                                2,22%         2,07%         1,01%         1,96%
  Evaluator 2                                1,66%         4,17%         0,41%         1,42%
Misplacement Rate
  Evaluator 1                                0,19%         5%            0%            1,19%
  Evaluator 2                                0,14%         0,98%         0%            0,65%

Ternary Comparisons                          French        Italian       Portuguese    Spanish
Strong Disagreement 3d on T                  0%            0%            0%            0%
Strong Disagreement 2ts on T                 0%            0,48%         0,35%         0%
Strong Disagreement 2ns on W                 0,1%          0%            0,01%         0,01%
Strong Disagreement 3i on W                  0,01%         0%            0,03%         0%
Strong Disagreement 1d on N                  0,3%          1,87%         0%            0,25%
Strong Disagreement 2ns on N                 1,04%         0%            0,03%         0,05%
Strong Disagreement 2ts on W                 0%            0,08%         0,06%         0%
Strong Disagreement 1i on W                  0,12%         0,75%         0,10%         0,25%
Partial Consensus on T (0T vs. 3d or 2ts)    4,36%         2,42%         1,51%         5,16%
Partial Consensus on T (0T vs. 3d)           0%            0%            0%            0%
Partial Consensus on W                       3,18%         3,65%         0,93%         2,56%
Total Agreement on T                         95,05%        97,14%        98,12%        94,84%
Total Agreement on N                         86,56%        93,15%        98,38%        94,62%
Total Agreement on O                         97,54%        95,1%         99,22%        98,28%
Global Disagreement                          3,52%         4,78%         1,07%         2,83%
Consensus in Disagreement                    8,97%         23,50%        12,12%        9,26%

K Index (General)                            0,952         0,928         0,980         0,946
K Index (Realistic)                          0,776         0,807         0,920         0,827
4.6 Discussion of results
The four language sub-corpora selected for the evaluation have a size ranging from 10925 (Italian)
to 12958 (Portuguese) word boundaries. The number of T-breaks ranges from 969 (French) to 1483
(Portuguese), while N-breaks range from 1462 (French) to 2604 (Portuguese). The percentage of
word boundaries that received a break tag in the C-ORAL-ROM annotation is the following in the
four corpora:

                                 French   Italian   Portuguese   Spanish
Word boundaries marked with      19%      35%       32%          31%
a prosodic break
The option of excluding text portions from evaluation, in case of doubts or unintelligible speech,
was seldom applied. The following table gives the percentage of word boundaries excluded by each
evaluator.
Not Evaluated Positions   French        Italian      Portuguese    Spanish
  Evaluator 1             117 (0,9 %)   25 (0,2 %)   25 (0,2 %)    38 (0,3 %)
  Evaluator 2             62 (0,5 %)    33 (0,3 %)   424 (3,2 %)   0
Looking at the Binary Comparison statistics on the evaluation data, it is apparent that the evaluators
confirmed virtually all Terminal Breaks in the C-ORAL-ROM annotation. The percentage of
T-breaks that were not deleted by the evaluators is 100% (with the single exception of the Formal
section of the Spanish corpus, where it is around 98%). This means that where the original
annotator perceived a terminal break, the evaluators also perceived a break, at least a non-terminal
one (Generic Confirmation). The percentages of Specific Confirmations, where the evaluators
confirmed that the break was indeed a T-break, are mostly above 95%.
On the other hand, the evaluators seldom perceived T-breaks where the original annotator did not
perceive any kind of break, as shown by the Terminal Missing percentages, which are close to 0%.
In some cases they perceived a stronger break where the annotator marked a non-terminal break.
This is true particularly for the French corpus, where the percentages of Added Terminals, i.e.
N-breaks substituted with T-breaks, range from 1,29% to 6,51% of the original N-breaks. For the
other languages, the percentages are mostly below 1%.
The low percentages of break insertions and deletions could be further discounted by taking into
account that in some cases a pair <deletion, insertion> may count as a single misplacement. The
misplacement rate is indeed zero for T-breaks, but it can reach an average 5% for N-breaks (for an
Italian evaluator, who reaches 9,9% in the Formal-Natural Context section).
The comparison of the Activity Rates of the different evaluators (percentage of evaluated word
boundaries that were actually modified) shows the lowest values for Portuguese, ranging from 0,5%
to 1%, and the highest for Italian, where the rate for Evaluator 2 is above 4%. The French and
Spanish evaluators show rates around 2%.
Coming to Ternary Comparisons, which give a measure of the inter-annotator agreement and of the
reliability of the C-ORAL-ROM prosodic tagging, we see that the original annotation was basically
confirmed, especially for terminal breaks.
The percentages of T-breaks specifically confirmed by both evaluators are above 94% for all
languages. In general, such agreement percentages are slightly lower in Formal than in Informal
speech, and in Monologues than in Dialogues.
The agreement on N-breaks is, as expected, lower, with larger differences among the corpora. The
values range from 75,01% in the Dialogues of French to 98,95% in the Monologues of Portuguese.
                        French    Italian   Portuguese   Spanish
Total Agreement on T    95,05%    97,14%    98,12%       94,84%
Total Agreement on N    86,56%    93,15%    98,38%       94,62%
Total Agreement on O    97,54%    95,1%     99,22%       98,28%
Taking as reference the whole set of evaluated word boundaries, the most general measure of
agreement is the Total Agreement Rate (percentage of boundaries agreed by both evaluators),
together with the complementary Global Disagreement Rate (percentage of boundaries
disconfirmed by at least one evaluator). The highest consensus is expressed on the Portuguese
corpus, but the values are very close for all languages, ranging from 95% to 98.9%.
                            French   Italian   Portuguese   Spanish
Global Disagreement Rate    3,52%    4,78%     1,07%        2,83%
Total Agreement Rate        96,48%   95,21%    98,93%       97,17%
As discussed in Chapter 3, the percentage of totally agreed word boundaries may sound too
optimistic as a general measure of agreement, due to the disproportion between word boundaries
and actual candidates for a break (around 30% of the total, as reported above). The following table
compares the total agreement percentages with a "baseline" that may be considered the worst
possible realistic result, obtained in case all N's and T's were deleted and a comparable number of
N's and T's were inserted in different positions.
                                    French    Italian   Portuguese   Spanish
Worst possible Result (baseline)    62,17%    30,84%    37,28%       37,93%
Total Agreement Rate                96,48%    95,21%    98,93%       97,17%
The total agreement percentages are thus significantly higher than the baseline.
In a finer analysis of the non-totally-agreed positions measured by the Global Disagreement Rate,
i.e. the positions where at least one evaluator expressed his dissent, we note that all disagreements
involve non-terminal breaks. In fact, as already noticed in the binary comparisons and shown in the
Strong Disagreement points 1 and 4, the evaluators neither deleted T-breaks nor inserted them in
empty positions.
As for the Strong Disagreement cases on N-breaks where both evaluators substituted an N-break
with a T-break (par. 4.3.1, point 6), the percentages are close to 0% except in the case of French,
where they are around 2%. More detailed data for French are reported in the following table.
Strong Disagreement Non Terminal – Extended Table 6
                   Percentage
  • Media          0%
  • Nat. Context   3,56%
  • Telephone      3%
  • Fam. Private   0,3%
  • Public         0%
  • Formal         2,39%
  • Informal       0,27%
  • Dialogues      1,28%
  • Monologues     1,53%
For the Strong Disagreement cases where both evaluators deleted an N-break (par. 4.3.1, point 5),
percentages are below 1% except in the case of Italian, for which we report a more detailed table.
Strong Disagreement Non Terminal – Extended Table 5
                   Percentage
  • Media          2,12%
  • Nat. Context   2,4%
  • Telephone      6,53%
  • Fam. Private   0%
  • Public         1,43%
  • Formal         3,09%
  • Informal       1,11%
  • Dialogues      2,34%
  • Monologues     1,38%
Strong disagreement is a very marginal phenomenon, as shown in the following table comparing the
Global Disagreement Rate (cases where at least one evaluator disconfirmed the original annotation)
with the Partial Consensus Rate (cases where only one evaluator disconfirmed) and obtaining the
Strong Disagreement Rate by difference. The percentages of globally disagreed positions that were
actually strongly disagreed range from about 9% to 23.5%, as shown in the "Consensus in the
Disagreement" table in Paragraph 4.3.5.
                             French   Italian   Portuguese   Spanish
Global Disagreement Rate     3,52%    4,78%     1,07%        2,83%
Partial Consensus Rate       3,18%    3,65%     0,93%        2,56%
Strong Disagreement Rate     0,34%    1,13%     0,14%        0,27%
Finally, the Kappa values are also quite positive. Kappa coefficients measure the reliability of the
annotation scheme, that is, the probability of obtaining the same annotation from different evaluators
on the same corpus. Kappa values range from 0 to 1, where Kappa = 1 indicates total reliability of
the annotation scheme. Such an ideal result is quite unrealistic, due to the intrinsically subjective
nature of corpus annotation, so researchers generally agree in considering any Kappa above 0.6 a
positive result.
As discussed in Paragraph 3.5.2, we have calculated two Kappa coefficients: a general one,
comparing the behavior of our three subjects, the original annotator and the two evaluators, with
respect to the three categories of boundaries (T, N, O); and a more realistic coefficient, limiting the
analysis to the two break tags T and N, in order to avoid the positive effect of the high agreement
rate on no-break boundaries. Kappa coefficients were calculated on each evaluated
dialogue/monologue considered as a single experiment. The following table gives the average
Kappa values on the four language sub-corpora.
                       French   Italian   Portuguese   Spanish
K Index (General)      0,952    0,928     0,980        0,946
K Index (Realistic)    0,776    0,807     0,920        0,827
Both coefficients are largely above the 0.6 threshold, confirming the reliability of the C-ORAL-ROM annotation scheme.
5 Conclusions
This document reports the data obtained by evaluating the C-Oral-Rom prosodic tagging in a
controlled experimental setting. As we observed in the introductory remarks, the main goal of the
evaluation was to assess the general replicability of the coding scheme adopted in the annotation of
the four corpora. The data reported here warrant the expectation of a good level of replicability for
the coding scheme, and this implicitly supports the hypothesis that the annotation of utterances
identified in terms of their prosodic profiles captures a relevant perceptual fact. However, we refer
the reader to the C-Oral-Rom Advisors’ document for the interpretation of the data presented in
this document.