State of the art of the C-ORAL-ROM sampling strategies

Report on the Decision on the C-ORAL-ROM sampling criteria (From Kick-off meeting and Paris Meeting)

Sampling criteria must be explicit in order to allow validation of corpora (ELDA). The correctness of sampling
operations by each principal contractor will be verified at the next PCC (June).

In accordance with the contract, the sampling of each corpus must ensure:
50% Informal (150.000 words)
50% Formal, including media, telephone conversations and man/machine interactions (150.000 words)

Differences among speakers are always marked as regards:
1. age (A: 18-25; B: 25-40; C: 40-50; D: >60),
2. sex (M-F),
3. education (1: illiterate and/or elementary school; 2: secondary school - high school; 3: B.A. – university);
4. geographical origin.
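The four speaker parameters above lend themselves to a small metadata record. The following Python sketch is only one possible encoding: the class and field names are hypothetical, the code labels are copied from the report without interpreting the age boundaries, and the example speaker is made up.

```python
from dataclasses import dataclass

# Codes as listed in the report; the labels are copied, not interpreted.
AGE_CLASSES = {"A": "18-25", "B": "25-40", "C": "40-50", "D": ">60"}
SEXES = ("M", "F")
EDUCATION_LEVELS = {
    1: "illiterate and/or elementary school",
    2: "secondary school - high school",
    3: "B.A. - university",
}


@dataclass(frozen=True)
class Speaker:
    """Hypothetical metadata record for one speaker."""
    age_class: str   # one of AGE_CLASSES
    sex: str         # one of SEXES
    education: int   # one of EDUCATION_LEVELS
    origin: str      # free-text geographical origin

    def __post_init__(self):
        # Reject codes that are not part of the scheme listed above.
        if self.age_class not in AGE_CLASSES:
            raise ValueError(f"unknown age class: {self.age_class!r}")
        if self.sex not in SEXES:
            raise ValueError(f"unknown sex code: {self.sex!r}")
        if self.education not in EDUCATION_LEVELS:
            raise ValueError(f"unknown education level: {self.education!r}")
        if not self.origin.strip():
            raise ValueError("geographical origin must not be empty")


# Made-up example speaker, for illustration only.
example = Speaker(age_class="B", sex="F", education=3, origin="Trento")
print(example)
```

Validating the codes at construction time keeps the marking explicit, which is in the spirit of making the sampling criteria checkable.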
The PCC has decided to apply two different sampling strategies in order to represent spoken language
correctly in both formal and informal domains. More specifically, the sampling of informal speech will
follow variation parameters regarding both the structure of the communicative event and its main sociological context,
while the sampling of formal speech will follow genre distinctions. The decisions run as follows:
Informal
112.500 words (3/4 of total) in private & family contexts
37.500 words (1/4 of total) in public contexts

Monologues: 42.000 words
Dialogues/Conversations: 108.000 words

Long texts: 10 texts of around 4.500 words each (c. 30 minutes each; total around 45.000 words)
Short texts: 65 texts of around 1.500 words each (c. 10 minutes each; total around 97.500 words)
Very short texts: around 7.500 words in all, from 2 to 5 minutes each, taken from dialogues or conversations in public contexts
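The kick-off figures above can be cross-checked with a few lines of arithmetic. The following Python sketch is only an illustration: the variable and function names are hypothetical, and the dotted thousands used in the report (e.g. 150.000) are written as plain integers.

```python
# Kick-off model for the informal half of each corpus (figures from the report).
INFORMAL_TOTAL = 150_000

# Split by sociological context.
by_context = {"private_family": 112_500, "public": 37_500}

# Split by structure of the communicative event.
by_structure = {"monologues": 42_000, "dialogues_conversations": 108_000}

# Split by text length: number of texts times approximate words per text.
by_length = {
    "long": 10 * 4_500,     # 10 long texts
    "short": 65 * 1_500,    # 65 short texts
    "very_short": 7_500,    # very short public exchanges, total words
}


def sums_to_total(parts, total=INFORMAL_TOTAL):
    """Return True when the parts of one breakdown add up to the total."""
    return sum(parts.values()) == total


for name, parts in [
    ("context", by_context),
    ("structure", by_structure),
    ("length", by_length),
]:
    print(name, sum(parts.values()), sums_to_total(parts))
```

Each of the three breakdowns adds up to the same 150.000 words, so they describe the same informal half from three different angles.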
The advisor from IPO noted that the kick-off decision was insufficient because it did not map text length onto
language variation for informal speech (the Dutch corpus specifies this mapping). The following is the model proposed at the Paris
meeting, integrated with the type/length relation:
Words per context and type:
Private / Family: 112.000
    Monologues: 32.000
    Dialogues/Conversations: 75.000 *
Public: 37.000
    Monologues: 10.000
    Dialogues/Conversations: 27.000
Words per text length, with text type and number of texts in brackets:
    9000 L (2); 18000 L (4); 9000 L (2); 23000 S (15); 57000 S (38); 7500 S (5); 9000 L (2); 10500 S (7); 7500 M
* At least 23.000 words from conversations with more than two participants
Text length in words (5% variation allowed)
Long (L): 4.500 average **
Short (S): 1.500 average **
Collections of very short dialogues (M):
** In order to facilitate the reuse of previous corpora, the Paris meeting decided on the following addition to the kick-off
model: up to 20% of the part of the corpus devoted to short texts (1.500 words) can be built using, instead of single short texts,
pairs of texts of the same type of just 750 words each or, conversely, by splitting a 3.000-word text into two files
of 1.500 words each.
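As a rough illustration of the length rules above, the sketch below checks a text against its target length and checks how much of the short-text quota is covered by 750-word pairs. It assumes that the 5% variation applies to each individual text and that the 20% ceiling is measured in words against the short-text quota; the report does not spell out either point, and all names are hypothetical.

```python
# Average target lengths in words, as given in the report.
TARGET_WORDS = {"L": 4_500, "S": 1_500}
HALF_SHORT = 750          # one member of a pair replacing a short text
TOLERANCE_PERCENT = 5     # "5% variation allowed"
PAIR_CEILING_PERCENT = 20 # "up to 20%" of the short-text quota


def within_tolerance(words, text_type):
    """True if `words` is within 5% of the target for type 'L' or 'S'."""
    target = TARGET_WORDS[text_type]
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return abs(words - target) * 100 <= TOLERANCE_PERCENT * target


def pairs_allowed(n_pairs, short_quota_words):
    """True if `n_pairs` pairs of 750-word texts stay within the ceiling."""
    paired_words = n_pairs * 2 * HALF_SHORT
    return paired_words * 100 <= PAIR_CEILING_PERCENT * short_quota_words


print(within_tolerance(1_560, "S"))   # 60 words over a 1500-word target
print(within_tolerance(4_800, "L"))   # 300 words over a 4500-word target
print(pairs_allowed(13, 97_500))      # 13 pairs against the kick-off quota
print(pairs_allowed(14, 97_500))      # 14 pairs against the same quota
```

Under these assumptions, a 1.560-word short text passes and a 4.800-word long text does not, while 13 pairs (19.500 words) sit exactly at the 20% ceiling of a 97.500-word quota and 14 pairs exceed it.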
Formal
The formal C-ORAL-ROM corpus is divided into four sections:
a) Formal speech in the natural context
b) Formal speech in Media
c) Telephone Private conversation
d) Phone call services and/or man-machine interaction
Representing the various uses of spoken language found in the media is not an objective of C-ORAL-ROM. Media
speech will be considered as part of formal speech, i.e. it represents formal speech in Media contexts.
The proportion between Media and Formal speech in the natural context was proposed at the kick-off meeting, and
it is reported here as a guideline. However, given that legal problems may arise when recording speech in
some formal contexts, a sample of that type may be taken from the media (for example, political debate
broadcast through the media may be used). As a consequence, the proportion of media speech may vary in each corpus (Paris
meeting).
Formal speech in the natural context (65.000 words):
• Political speech
• Political debate
• Preaching
• Teaching
• Professional explanation
• Conference
• Business
• Law (materials from mock situations in traineeships for lawyers, as well as from law schools, may be used
in order to avoid problems due to the special protection of juridical speech)
Texts of around 3.000 words each (2 or 3 texts for each field). Texts are marked with respect to the distinction between
monologue, dialogue and conversation, but no proportion is defined.
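A quick way to see that the per-field guideline is compatible with the 65.000-word target is to compute the range it implies. The sketch below reads the guideline as two or three texts per field, which is an interpretation rather than something the report states outright; the names are hypothetical.

```python
# The eight fields of formal speech in the natural context, from the report.
FIELDS = [
    "Political speech", "Political debate", "Preaching", "Teaching",
    "Professional explanation", "Conference", "Business", "Law",
]
WORDS_PER_TEXT = 3_000   # "around 3.000 words each"
TARGET_WORDS = 65_000    # size of this section of the formal corpus


def implied_range(n_fields, min_texts=2, max_texts=3, words=WORDS_PER_TEXT):
    """Return the (lowest, highest) word totals implied by the guideline."""
    return n_fields * min_texts * words, n_fields * max_texts * words


low, high = implied_range(len(FIELDS))
print(low, high, low <= TARGET_WORDS <= high)
```

With eight fields the guideline yields between 48.000 and 72.000 words, so the 65.000-word target is reachable by mixing fields with two and three texts.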
Formal speech in Media (60.000 words)
• News, only one sample (1.500 words)
• Weather forecast, just a small sample (500 words)
• Sport
• Interviews
• Talk shows (political debates, thematic discussions, culture and science)
• Reportage
• Scientific press
• Broadcasting
up to 20 texts of c. 1.500 words each
• TV Entertainment or Movie *
* The possibility of cutting “entertainment” and “movie” from the corpus has been discussed in light of the fact that the C-ORAL-ROM corpus should represent spontaneous speech, whereas such genres are specific to Media production,
which is not an objective of the project. Moreover, the C-ORAL-ROM project does not deal with the storage of multimodal
information. Therefore sampling from TV must focus on contexts where information passes mainly through spoken
language rather than images (low image-dependence).
For this reason, no more than one text from the last field is allowed in each corpus, and it must be considered
optional.
Telephone Private conversation (10.000 words)
(300-1.000 words per text)
In order to avoid legal problems with the recording of private telephone conversations, P7 proposed the following
strategy: ask for permission to record a phone call that may be received at some time during a given week.
An agreement with the telephone company could ensure direct recording by the telephone company once
permission has been granted (procedure to be detailed by P7).
Open questions
Phone call services and/or man-machine interaction
Two proposals concerning the representation of telephone conversations and man-machine interactions in the four corpora
must be verified:
1) Proposal from advisors
Information from operator (15.000 words) sampled from the following semantic domains:
• Tourism
• Health
• Weather forecast
• Traffic
• Train
• Restaurants
Note: The proposal needs to be verified with the phone companies that provide services in each country.
2) ITC/IRST proposal
Within the project, ITC/IRST will develop an automatic multilingual call center with a speech
recognition device devoted to tourist information. IRST asks the 4 principal contractors to
contribute to the collection of data for a multilingual phone-call database. IRST proposes
around 50 phone calls from each country to a toll-free number asking for information about
holidays in the Trento region. Should the agreement with phone companies fall through, each corpus will
include a selection from this database. Whatever the outcome, each principal contractor will carry out
this collection, without transcription.