Minutes of the kick-off meeting

Draft minutes of PCC at the C-ORAL-ROM kick-off meeting
(Florence, 15th January 2001)
Sampling criteria
Sampling criteria must be explicit in order to allow validation of corpora (ELDA). The correctness
of sampling operations by each principal contractor will be verified at the next PCC (June).
In accordance with the contract, the sampling of each corpus must ensure:
50% Informal (150.000 words)
50% Formal (150.000 words, including telephone conversations and man-machine interactions)
The PCC has decided on the application of two different sampling strategies in order to represent
spoken language correctly in both formal and informal domains. More specifically, sampling of
informal speech will follow sociolinguistic variation parameters, while sampling of formal speech
will follow genre distinctions.
Following the opinion expressed by the advisors, the PCC accepts the idea that media speech will
be considered as part of formal speech and its share raised to 60.000 words (30.000 in the
original proposal). Media speech should represent formal speech in media contexts rather than the
various uses of spoken language that each medium realizes; the latter is not the goal of
C-ORAL-ROM. Moreover, any legal problems that the representation of speech in formal contexts may
create are more easily overcome when such speech is already documented in the media.
Two proposals related to the representation of telephone conversations and man-machine interactions
in the four corpora (see below) must be verified in the first three months of the project.
Informal
Differences among speakers are always marked as regards age (A: 18-25; B: 25-40; C: 40-50; D:
>60), sex (M/F), education (1: illiterate and/or elementary school; 2: secondary school - high school;
3: B.A. – university), and geographical origin.
Sampling will follow socio-linguistic variation parameters
112.500 words (3/4 of total) in private & family contexts
37.500 words (1/4 of total) in public contexts



Monologues 42.000 w.
Dialogues 84.000 w.
Conversations 24.000 w.
Long texts: 10 texts of around 4.500 words each (c. 30 minutes each; tot. around 45.000 words)
Short texts: 65 texts of around 1.500 words each (c. 10 minutes each; tot. around 97.500 words)
Very short texts: tot. around 7.500 words, from 2 to 5 minutes each, taken from dialogues or
conversations in public contexts
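
The three breakdowns of the informal half (by context, by interaction type, by text length) should each add up to 150.000 words. A minimal sketch of that check follows; the dictionary keys and function names are illustrative, not project terminology, and the very short texts are entered as a single block because the minutes give only their total.

```python
# Word budgets for the informal half, as stated in the minutes (in words).
INFORMAL_TOTAL = 150_000

by_context = {"private_family": 112_500, "public": 37_500}
by_structure = {"monologues": 42_000, "dialogues": 84_000, "conversations": 24_000}

# (number of texts, approximate words per text). The very short texts are
# given only as a total, so they are modelled as one block of 7,500 words.
by_length = {"long": (10, 4_500), "short": (65, 1_500), "very_short": (1, 7_500)}


def length_total(spec):
    """Sum count * words-per-text over all length classes."""
    return sum(count * words for count, words in spec.values())


def check_informal_budget():
    """Report, per breakdown, whether it adds up to the informal total."""
    return {
        "context": sum(by_context.values()) == INFORMAL_TOTAL,
        "structure": sum(by_structure.values()) == INFORMAL_TOTAL,
        "length": length_total(by_length) == INFORMAL_TOTAL,
    }


print(check_informal_budget())
```

With the figures above, all three breakdowns are consistent with the 150.000-word target.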
Formal
Differences among speakers are always marked as regards age (A: 18-25; B: 25-40; C: 40-50; D:
>60), sex (M/F), education (1: illiterate and/or elementary school; 2: secondary school - high school;
3: B.A. – university), and geographical origin.
Texts are marked with respect to the distinction of monologue, dialogue, conversation, but sampling
will follow genre distinctions instead of socio-linguistic parameters:
Non-Media (65.000 words):
Political speech
Political debate
Preaching
Teaching
Professional explanation
Conference
Business
Law (through media)
Texts of around 3.000 words each; 2 or 3 texts for each field.
Formal speech in Media (60.000 words)
2 or 3 texts for each field, taken from broadcasts in one single day/week

Broadcasting: 20 texts of c. 1.500 words each
 News
 Sport
 Interviews
 Entertainment
 Meteo

TV/Film: 18 texts of c. 1.500 words each (possibly with at least one movie for each corpus)
 Scientific press
 Talk shows (Political debates, thematic discussions, culture and science)
 Reportage
Sampling from TV must focus on contexts where information passes mainly through spoken language
rather than images (low image-dependence).
Telephone (300-1.000 words each text)
 Private conversation (10.000 words)
Phone call services and/or man-machine interaction
1) Proposal from advisors
 Information from operator (15.000 words) sampled from the following semantic domains:
 Tourism
 Health
 Meteo
 Traffic
 Train
 Restaurants
Note: The proposal needs to be verified with phone companies that provide services in each
country
2) ITC/IRST proposal
Within the project, ITC/IRST will develop an automatic multi-lingual call center with a speech
recognition device devoted to tourist information. IRST asks the 4 principal contractors to
contribute to the collection of information for a multi-lingual phone-call database. IRST proposes
around 50 phone calls from each country to a toll-free number asking for information about
holidays in the Trento region.
Should the agreement with phone companies fall through, each corpus will include a selection from
this database. The collection will in any case be carried out by each principal contractor,
without transcription.
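
The formal half can be tallied in the same way as the informal one. The sketch below assumes the 15.000-word figure from proposal 1 for phone call services (still to be verified with the phone companies); the key names and function names are illustrative only.

```python
# Word budgets for the formal half, as stated in the minutes (in words).
FORMAL_TOTAL = 150_000

formal_budget = {
    "non_media": 65_000,
    "media": 60_000,
    "telephone_private": 10_000,
    "telephone_services": 15_000,  # assumption: proposal 1 figure, not yet verified
}

# (number of texts, approximate words per text) for the two media blocks.
media_texts = {"broadcasting": (20, 1_500), "tv_film": (18, 1_500)}


def unallocated(budget, total=FORMAL_TOTAL):
    """Words of the formal total not yet covered by the budget lines."""
    return total - sum(budget.values())


def words_from_texts(texts):
    """Approximate word count implied by the listed numbers of texts."""
    return sum(count * words for count, words in texts.values())


print(unallocated(formal_budget))     # 0
print(words_from_texts(media_texts))  # 57000
```

Under these assumptions the four budget lines cover the 150.000 formal words exactly, while the listed media texts come to about 57.000 words, close to the 60.000 target given that text lengths are approximate.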
Acoustic quality
Each text must be labelled from the point of view of acoustic quality.
Acoustic quality is evaluated on the basis of the possibility of computing F0 according to the
guidelines presented by PitchFrance. Texts where F0 cannot be computed must be excluded from
sampling. Texts accepted for sampling can present three different levels of acoustic quality:
A) Excellent.
B) F0 computing possible in most files.
C) F0 computing possible in many parts of files despite many possible disturbing factors.
All new recordings must be A-level quality, acquired using digital devices (DAT, mini-disc...).
Non-Media sampling (formal & informal) must contain texts of all quality levels.
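
The labelling rules above could be checked mechanically once each text carries a quality label. The following is only a sketch: the record fields (`id`, `quality`, `new_recording`) are hypothetical, and a `quality` of `None` stands for a text where F0 cannot be computed.

```python
LEVELS = ("A", "B", "C")


def quality_problems(texts):
    """Return a list of rule violations for a set of labelled texts.

    Each text is a dict with hypothetical keys: 'id', 'quality' (one of
    LEVELS, or None when F0 cannot be computed) and 'new_recording' (bool).
    """
    problems = []
    for text in texts:
        if text["quality"] not in LEVELS:
            problems.append(f"{text['id']}: F0 cannot be computed, exclude from sampling")
        elif text["new_recording"] and text["quality"] != "A":
            problems.append(f"{text['id']}: new recording below A level")
    # Non-Media sampling must contain texts of all quality levels.
    missing = set(LEVELS) - {text["quality"] for text in texts}
    if missing:
        problems.append("missing quality levels: " + ", ".join(sorted(missing)))
    return problems


sample = [
    {"id": "t1", "quality": "A", "new_recording": True},
    {"id": "t2", "quality": "B", "new_recording": False},
    {"id": "t3", "quality": "C", "new_recording": True},
    {"id": "t4", "quality": None, "new_recording": False},
]
for problem in quality_problems(sample):
    print(problem)
```

The all-levels check is meant for the Non-Media sample only; for other subsets that last rule would be dropped.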






Sampling frequency: 22050 Hz, 16 bit, mono
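
Whether a sound file matches this format can be verified automatically. The sketch below assumes the recordings are stored as WAV files, which the minutes do not state, and uses only Python's standard `wave` module; the helper `make_silent_wav` exists solely to make the example self-contained.

```python
import io
import wave

# Required format: 22050 Hz, 16 bit (2 bytes per sample), mono.
REQUIRED = {"framerate": 22050, "sampwidth": 2, "channels": 1}


def check_wav_format(source):
    """Return a list of mismatches against REQUIRED (empty if the file is OK).

    `source` is a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        actual = {
            "framerate": wav.getframerate(),
            "sampwidth": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }
    return [
        f"{name}: expected {expected}, found {actual[name]}"
        for name, expected in REQUIRED.items()
        if actual[name] != expected
    ]


def make_silent_wav(framerate, sampwidth, channels, n_frames=100):
    """Build a short silent WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sampwidth)
        wav.setframerate(framerate)
        wav.writeframes(b"\x00" * (n_frames * sampwidth * channels))
    buffer.seek(0)
    return buffer


print(check_wav_format(make_silent_wav(22050, 2, 1)))
print(check_wav_format(make_silent_wav(44100, 2, 2)))
```

The first call reports no mismatches; the second reports the sampling frequency and the channel count.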
Training
Training for WinPitchCorpus
Training for acoustic quality labelling
Checking of suitability of electronic equipment
Each Principal Contractor will define with Philippe Martin how to organize training courses from
mid-February to mid-March. As far as we can understand from a query to the EU project officer
on this point, labour payment is allowed like other project costs; however, travel reimbursement
from outside Europe is not permitted.
Legal aspects
All authorisations must cover:
1- Acquisition
2- Transcription
3- Treatment of sound + transcription
4- Publication of sound + transcription
Each selected text must be evaluated from the point of view of the legal problems that it could
present. A concise, comprehensive framework of the possible legal problems must be
produced at the end of the sampling operation.
One expert lawyer in each country will be asked for an authoritative opinion on the matter (ELDA
will help with the formalisation of the query).
All media corpora must be acquired through broadcasting & TV companies and their use
authorised by them.