Draft minutes of PCC at the C-ORAL-ROM kick-off meeting (Florence, 15th January 2001)

Sampling criteria

Sampling criteria must be explicit in order to allow validation of the corpora (ELDA). The correctness of the sampling operations carried out by each principal contractor will be verified at the next PCC (June).

In accordance with the contract, the sampling of each corpus must ensure:
- 50% Informal (150.000 words)
- 50% Formal, including telephone conversations and man/machine interactions (150.000 words)

The PCC has decided to apply two different sampling strategies in order to represent spoken language correctly in both the formal and the informal domain. More specifically, sampling of informal speech will follow sociolinguistic variation parameters, while sampling of formal speech will follow genre distinctions.

Following the opinion expressed by the advisors, the PCC accepts the idea that media speech will be considered part of formal speech and its share raised to 60.000 words (30.000 in the original proposal). Media speech should represent formal speech in media contexts rather than the various uses of spoken language found in each medium; the latter is not the goal of C-ORAL-ROM. Moreover, any legal problems that the representation of speech in formal contexts may create are more easily overcome when such speech is already documented in the media.

Two proposals concerning the representation of telephone conversations and man/machine interactions in the four corpora (see below) must be verified in the first three months of the project.

Informal

Differences among speakers are always marked as regards age (A: 18-25; B: 25-40; C: 40-50; D: >60), sex (M-F), education (1: illiterate and/or elementary school; 2: secondary school - high school; 3: B.A. - university) and geographical origin.
Sampling will follow sociolinguistic variation parameters:
- 112.500 words (3/4 of total) in private & family contexts
- 37.500 words (1/4 of total) in public contexts

- Monologues: 42.000 words
- Dialogues: 84.000 words
- Conversations: 24.000 words

- Long texts: 10 texts of around 4.500 words each (c. 30 minutes each; total around 45.000 words)
- Short texts: 65 texts of around 1.500 words each (c. 10 minutes each; total around 97.500 words)
- Very short texts: around 7.500 words in total, 2 to 5 minutes each, taken from dialogues or conversations in public contexts

Formal

Differences among speakers are always marked as regards age (A: 18-25; B: 25-40; C: 40-50; D: >60), sex (M-F), education (1: illiterate and/or elementary school; 2: secondary school - high school; 3: B.A. - university) and geographical origin. Texts are marked with respect to the distinction between monologue, dialogue and conversation, but sampling will follow genre distinctions instead of sociolinguistic parameters.

Non-Media (65.000 words): texts of around 3.000 words each, 2 or 3 texts for each field
- Political speech
- Political debate
- Preaching
- Teaching
- Professional explanation
- Conference
- Business
- Law (through media)

Formal speech in Media (60.000 words): 2 or 3 texts for each field, taken from broadcasts in one single day/week

Broadcasting: 20 texts of c. 1.500 words each
- News
- Sport
- Interviews
- Entertainment
- Weather forecasts

TV/Film: 18 texts of c. 1.500 words each (possibly with at least one movie for each corpus)
- Scientific press
- Talk shows (political debates, thematic discussions, culture and science)
- Reportage

Sampling from TV must focus on contexts where information passes mainly through spoken language rather than images (low image-dependence).
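The three informal breakdowns listed earlier in these minutes (by context, by interaction type, by text length) should each add up to the 150.000 words of the informal half. A minimal Python sketch of that cross-check follows. The variable and function names are illustrative only and are not part of the PCC decisions; the very short texts are entered as the 7.500-word total given in the minutes, and all "around" figures are treated as exact.

```python
# Informal word budgets, copied from the figures in the minutes.
INFORMAL_TOTAL = 150_000

# Hypothetical names; the values are the ones stated in the minutes.
breakdowns = {
    "by_context": {"private_family": 112_500, "public": 37_500},
    "by_interaction": {
        "monologues": 42_000,
        "dialogues": 84_000,
        "conversations": 24_000,
    },
    "by_length": {
        "long": 10 * 4_500,      # 10 texts of around 4.500 words
        "short": 65 * 1_500,     # 65 texts of around 1.500 words
        "very_short": 7_500,     # given only as a total
    },
}


def check_breakdown(parts, total=INFORMAL_TOTAL):
    """Return True if the word counts in `parts` add up to `total`."""
    return sum(parts.values()) == total


# One boolean per breakdown: does it match the informal total?
results = {name: check_breakdown(parts) for name, parts in breakdowns.items()}

for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```

The same kind of check could be repeated by each principal contractor on the actual word counts of the sampled texts before the verification at the next PCC.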
Telephone (300-1.000 words each text)

Private conversation (10.000 words)

Phone call services and/or man-machine interaction

1)- Proposal from advisors
Information from operator (15.000 words), sampled from the following semantic domains:
- Tourism
- Health
- Weather
- Traffic
- Train
- Restaurants
Note: The proposal needs to be verified with the phone companies that provide such services in each country.

2)- ITC/IRST proposal
Within the project, ITC/IRST will develop an automatic multilingual call center, with a speech recognition device, devoted to tourist information. IRST asks the 4 principal contractors to contribute to the collection of data for a multilingual phone-call database. IRST proposes around 50 phone calls from each country to a toll-free number, asking for information about holidays in the Trento region. Should the agreement with the phone companies fall through, each corpus will include a selection from this database, collected by each principal contractor without transcription; the collection will go ahead in either case.

Acoustic quality

Each text must be labelled for acoustic quality. Acoustic quality is evaluated on the basis of the possibility of computing F0, according to the guidelines presented by PitchFrance. Texts where F0 cannot be computed must be excluded from sampling. Texts accepted for sampling can present three different levels of acoustic quality:
A) Excellent.
B) F0 computing possible in most files.
C) F0 computing possible in many parts of files, despite many possible disturbing factors.

All new recordings must be of A-level quality and acquired using digital devices (DAT, mini-disc...). Non-Media sampling (formal & informal) must contain texts of all quality levels.

Sampling frequency: 22050 Hz, 16 bit, mono

Training

- Training for WinPitchCorpus
- Training for acoustic quality labelling
- Checking of suitability of electronic equipment

Each Principal Contractor will define with Philippe Martin how to organize training courses from mid-February to mid-March.
As far as we can understand from a query to the EU project officer on this point, labour payment is allowed like other project costs; however, reimbursement of travel from outside Europe is not permitted.

Legal aspects

All authorisations must cover:
1- Acquisition
2- Transcription
3- Treatment of sound + transcription
4- Publication of sound + transcription

Each selected text must be evaluated from the point of view of the legal problems that it could present. A concise, comprehensive overview of the possible legal problems must be drawn up at the end of the sampling operation. One expert lawyer in each country will be asked for an authoritative opinion on the matter (ELDA will help to formalise the query). All media corpora must be acquired through broadcasting & TV companies, and their use must be authorised by them.