State of the art of the CORAL ROM sampling strategies

Report on the Decision on the C-ORAL-ROM sampling criteria (From Kick-off meeting and Paris Meeting)

Sampling criteria must be explicit in order to allow validation of the corpora (ELDA). The correctness of the sampling operations carried out by each principal contractor will be verified at the next PCC (June).

In accordance with the contract, the sampling of each corpus must ensure:

- 50% Informal (150.000 words)
- 50% Formal (including media, telephone conversations and man/machine interactions) (150.000 words)

Differences among speakers are always marked as regards:

1. age (A: 18-25; B: 25-40; C: 40-50; D: >60)
2. sex (M-F)
3. education (1: illiterate and/or elementary school; 2: secondary school - high school; 3: B.A. – university)
4. geographical origin

The PCC has decided to apply two different sampling strategies in order to represent spoken language correctly in both formal and informal domains. More specifically, the sampling of informal speech will follow variation parameters regarding both the structure of the communicative event and its main sociological context, while the sampling of formal speech will follow genre distinctions. The decision runs as follows:

Informal

- 112.500 words (3/4 of total) in private & family contexts
- 37.500 words (1/4 of total) in public contexts

- Monologues 42.000 w.
- Dialogues/Conversations 108.000 w.

- Long texts: 10 texts of around 4500 words each (c. 30 minutes each; tot. around 45.000 words)
- Short texts: 65 texts of around 1500 words each (c. 10 minutes each; tot. around 97.500 words)
- Very short texts: around 7.500 words in total, from 2 to 5 minutes each, taken from dialogues or conversations in public contexts

The advisor from IPO noted that the kick-off decision was insufficient because it did not map text length onto language variation for informal speech (the Dutch corpus details this mapping).
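The three breakdowns of the informal half in the kick-off decision (by context, by structure, by text length) should each add up to the same 150.000 words. The following is a minimal sketch of that cross-check in Python; the variable names are illustrative only, and the very short texts are entered as a single 7.500-word block on the assumption that this figure is a total rather than a per-text length.

```python
# Figures copied from the kick-off decision for the informal half (in words).
INFORMAL_TOTAL = 150_000

by_context = {"private_family": 112_500, "public": 37_500}
by_structure = {"monologues": 42_000, "dialogues_conversations": 108_000}

# (number of texts, approximate words per text).
# Assumption: the very short texts are given only as a 7,500-word total,
# so they are modelled here as one block of 7,500 words.
by_length = {
    "long": (10, 4_500),
    "short": (65, 1_500),
    "very_short": (1, 7_500),
}


def length_total(spec):
    """Sum texts * words-per-text over all length classes."""
    return sum(n_texts * words for n_texts, words in spec.values())


checks = {
    "context": sum(by_context.values()),
    "structure": sum(by_structure.values()),
    "length": length_total(by_length),
}

for name, total in checks.items():
    status = "ok" if total == INFORMAL_TOTAL else "MISMATCH"
    print(f"{name:10s} {total:>8,d} {status}")
```

With the figures as stated, all three breakdowns reach the 150.000-word target, so the kick-off model is internally consistent at this level of detail.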
The following is the model proposed at the Paris meeting, integrated with the type/length relation:

Private / Family 112.000
    Monologues 32.000: 9000 L (2), 23000 S (15)
    Dialogues/Conversations 75.000 *: 18000 L (4), 57000 S (38)
Public 37.000
    Monologues 10.000: 9000 L (2), 7500 S (5)
    Dialogues/Conversations 27.000: 9000 L (2), 10500 S (7), 7500 M

* At least 23.000 words of conversations with more than two participants.

Text length in words (5% variation allowed):

- Long (L): 4.500 average **
- Short (S): 1500 average **
- Collections of very short dialogues (M)

** In order to facilitate the reuse of previous corpora, the Paris meeting decided on the following addition to the kick-off model: up to 20% of the part of the corpus devoted to short texts (1500 words) can be built, instead of from single short texts, from pairs of texts of the same type of just 750 words each or, conversely, by splitting a 3000-word text into two files of 1500 words.

Formal

The formal C-ORAL-ROM corpus is divided into four sections:

a) Formal speech in the natural context
b) Formal speech in Media
c) Telephone Private conversation
d) Phone call services and/or man-machine interaction

The representation of the various uses of spoken language realised by the media is not an objective of C-ORAL-ROM. Media speech will be considered as part of formal speech, i.e. it represents formal speech in Media contexts. The proportion between Media and Formal speech in the natural context was proposed at the kick-off meeting and is reported here as a guideline. However, given that legal problems may emerge for the representation of speech in some formal contexts, it is assumed that a sample of that typology can be taken from media (e.g. political debate broadcast through the media may be used).
As a consequence, the proportion of media speech may vary in each corpus (Paris meeting).

Formal speech in the natural context (65.000 words):

- Political speech
- Political debate
- Preaching
- Teaching
- Professional explanation
- Conference
- Business
- Law (the possibility of using materials from mock situations in trainee placements for lawyers, as well as from law schools, is foreseen in order to avoid problems due to the special protection of juridical speech)

Texts of around 3.000 words each (2/3 texts for each field). Texts are marked with respect to the distinction of monologue, dialogue, conversation, but no proportion is defined.

Formal speech in Media (60.000 words):

- News, only one sample (1500 words)
- Meteo, just a small sample (500 words)
- Sport
- Interviews
- Talk shows (Political debates, thematic discussions, culture and science)
- Reportage
- Scientific press
- Broadcasting up to 20 texts of c. 1.500 words each
- TV Entertainment or Movie *

* The possibility of cutting "entertainment" and "movie" from the corpus has been discussed in light of the fact that the C-ORAL-ROM corpus should represent spontaneous speech, whereas such genres are specific to Media production, which is not an objective of the project. Moreover, the C-ORAL-ROM project does not deal with the storage of multi-modal information. Therefore sampling from TV must focus on contexts where information passes mainly through spoken language rather than images (low image-dependence). For this reason, no more than one text of the last field is allowed in each corpus, and it must be considered optional.

Telephone Private conversation (10.000 words) (300-1.000 words each text)

In order to avoid legal problems with the recording of private telephone conversations, P7 proposed the following strategy: ask for permission to record a possible phone call to be received at some time during a given week.
A possible agreement with the telephone company can ensure direct recording by the telephone company once permission has been obtained (procedure to be detailed by P7).

Open questions

Phone call services and/or man-machine interaction

Two proposals related to the representation of telephone conversation and man-machine interaction in the four corpora must be verified:

1) Proposal from advisors

Information from operator (15.000 words) sampled from the following semantic domains:

- Tourism
- Health
- Meteo
- Traffic
- Train
- Restaurants

Note: The proposal needs to be verified with the phone companies that provide services in each country.

2) ITC/IRST proposal

Within the project, ITC/IRST will develop an automatic multi-lingual call center with a speech recognition device devoted to tourist information. IRST asks the 4 principal contractors to contribute to the collection of information for a multi-lingual phone-call database. IRST proposes around 50 phone calls from each country to a toll-free number asking for information about holidays in the Trento region.

Should the agreement with the phone companies fall through, each corpus will include a selection from this database, realised by each principal contractor without transcription, irrespective of whatever happens.
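The word budgets quoted for the four formal sections can be checked against the 150.000-word formal half with a short sketch. The dictionary keys below are illustrative labels, not project terminology. The 15.000-word figure for phone call services is taken from the advisors' proposal, which is still an open question, and the reading of "2/3 texts for each field" as "two or three texts per field" is an assumption.

```python
# Word budgets for the formal half, as quoted in the report.
FORMAL_TOTAL = 150_000

formal_sections = {
    "natural_context": 65_000,
    "media": 60_000,
    "telephone_private": 10_000,
    # Assumption: advisors' proposal (information from operator), still open.
    "phone_services_or_man_machine": 15_000,
}

planned = sum(formal_sections.values())
print(f"planned formal words: {planned:,d} (target {FORMAL_TOTAL:,d})")

for name, words in formal_sections.items():
    print(f"{name:32s} {words:>7,d} {words / FORMAL_TOTAL:6.1%}")

# Natural-context section: 8 fields, texts of around 3,000 words each.
# Assumption: "2/3 texts for each field" means two or three texts per field.
N_FIELDS = 8
WORDS_PER_TEXT = 3_000
low, high = (N_FIELDS * n * WORDS_PER_TEXT for n in (2, 3))
print(f"natural context range: {low:,d} to {high:,d} words")
```

Under these assumptions the four sections sum to the 150.000-word target, and the 65.000 words planned for the natural context fall inside the range implied by two or three 3.000-word texts per field.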
