Dutch Parallel Corpus
A Multilingual Annotated Corpus
Lieve Macken
Language & Translation Technology Team
University College Ghent

Dutch Parallel Corpus
• Annotated, sentence-aligned corpus
• 10 million words
• Dutch-English / Dutch-French
• Linguistic annotations
  – PoS & lemma
  – Shallow syntactic analysis
• Quality control
• May 2006 - September 2009

Users and applications
• Fundamental research
  – Translation studies / contrastive linguistics
  – Corpus linguistics
• Support applications
  – Translation support (CAT)
  – Didactic support (CALL)
• HLT applications
  – Machine Translation / Terminology Extraction
  – Training and test data

Fundamental Research
[Diagram labels: Contrastive Linguistics, Translation Studies, Translation product, Translation process, Language systems, Translation strategies]
• High-quality data
• Balanced by translation direction

Parallel & comparable corpus
• Dutch texts with English & French translations
• English & French texts with Dutch translations

Language Learning - CorpusCall
• Computer Assisted Language Learning
  – Reference samples
  – Learning activity
• Key Words in Context
  – Authentic language usage
• Example: Nederlex
  – Electronic reading platform for French students learning Dutch
  – Development of the reading platform: FUNDP, Namur
  – Compilation of the parallel corpus: REBECA project (K.U.Leuven Campus Kortrijk)

Nederlex

Full text corpora as Translator’s aid
• Computer Assisted Translation
  – To identify a more appropriate TL equivalent, idiomatic expressions
  – Extension to bilingual dictionaries
  – Words in context
• Example: TransSearch (Canadian Hansards)
  – Simard & Macklovitch 2005

Machine Translation
• Data-driven development of MT systems
  – Example-Based MT & Statistical MT
• P.
Koehn 2005: 110 SMT systems trained on the Europarl corpus
  – Example output Finnish-English: "we know very well that the current treaties are not enough and that in future , it is necessary to develop a better structure for the union and , therefore perustuslaillisempi structure , which also expressed more clearly what the member states and the union is concerned ."

Large corpora are useful …
• Number crunching applications
• Statistical analysis
• Automatic analysis
• No human intervention

… but less adequate for:
• Applications involving quality at all levels
• Applications involving human analysis
• Educational applications

DPC requirements
1) Corpus design
2) Linguistic annotation
3) Quality control
4) Corpus exploitation & availability

Design: translation directions
• Language pairs & translation directions
• Balanced with respect to language pair and translation direction
  – Min. 2 million words per translation direction
[Diagram: translation directions among EN, NL and FR]

Design: text types
• Commercial publishers
  – Fictional & non-fictional literature, e.g. novels, essays
  – Journalistic texts, e.g. news articles
• Institutions
  – Instructive texts, e.g. user manuals
  – Administrative texts, e.g. meeting minutes
  – External communication, e.g.
promotional material, newsletters

Text providers
• Quality
  – Published material
  – Professional translation division
• Copyright clearance
  – License agreements
  – Collaboration with Dutch Agency of HLT

50 Text providers
Text type | Provider
Administrative texts | European Parliament, Europarl, Melexis, Flemish government, speeches (Kok, Balkenende), FOD Sociale Zekerheid, …
External communication | BMM, Bosch, Barco, NMBS Holding, Arcelor Mittal, Fédération du tourisme de la province de Namur, Westtoer, …
Literature | Ons Erfdeel, Lannoo, Vlaams Fonds der Letteren, Nijgh & Van Ditmar, Le Dilettante, …
Journalistic texts | Roularta, The Independent, The Guardian / The Observer, De Standaard, De Morgen, Campuskrant, ING, Fortis, …
Instructive texts | IBM, Bosch, DNS, Eli Lilly, …

Linguistic Annotation
• Structure
  – Paragraphs, sentences, words
• Alignment
  – Sentence alignment
    • Vanilla Aligner
    • Microsoft Bilingual Aligner
    • Melamed’s GMA Aligner
  – (Sub-sentential alignment)
• Linguistic annotation
  – Lemma
  – PoS

Corpus Representation
• Text mark-up
  – TEI
• Encoding
  – UTF-8

Quality control
• Manually checked
  – 10% of whole corpus
• Spot checking
  – Based on error analysis of manually verified data
• Automatic control procedures
  – e.g.
automatic comparison of output from different alignment programs

Alignment merge
[Diagram: Text language 1 and Text language 2 (sentences 1 to 5 each); alignments AL1 and AL2; manual check]

External validation
• Formal validation by CST (Centre for Language Technology, Copenhagen)
• Suitability test by Xplanation

Corpus exploitation
• Web search interface
  – Parallel KWIC concordance
  – Simple queries
  – Extended queries
    • Pattern matching & annotation labels
• Full text resource
  – Data-driven automatic learning (e.g. SMT)
  – Two monolingual XML files + alignment file

Metadata
• Additional filter to retrieve samples
  – Text-related data
    • Language, text type, domain and keywords
  – Translation-related data
    • Source language, target language
  – Annotation-related data
    • Quality label

Availability
• Via Dutch Agency for Human Language Technologies (TST-centrale)

DPC objectives
• Quality control
• Level of annotation
  – Sentence alignment
  – PoS, lemma
• Balanced composition
  – Translation direction
  – Text types
• Availability
  – Via Dutch Agency for Human Language Technologies (TST-centrale)

DPC Team
• K.U. Leuven campus Kortrijk
  Prof. Dr. Piet Desmet
  Dr. Hans Paulussen
  Lic. Maribel Montero Perez
• University College Ghent, School of Translation Studies
  Prof. Dr. Willy Vandeweghe
  Dra. Lieve Macken
  Orphée Declercq

Questions?
www.kuleuven-kortrijk.be/dpc
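
Backup: alignment merge, illustrative sketch

The quality-control slides mention an automatic comparison of the output of different alignment programs, followed by a manual check. The sketch below is one possible reading of that step, not the DPC implementation: it assumes each aligner's output is a list of links between sentence numbers, accepts the links both aligners agree on, and flags the rest for manual checking. The function name `merge_alignments`, the `Link` representation and the sample data are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: a link pairs source sentence numbers with
# target sentence numbers, e.g. ((1,), (1,)) for 1:1 or ((2, 3), (2,)) for 2:1.
Link = Tuple[Tuple[int, ...], Tuple[int, ...]]


def merge_alignments(al1: List[Link], al2: List[Link]) -> Tuple[List[Link], List[Link]]:
    """Split links into those both aligners agree on and those they dispute."""
    set1, set2 = set(al1), set(al2)
    agreed = sorted(set1 & set2)    # proposed by both aligners
    disputed = sorted(set1 ^ set2)  # proposed by only one: candidates for manual check
    return agreed, disputed


# Made-up output of two aligners (AL1, AL2) for a five-sentence text pair.
al1 = [((1,), (1,)), ((2,), (2,)), ((3,), (3,)), ((4,), (4,)), ((5,), (5,))]
al2 = [((1,), (1,)), ((2, 3), (2,)), ((4,), (3, 4)), ((5,), (5,))]

agreed, disputed = merge_alignments(al1, al2)
print("agreed:", agreed)
print("manual check:", disputed)
```

With this sample data only the links for sentences 1 and 5 are accepted automatically; every other link goes to the manual check. Exact-match agreement is a deliberately conservative assumption; a real pipeline might treat partially overlapping links differently.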