Slide 1 - KU Leuven Kulak

Dutch Parallel Corpus
A Multilingual Annotated Corpus
Lieve Macken
Language & Translation Technology Team
University College Ghent
Dutch Parallel Corpus
• Annotated sentence-aligned corpus
• 10 million words
• Dutch - English / Dutch - French
• Linguistic annotations
– PoS & lemma
– Shallow syntactic analysis
• Quality control
• May 2006 - September 2009
Users and applications
• Fundamental research
– Translation studies / contrastive linguistics
– Corpus linguistics
• Support applications
– Translation support (CAT)
– Didactic support (CALL)
• HLT applications
– Machine Translation / Terminology Extraction
– Training and test data
Fundamental Research
Contrastive Linguistics
Translation Studies
Translation product
Translation process
Language systems
Translation strategies
• High-quality data
• Balanced by translation direction
Parallel & comparable corpus
Dutch texts → English & French translations
English & French texts → Dutch translations
Language Learning - CorpusCall
• Computer Assisted Language Learning
– Reference samples
– Learning activity
• Key Words in Context
– Authentic language usage
• Example Nederlex
– Electronic reading platform for French students learning Dutch
– Development reading platform: FUNDP, Namur
– Compilation parallel corpus: REBECA project (K.U.Leuven Campus Kortrijk)
Nederlex
Full text corpora as Translator’s aid
• Computer Assisted Translation
– To identify more appropriate TL equivalents, idiomatic expressions
– Extension to bilingual dictionaries
– Words in context
• Example: TransSearch (Canadian Hansards)
– Simard & Macklovitch 2005
Machine Translation
• Data-driven development of MT-systems
– Example Based MT & Statistical MT
• P. Koehn 2005: 110 SMT systems trained on the
Europarl corpus
– Example output Finnish-English:
we know very well that the current treaties are not
enough and that in future , it is necessary to develop
a better structure for the union and , therefore
perustuslaillisempi structure , which also expressed
more clearly what the member states and the union
is concerned .
Large corpora are useful …
• Number crunching applications
• Statistical analysis
• Automatic analysis
• No human intervention
… but less adequate for:
• Applications involving quality at all levels
• Applications involving human analysis
• Educational applications
DPC requirements
1) Corpus design
2) Linguistic annotation
3) Quality control
4) Corpus exploitation & availability
Design: translation directions
• Language Pairs & Translation Directions
• Balanced with respect to language pair and translation direction
– Min. 2 million words per translation direction
[Diagram: language pairs EN - NL, EN - NL - FR and NL - FR]
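The balance target above (at least 2 million words per translation direction) can be checked mechanically. The following is a minimal sketch, assuming a simple inventory of (source language, target language, word count) records; the inventory values and the function names are hypothetical and are not taken from the DPC itself.

```python
from collections import Counter

# Target from the slide: min. 2 million words per translation direction.
MIN_WORDS = 2_000_000


def words_per_direction(texts):
    """Sum word counts per (source, target) translation direction."""
    totals = Counter()
    for source, target, n_words in texts:
        totals[(source, target)] += n_words
    return totals


def underfilled(totals, minimum=MIN_WORDS):
    """Return the directions that are still below the minimum size."""
    return sorted(d for d, n in totals.items() if n < minimum)


# Hypothetical inventory: (source language, target language, word count).
inventory = [
    ("NL", "EN", 1_200_000),
    ("NL", "EN", 900_000),
    ("EN", "NL", 2_050_000),
    ("NL", "FR", 1_500_000),
    ("FR", "NL", 2_000_000),
]

totals = words_per_direction(inventory)
print(underfilled(totals))  # [('NL', 'FR')]
```

Keeping the direction as an ordered (source, target) pair, rather than an unordered language pair, is what lets the check distinguish NL → FR from FR → NL.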
Design: text types
• Commercial publishers
– Fictional & non-fictional literature e.g. novels,
essays
– Journalistic texts, e.g. news articles
• Institutions
– Instructive texts, e.g. user manuals
– Administrative texts, e.g. meeting minutes
– External communication, e.g. promotion
material, newsletters
Text providers
• Quality
– Published material
– Professional translation division
• Copyright clearance
– License agreements
– Collaboration with Dutch Agency of HLT
50 Text providers
Text type: providers
• Administrative texts: European Parliament, Europarl, Melexis, Flemish government, Speeches Kok, Balkenende, FOD Sociale Zekerheid, …
• External communication: BMM, Bosch, Barco, NMBS Holding, Arcelor Mittal, Fédération du tourisme de la province de Namur, Westtoer, …
• Literature: Ons Erfdeel, Lannoo, Vlaams Fonds der Letteren, Nijgh & Van Ditmar, Le Dilettante, …
• Journalistic texts: Roularta, The Independent, The Guardian / The Observer, De Standaard, De Morgen, Campuskrant, ING, Fortis, …
• Instructive texts: IBM, Bosch, DNS, Eli Lilly, …
Linguistic Annotation
• Structure
– Paragraphs, sentences, words
• Alignment
– Sentence alignment
• Vanilla Aligner
• Microsoft Bilingual Aligner
• Melamed’s GMA Aligner
– (Sub-sentential alignment)
• Linguistic annotation
– Lemma
– PoS
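To illustrate what sentence alignment involves, here is a minimal sketch of a length-based aligner using dynamic programming. It is a simplified illustration and not the algorithm of any of the three aligners named above: the cost function, the bead penalties and the function names are assumptions chosen for readability.

```python
# Allowed alignment "beads" (source sentences, target sentences) with an
# illustrative penalty for each; insertions and deletions are costly.
BEADS = {(1, 1): 0.0, (1, 0): 1.0, (0, 1): 1.0, (2, 1): 0.1, (1, 2): 0.1}


def bead_cost(src_chunk, tgt_chunk, penalty):
    """Relative character-length difference of the two chunks plus a penalty."""
    len_s = sum(len(s) for s in src_chunk)
    len_t = sum(len(t) for t in tgt_chunk)
    if len_s == 0 or len_t == 0:
        return penalty
    return abs(len_s - len_t) / (len_s + len_t) + penalty


def align(src, tgt):
    """Return the cheapest bead sequence as (source indices, target indices)."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + bead_cost(src[i:ni], tgt[j:nj], penalty)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


dutch = ["Het regent.", "Ik blijf thuis en lees een boek."]
english = ["It is raining.", "I am staying home.", "I will read a book."]
print(align(dutch, english))  # [([0], [0]), ([1], [1, 2])]
```

In this toy example the second Dutch sentence is linked to two English sentences (a 1-2 bead), the kind of non-trivial link that makes automatic alignment worth verifying.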
Corpus Representation
• Text Mark-up
– TEI
• Encoding
– UTF-8
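As an illustration of what annotated, UTF-8 encoded mark-up makes possible, the sketch below reads a small TEI-style fragment with Python's standard library. The element and attribute names (`s`, `w`, `lemma`, `pos`), the tag values and the sample sentence are assumptions for the example and do not reproduce the actual DPC encoding scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment: one paragraph, one sentence, four words.
SAMPLE = (
    '<text xml:lang="nl"><body><p n="1">'
    '<s n="1">'
    '<w lemma="het" pos="LID">Het</w> '
    '<w lemma="café" pos="N">café</w> '
    '<w lemma="zijn" pos="WW">is</w> '
    '<w lemma="open" pos="ADJ">open</w>'
    '</s>'
    '</p></body></text>'
)


def sentences(xml_bytes):
    """Return (sentence number, [(word form, lemma, PoS), ...]) per sentence."""
    root = ET.fromstring(xml_bytes)
    result = []
    for s in root.iter("s"):
        tokens = [(w.text, w.get("lemma"), w.get("pos")) for w in s.iter("w")]
        result.append((s.get("n"), tokens))
    return result


# The fragment is encoded as UTF-8 bytes, as a file on disk would be.
for number, tokens in sentences(SAMPLE.encode("utf-8")):
    print(number, tokens)
```

Because structure (paragraph, sentence, word) and annotation (lemma, PoS) live in the same tree, a query can move between word forms and their annotations without a separate lookup.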
Quality control
• Manually checked
– 10% of whole corpus
• Spot checking
– Based on error analysis of manually verified
data
• Automatic control procedures
– e.g. automatic comparison of output from
different alignment programs
Alignment merge
[Figure: sentences 1-5 of Text language 1 and Text language 2, aligned by two aligners (AL1 and AL2); merged alignment with manual check]
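The automatic comparison of aligner outputs can be sketched as follows. The merge rule used here is an assumption, since the slides do not spell it out: links proposed by both aligners are accepted, and all other links are set aside for a manual check. The link representation and the function name are likewise hypothetical.

```python
def merge_alignments(al1, al2):
    """Split links into those both aligners agree on and those they do not.

    Each link is a pair (source sentence numbers, target sentence numbers),
    with both sides given as tuples.
    """
    set1, set2 = set(al1), set(al2)
    agreed = sorted(set1 & set2)     # proposed by both aligners
    disputed = sorted(set1 ^ set2)   # proposed by only one aligner
    return agreed, disputed


# Hypothetical outputs of two aligners for a five-sentence text pair.
al1 = [((1,), (1,)), ((2,), (2,)), ((3,), (3,)), ((4,), (4,)), ((5,), (5,))]
al2 = [((1,), (1,)), ((2,), (2,)), ((3, 4), (3,)), ((5,), (4, 5))]

agreed, disputed = merge_alignments(al1, al2)
print("accepted:", agreed)
print("manual check:", disputed)
```

Under this rule, human effort is spent only where the aligners disagree, here on sentences 3 to 5.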
External validation
• Formal validation by CST (Centre for
Language Technology, Copenhagen)
• Suitability test by Xplanation
Corpus exploitation
• Web search interface
– Parallel KWIC concordance
– Simple queries
– Extended queries
• Pattern matching & annotation labels
• Full text resource
– Data-driven automatic learning (e.g. SMT)
– Two monolingual XML-files + alignment file
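A parallel KWIC concordance can be sketched in a few lines once the corpus is available as aligned sentence pairs. This is a minimal illustration and not the DPC web interface: the function name, the context width and the sample sentence pairs are assumptions.

```python
import re


def parallel_kwic(pairs, keyword, width=20):
    """List keyword hits in the source sentence next to the aligned target.

    Returns tuples of (left context, keyword, right context, target sentence).
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for src, tgt in pairs:
        for match in pattern.finditer(src):
            left = src[max(0, match.start() - width):match.start()]
            right = src[match.end():match.end() + width]
            hits.append((left.rjust(width), match.group(0), right.ljust(width), tgt))
    return hits


# Hypothetical aligned Dutch-English sentence pairs.
pairs = [
    ("De kat slaapt op de bank.", "The cat sleeps on the couch."),
    ("Ik ga morgen naar de bank.", "I am going to the bank tomorrow."),
    ("Ik zie de hond.", "I see the dog."),
]

for left, key, right, target in parallel_kwic(pairs, "bank"):
    print(f"{left} [{key}] {right} | {target}")
```

The two hits for "bank" show why the aligned target side matters: the same Dutch word is rendered as "couch" in one context and "bank" in the other.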
Metadata
• Additional filter to retrieve samples
– Text-related data
• Language, text type, domain and keywords
– Translation-related data
• Source language, target language
– Annotation-related data
• Quality label
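The metadata categories above can act as a filter on corpus samples. The sketch below assumes a flat record per sample; the field names, the quality label values and the sample records are hypothetical and do not reproduce the DPC metadata scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sample_id: str
    text_type: str     # text-related data
    source_lang: str   # translation-related data
    target_lang: str   # translation-related data
    quality: str       # annotation-related data (hypothetical label values)


def select(samples, **criteria):
    """Keep the samples whose metadata match every given criterion."""
    return [
        s for s in samples
        if all(getattr(s, field) == value for field, value in criteria.items())
    ]


# Hypothetical sample records.
samples = [
    Sample("s001", "journalistic", "NL", "EN", "manually verified"),
    Sample("s002", "instructive", "EN", "NL", "spot checked"),
    Sample("s003", "journalistic", "NL", "FR", "spot checked"),
]

for s in select(samples, text_type="journalistic", source_lang="NL"):
    print(s.sample_id, s.target_lang, s.quality)
```

Combining text-related and translation-related criteria in one call mirrors the idea of metadata as an additional filter on top of a query.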
Availability
• Via Dutch Agency for Human Language
Technologies (TST-centrale)
DPC objectives
• Quality control
• Level of annotation
– Sentence alignment
– PoS, lemma
• Balanced composition
– Translation direction
– Text types
• Availability
– Via Dutch Agency for Human Language
Technologies (TST-centrale)
DPC Team
• K.U. Leuven campus Kortrijk
Prof. Dr. Piet Desmet
Dr. Hans Paulussen
Lic. Maribel Montero Perez
• University College Ghent School of Translation Studies
Prof. Dr. Willy Vandeweghe
Dra. Lieve Macken
Orphée Declercq
Questions?
www.kuleuven-kortrijk.be/dpc