1 The acquis communautaire
Acquis communautaire means the entire body of European legislation, including all the treaties, regulations and directives adopted by the European Union and the rulings of the Court of Justice. Since each new country joining the EU is required to accept the whole acquis communautaire, this body of laws has to be translated into all 23 official languages. As a result, the acquis now exists as parallel texts in 23 languages.
Multilingual parallel corpora are an invaluable source for developing multilingual language technologies (machine translation, terminology extraction, sense disambiguation, etc.). The value of a parallel corpus grows as the number of translation units and of languages increases, especially if each language can be paired with any other.
2 Source
This extraction of aligned sentences can be used to produce a parallel multilingual corpus of the legislative documents (acquis) of the European Union in 22 EU languages.
The aligned sentences ("translation units") have been provided by the Directorate-General for Translation of the European Commission by extraction from one of its large shared translation memories in Euramis (European advanced multilingual information system). To keep the size down, the extraction takes English as the source language. Users can select the language combination they want, using the extraction extension tool also provided by DGT.
The extraction does not cover the whole acquis since it is taken from a database that does not contain the complete EU legislation. The sequence in the extracted files is not necessarily the same as in the underlying documents, and redundancies like "Article 1" are inevitable. The documents in the files are identified by the document number (Numdoc) of the original legislative document in the EUR-Lex database. The documents are in TMX, a widely used format provided by LISA (http://www.lisa.org/standards/tmx/tmx.htm). To remain backwards compatible, the header mentions TMX format 1.1, but the files are also compliant with TMX 1.4b.
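As an illustration of selecting one language combination from such a file, here is a minimal Python sketch. It is not the extraction tool provided by DGT. It assumes that each translation unit is a <tu> element with one <tuv>/<seg> per language, that the language is given either in a plain "lang" attribute (TMX 1.1 style) or in "xml:lang" (TMX 1.4b style), and that the Numdoc sits in the first <prop> child of the unit. The function names and the SAMPLE data are hypothetical and hand-written for this example.

```python
import io
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree reports it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample, NOT taken from the real files (layout is an assumption).
SAMPLE = b"""<tmx version="1.1"><header/><body>
<tu><prop type="Txt::Doc. No.">32004R0001</prop>
<tuv lang="EN-GB"><seg>Article 1</seg></tuv>
<tuv lang="DE-DE"><seg>Artikel 1</seg></tuv>
</tu>
<tu><tuv lang="EN-GB"><seg>Only English</seg></tuv></tu>
</body></tmx>"""


def tuv_lang(tuv):
    """Return the upper-cased language code of a <tuv>, or '' if absent."""
    # Accept both the TMX 1.1 attribute (lang) and the TMX 1.4b one (xml:lang).
    return (tuv.get("lang") or tuv.get(XML_LANG) or "").upper()


def extract_pairs(source, src="EN", tgt="DE"):
    """Yield (numdoc, src_text, tgt_text) for units that contain both languages."""
    # iterparse streams the file, so large memories need not fit in RAM at once.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "EN-GB" is stored as "EN".
                code = tuv_lang(tuv).split("-")[0]
                segs[code] = "".join(seg.itertext()).strip()
        prop = elem.find("prop")  # assumed to hold the Numdoc
        numdoc = prop.text.strip() if prop is not None and prop.text else None
        if src in segs and tgt in segs:
            yield numdoc, segs[src], segs[tgt]
        elem.clear()  # drop the processed unit's children to save memory


if __name__ == "__main__":
    for numdoc, en, de in extract_pairs(io.BytesIO(SAMPLE)):
        print(numdoc, en, de, sep="\t")
```

Units that lack one of the two requested languages are skipped, which is why the second unit of the sample produces no pair. A file path can be passed instead of the in-memory sample.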
DGT cannot assume any responsibility for the quality and the content.
3 Description of the data
Before the documents were aligned and corrected, they were pre-processed to remove any differences between the source and target language versions (for further details see EUR-Lex preprocessing.doc). This means that the contents of the documents might have changed. The documents were aligned in accordance with the segmentation rules used in the Directorate-General for Translation of the European Commission. The extraction keeps only the EUR-Lex document number (Numdoc), from which other information (e.g. year and document type) can be derived (for further information on the Numdoc structure, see http://europa.eu.int/eurlex/en/information/help/help-dir.html).
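To illustrate how the year and document type might be derived from a Numdoc, here is a minimal Python sketch. It assumes the CELEX-style layout: one sector character, a four-digit year, a one- or two-letter document type and a four-digit sequential number, as in 32004R0001. This layout, the names Numdoc, DOC_TYPES and parse_numdoc, and the deliberately partial type table are assumptions made for this example; the help page cited above remains the authoritative description.

```python
import re
from typing import NamedTuple


class Numdoc(NamedTuple):
    sector: str    # e.g. "3" for secondary legislation
    year: int      # four-digit year
    doc_type: str  # one- or two-letter type code
    number: str    # four-digit sequential number, kept as text


# Assumed layout: sector (1 char), year (4 digits), type (1-2 letters),
# number (4 digits). Anything after the number (e.g. a suffix) is ignored.
_PATTERN = re.compile(
    r"^(?P<sector>[0-9A-Z])(?P<year>\d{4})(?P<type>[A-Z]{1,2})(?P<number>\d{4})"
)

# Partial, illustrative mapping only; many other type codes exist.
DOC_TYPES = {"R": "regulation", "L": "directive", "D": "decision"}


def parse_numdoc(numdoc):
    """Split a Numdoc string into sector, year, type and number."""
    match = _PATTERN.match(numdoc.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised Numdoc: {numdoc!r}")
    return Numdoc(
        match["sector"], int(match["year"]), match["type"], match["number"]
    )


if __name__ == "__main__":
    parsed = parse_numdoc("32004R0001")
    print(parsed.year, DOC_TYPES.get(parsed.doc_type, "other"))
```

Strings that do not fit the assumed layout raise a ValueError rather than being guessed at, so unexpected identifiers are easy to spot.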
4 Statistics
The extraction of the acquis is available in 22 languages. The units in the multilingual extraction files are distributed as follows:
Language    Number of units
EN          2 187 504
BG            708 658
CS            890 025
DA            433 871
DE            532 668
EL            371 039
ES            509 054
ET          1 047 503
FI            514 868
FR          1 106 442
HU          1 159 975
IT            542 873
LT          1 126 255
LV          1 120 835
MT          1 021 855
NL            502 557
PL          1 052 136
PT            945 203
RO            650 735
SK          1 065 399
SL          1 026 668
SV            555 362
5 Conditions for use/licensing issues
Under Commission Decision 2006/291/EC, Euratom of 7 April 2006 on the re-use of Commission information (Official Journal L 107, 20.4.2006, pp. 38–41), these data may be disseminated, but only within the limits set by the Decision.
By agreement with the European Commission's Office for Official Publications (OPOCE), the acquis can be used and distributed for research purposes, but the following conditions for use must be observed:
The European Communities consider legislative and quasi-legislative documents published in the Official Journal of the European Union and related COM and SEC series plus charters, treaties and Court of Justice case law to be in the public domain. Prior written permission is not required for their reproduction/translation, and they may be reproduced/translated freely without restriction, including for the purpose of further non-commercial dissemination to final users, subject to the condition that appropriate acknowledgement is given to the European Communities and to the source, and provided the additional guidelines set out below are observed.
(1) Whenever a document is reproduced verbatim from a source other than the printed version of the Official Journal of the European Union, a prominently positioned disclaimer should read: "Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic."
(2) For the reasons stated in the disclaimer above, it is advisable to ensure that translations are from the printed, authentic version of the Official Journal. This precaution, while minimising the risk of error, does not confer any legal status whatsoever on the translated text. The following notice must accompany the translated text, printed below the acknowledgement:
"Originally published in the official languages of the European Union in the Official Journal of the European Union by the Office for Official Publications of the European Communities. Responsibility for the translation into [specify language] from the original [specify language] edition lies entirely with [name of translation copyright holder]."
Moreover, please note that inclusion, as reference material for consultation purposes, of small amounts of relevant legislative texts in articles/theses/studies/reports/books issued by third-party authors or publishers, by whatever means, and disseminated subject to payment is not considered "further commercial dissemination".