This multilingual parallel corpus has been compiled by the

Multilingual extraction of the acquis

1 The acquis communautaire

Acquis communautaire means the entire body of European legislation, including all the treaties, regulations and directives adopted by the European Union and the rulings of the Court of Justice. Since each new country joining the EU is required to accept the whole acquis communautaire, this body of laws has to be translated into all 23 official languages. As a result, the acquis now exists as parallel texts in 23 languages.

Multilingual parallel corpora are an invaluable source for developing multilingual language technologies (machine translation, terminology extraction, sense disambiguation, etc.). The value of a parallel corpus grows as the number of translation units and of languages increases, especially if each language can be paired with any other.

2 Source

This extraction of aligned sentences can be used to produce a parallel multilingual corpus of the legislative documents (acquis) of the European Union in 22 EU languages.

The aligned sentences ("translation units") have been provided by the Directorate-General for Translation of the European Commission by extraction from one of its large shared translation memories in Euramis (European advanced multilingual information system). In order to cut down the size, the extraction takes English as the source language. Users can select the language combination they want, using the extraction extension tool also provided by DGT.

The extraction does not cover the whole acquis since it is taken from a database that does not contain the complete EU legislation. The sequence in the extracted files is not necessarily the same as in the underlying documents, and redundancies like "Article 1" are inevitable. The documents in the files are identified by the document number (Numdoc) of the original legislative document in the EUR-Lex database. The documents are in TMX, a widely used format provided by LISA (http://www.lisa.org/standards/tmx/tmx.htm): in order to be backwards compatible, the header mentions TMX format 1.1, but the files are also compliant with TMX 1.4b.
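As an illustration of how a language pair might be pulled out of such a file, the following is a minimal Python sketch. It is not the extraction tool provided by DGT. It assumes that each translation unit is a tu element containing one tuv element per language, that the language is given in a lang attribute (TMX 1.1 style) or an xml:lang attribute (TMX 1.4b style) with codes such as EN-GB, and that the text sits in a seg element. The function name extract_pairs and the sample content are made up for this example.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, source="EN", target="FR"):
    """Return (source, target) segment pairs found in a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.1 uses "lang"; TMX 1.4b uses "xml:lang".
            code = tuv.get("lang") or tuv.get(XML_LANG) or ""
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, e.g. "EN" from "EN-GB".
                key = code.split("-")[0].upper()
                segments[key] = "".join(seg.itertext()).strip()
        # Units that lack either language are skipped.
        if source in segments and target in segments:
            pairs.append((segments[source], segments[target]))
    return pairs


# Made-up sample in the assumed layout (not taken from the real files).
SAMPLE = """<tmx version="1.1">
  <header/>
  <body>
    <tu>
      <tuv lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv lang="FR-FR"><seg>Article premier</seg></tuv>
      <tuv lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
    <tu>
      <tuv lang="EN-GB"><seg>This Decision shall enter into force today.</seg></tuv>
      <tuv lang="DE-DE"><seg>Diese Entscheidung tritt heute in Kraft.</seg></tuv>
    </tu>
  </body>
</tmx>"""

print(extract_pairs(SAMPLE, "EN", "FR"))
print(extract_pairs(SAMPLE, "EN", "DE"))
```

Because English is the source language of the extraction, a unit is only usable for a given pair when the target language is also present, which is why the sketch drops incomplete units. For files on disk, ET.parse(path).getroot() can replace ET.fromstring so that the parser handles the encoding declared in the file.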

DGT cannot assume any responsibility for the quality or the content of the data.

3 Description of the data

Before the documents were aligned and corrected, they were pre-processed to remove any differences between the source and target language versions (for further details see EUR-Lex preprocessing.doc). This means that the contents of the documents might have changed. The documents were aligned in accordance with the segmentation rules used in the Directorate-General for Translation of the European Commission. The extraction keeps only the EUR-Lex document number (Numdoc), from which other information (e.g. year and document type) can be derived (for further information on the Numdoc structure, see http://europa.eu.int/eurlex/en/information/help/help-dir.html).
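To illustrate how the year and document type might be derived from a document number, here is a small Python sketch. The layout it assumes (one sector character, a four-digit year, a one- or two-letter document type, then a serial number) is an assumption made for this example and should be checked against the Numdoc documentation referenced above. The function name parse_numdoc and the sample identifier are likewise illustrative.

```python
import re

# Assumed layout: sector (1 char), year (4 digits),
# document type (1-2 letters), serial number (digits).
NUMDOC_PATTERN = re.compile(
    r"^(?P<sector>[0-9A-Z])(?P<year>\d{4})(?P<doc_type>[A-Z]{1,2})(?P<number>\d+)$"
)


def parse_numdoc(numdoc):
    """Split a document number into its assumed parts, or return None."""
    match = NUMDOC_PATTERN.match(numdoc.strip())
    if match is None:
        # Identifiers that do not fit the assumed layout are not guessed at.
        return None
    fields = match.groupdict()
    fields["year"] = int(fields["year"])
    return fields


# Illustrative identifier in the assumed layout.
print(parse_numdoc("32006D0291"))
```

Returning None for anything that does not match keeps the sketch from silently misreading identifiers with a different structure.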

4 Statistics

The extraction of the acquis is available in 22 languages. The units in the multilingual extraction files are distributed as follows:

Language   Number of units
EN         2 187 504
BG           708 658
CS           890 025
DA           433 871
DE           532 668
EL           371 039
ES           509 054
ET         1 047 503
FI           514 868
FR         1 106 442
HU         1 159 975
IT           542 873
LT         1 126 255
LV         1 120 835
MT         1 021 855
NL           502 557
PL         1 052 136
PT           945 203
RO           650 735
SK         1 065 399
SL         1 026 668
SV           555 362

5 Conditions for use/licensing issues

Under Commission Decision 2006/291/EC, Euratom of 7 April 2006 on the re-use of Commission information (Official Journal L 107, 20.4.2006, pp. 38–41), these data may be disseminated, but only within the limits set by the Decision.

By agreement with the European Commission's Office for Official Publications (OPOCE), the acquis can be used and distributed for research purposes, but the following conditions for use must be observed:

The European Communities consider legislative and quasi-legislative documents published in the Official Journal of the European Union and related COM and SEC series plus charters, treaties and Court of Justice case law to be in the public domain. Prior written permission is not required for their reproduction/translation, and they may be reproduced/translated freely without restriction, including for the purpose of further non-commercial dissemination to final users, subject to the condition that appropriate acknowledgement is given to the European Communities and to the source, and provided the additional guidelines set out below are observed.

(1) Whenever a document is reproduced verbatim from a source other than the printed version of the Official Journal of the European Union, a prominently positioned disclaimer should read: "Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic."

(2) For the reasons stated in the disclaimer above, it is advisable to ensure that translations are from the printed, authentic version of the Official Journal. This precaution, while minimising the risk of error, does not confer any legal status whatsoever on the translated text. The following notice must accompany the translated text, printed below the acknowledgement:

"Originally published in the official languages of the European Union in the Official Journal of the European Union by the Office for Official Publications of the European Communities. Responsibility for the translation into [specify language] from the original [specify language] edition lies entirely with [name of translation copyright holder]."

Moreover, please note that inclusion, as reference material for consultation purposes, of small amounts of relevant legislative texts in articles/theses/studies/reports/books issued by third-party authors or publishers, by whatever means, and disseminated subject to payment is not considered "further commercial dissemination".
