CLARIN_wg5_7_ISST_formats

ISST – Italian Treebank Conversion to the CoNLL format
The ISST Treebank is available in two formats:


the ISST format
the CoNLL 2007 format (ISST@CoNLL)
The ISST Treebank has a three-level structure ranging over syntactic and semantic levels.
Syntactic annotation is distributed over two different levels:


the constituent structure, which is annotated in terms of phrase structure trees; and
the functional relations, which provide a characterisation of the sentence in terms of
grammatical functions (subject, object, etc.)
With respect to other treebanks, the ISST multilevel structure shows two main novelties: it
combines syntactic and lexico-semantic annotation, creating the prerequisites for corpus-based
analysis of the syntax-semantics interface, and it adopts a distributed approach to
syntactic annotation. The two syntactic annotation levels are intended to provide orthogonal
views of the same surface syntax, an approach particularly suited to a language like Italian.
The ISST Treebank consists of 305,547 word tokens reflecting contemporary language use. It is
composed of two different sections: a "balanced" corpus of 216,606 tokens and a specialised
corpus of 89,941 tokens with texts belonging to the financial domain. The balanced corpus
contains different types of Italian texts, namely newspaper articles and a number of different
periodicals selected to cover a wide variety of topics (politics, economy, culture, science, health,
sport, leisure, etc.). The balanced corpus covers a 10-year period from 1985 to 1995, while the
specialised corpus includes articles published in 1994.
The morpho-syntactic annotation was carried out in the framework of the EU-funded PAROLE
and ELSNET projects. The adopted morpho-syntactic tagset conforms to the EAGLES
international standard.
The constituency annotation departs from other constituency-based annotation schemes, like the
Penn Treebank, in a number of respects, owing to factors such as:


the peculiarity of Italian, which qualifies as a (relatively) free word order language;
the distributed organization of syntactic annotation in ISST.
Because the functional level and the constituency level are kept separate in ISST, the
constituency annotation can dispense with empty elements such as traces, which makes the
annotation more intelligible. Phenomena such as pro-drop, topicalisation and non-canonical
order of constituents are not accounted for in terms of empty categories and coindexation
but at the functional level.
Functional annotation in ISST is carried out by marking relations between words which belong to
major lexical classes only (e.g. non-auxiliary verbs, nouns and adjectives), independently of
any previous identification of phrasal constituents. It is based on a revised version of the
FAME annotation scheme, adapted to make it suitable for the annotation of open-domain texts.
ISST@CoNLL was developed as the Italian corpus for the CoNLL-2007 Shared
Task. In particular, ISST@CoNLL was built on top of the morpho-syntactic annotation and
syntactic dependency annotation layers. The conversion was carried out semi-automatically,
as a joint effort of ILC-CNR and the Dipartimento di Informatica of
the University of Pisa.
The conversion involved:
- combining the information coming from the two annotation levels, functional and constituent;
- converting the ISST annotation for dependency relations into the CoNLL tabular format.
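To make the target of the conversion concrete, the following is a minimal sketch of a reader for the ten-column, tab-separated layout used by the CoNLL-X and CoNLL 2007 shared tasks (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with one token per line and a blank line between sentences. The sample sentence, its tags and its relation labels are invented for illustration and do not reproduce the actual ISST@CoNLL tagset; the names `Token`, `parse_conll` and `SAMPLE` are likewise hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One row of the ten-column CoNLL-X / CoNLL 2007 format."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # 0 means the token is attached to the artificial root
    deprel: str
    phead: str
    pdeprel: str


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL text into sentences, each a list of Token rows."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                             cols[5], int(cols[6]), cols[7], cols[8], cols[9]))
    if current:                       # last sentence without trailing blank line
        sentences.append(current)
    return sentences


# Invented example; tags and labels are placeholders, not the ISST tagset.
SAMPLE = (
    "1\tMaria\tMaria\tN\tN\t_\t2\tsubj\t_\t_\n"
    "2\tlegge\tleggere\tV\tV\t_\t0\tROOT\t_\t_\n"
    "3\tun\tun\tD\tD\t_\t4\tdet\t_\t_\n"
    "4\tlibro\tlibro\tN\tN\t_\t2\tobj\t_\t_\n"
    "5\t.\t.\tP\tP\t_\t2\tpunc\t_\t_\n"
)

sentences = parse_conll(SAMPLE)
for tok in sentences[0]:
    print(tok.id, tok.form, tok.head, tok.deprel)
```

Each token carries exactly one HEAD value, which is why the conversion had to resolve every ISST relation into a single-headed attachment.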
One of the main issues was that the ISST dependency relations are expressed as binary
relations between major lexical classes only. The information about grammatical words is
encoded in terms of features associated with the participants in the relation. During the
conversion process, the dependency relations involving grammatical words had to be
reconstructed from the original ISST annotation, and the already existing dependency
relations had to be revised accordingly. Other conversion issues concerned: multi-headed
tokens, which prevented the dependency structure from being a tree; empty tokens,
representing omitted subjects due to the pro-drop property of Italian; identification of
the sentence root; and insertion of dependencies involving punctuation.
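Two of these issues, multi-headed tokens and root identification, amount to well-formedness checks on the dependency structure. The sketch below illustrates such checks under stated assumptions: relations are given as hypothetical `(head, dependent)` pairs of token indices, and a well-formed sentence is assumed to have exactly one token attached to the artificial root 0. It is not the procedure used in the actual conversion, which the text describes only in general terms.

```python
from typing import Dict, Iterable, List, Set, Tuple


def multi_headed(relations: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Return the dependents that receive more than one head.

    `relations` is an iterable of (head, dependent) token-index pairs.
    """
    heads: Dict[int, Set[int]] = {}
    for head, dep in relations:
        heads.setdefault(dep, set()).add(head)
    return {dep: hs for dep, hs in heads.items() if len(hs) > 1}


def is_tree(heads: List[int]) -> bool:
    """Check that a single-headed structure is a tree.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    Assumption: exactly one token may attach to 0.
    """
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:
        return False
    for start in range(1, n + 1):
        seen: Set[int] = set()
        node = start
        while node != 0:              # climb towards the root
            if node in seen or not (1 <= node <= n):
                return False          # cycle or head index out of range
            seen.add(node)
            node = heads[node - 1]
    return True


# Token 1 has two heads (2 and 4), so it must be resolved before export.
print(multi_headed([(2, 1), (4, 1), (2, 4)]))
# A well-formed five-token sentence rooted in token 2.
print(is_tree([2, 0, 4, 2, 2]))
# Tokens 1 and 2 point at each other, forming a cycle.
print(is_tree([2, 1, 0]))
```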
The conversion was carried out by means of several scripts, followed by manual
post-processing with a graphical annotation tool. For these reasons, an automatic converter
has not been implemented.
ISST@CoNLL is a subset of the balanced ISST corpus comprising 79,654 word tokens (65,016 non-punctuation tokens), for a total of 4,162 sentences.
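Figures of this kind (sentences, tokens, non-punctuation tokens) can be recomputed from a file in the CoNLL tabular format. The sketch below is one possible way to do so; it assumes that a punctuation token is one whose FORM consists solely of Unicode punctuation characters, which may differ from the criterion used to obtain the counts reported above. The function names and the sample text are hypothetical.

```python
import unicodedata
from typing import Tuple


def is_punctuation(form: str) -> bool:
    """True if every character of the form is Unicode punctuation (category P*)."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def corpus_counts(conll_text: str) -> Tuple[int, int, int]:
    """Return (sentences, tokens, non-punctuation tokens) for CoNLL-style text."""
    sentences = tokens = non_punct = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if not line.strip():          # blank line ends the current sentence
            in_sentence = False
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        form = line.split("\t")[1]    # FORM is the second column
        tokens += 1
        if not is_punctuation(form):
            non_punct += 1
    return sentences, tokens, non_punct


# Invented two-sentence sample with only the first two columns filled in.
sample = "1\tMaria\n2\tlegge\n3\t.\n\n1\tPiove\n2\t.\n"
print(corpus_counts(sample))
```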
REFERENCES
Lenci, A., S. Montemagni, V. Pirrelli and C. Soria. 1999. FAME: a Functional Annotation
Meta-scheme for multi-modal and multi-lingual Parsing Evaluation, in Proceedings of the ACL
Workshop on Computer-Mediated Language Assessment and Evaluation, Olsen M.B.
(ed.), ACL, pp. 39-46.
Lenci, A., S. Montemagni, V. Pirrelli and C. Soria. 2000. Where opposites meet. A Syntactic
Meta-scheme for Corpus Annotation and Parsing Evaluation, in Proceedings of the 2nd
International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece.
Montemagni S., Barsotti F., Battista M., Calzolari N., Corazzari O., Zampolli A., Fanciulli F.,
Massetani M., Raffaelli R., Basili R., Pazienza M.T., Saracino D., Zanzotto F., Mana N., Pianesi
F., Delmonte R. 2000. "The Italian Syntactic-Semantic Treebank: Architecture, Annotation,
Tools and Evaluation". In Proceedings of the Workshop "Linguistically Interpreted Corpora",
within "The 18th International Conference on Computational Linguistics" (COLING-2000),
Luxembourg, 5-6 August 2000, pp. 18-27.
Montemagni S., Barsotti F., Battista M., Calzolari N., Corazzari O., Lenci A., Zampolli A.,
Fanciulli F., Massetani M., Raffaelli R., Basili R., Pazienza M.T., Saracino D., Zanzotto F., Mana
N., Pianesi F., Delmonte R. 2003. "The Syntactic-Semantic Treebank of Italian: an Overview",
in Computational Linguistics in Pisa – Linguistica Computazionale a Pisa. Linguistica
Computazionale, Special Issue, XVI-XVII. Istituti Editoriali e Poligrafici Internazionali,
Pisa-Roma. Tomo I, pp. 461-492.
Montemagni, S. and M. Simi. 2007. The Italian dependency annotated corpus
developed for the CoNLL-X Shared Task ISST-CoNLL. Technical Report.
(http://www.ilc.cnr.it/viewpage.php/sez=ricerca/id=894/vers=ita)