CLARIN_wg5_7_ISST_formats

ISST – Italian Treebank Conversion to the CoNLL format

The ISST Treebank is available in two formats:

- the ISST format
- the CoNLL 2007 format (ISST@CoNLL)

The ISST Treebank has a three-level structure ranging over syntactic and semantic levels. Syntactic annotation is distributed over two different levels:

- the constituent structure, which is annotated in terms of phrase structure trees; and
- the functional relations, which characterise the sentence in terms of grammatical functions (subject, object, etc.).

With respect to other treebanks, the ISST multilevel structure shows two main novelties: it combines syntactic and lexico-semantic annotation, creating the prerequisites for corpus-based analysis of the syntax-semantics interface, and it adopts a distributed approach to syntactic annotation. The two syntactic annotation levels are intended to provide orthogonal views of the same surface syntax, an arrangement particularly suited to a language like Italian.

The ISST Treebank consists of 305,547 word tokens reflecting contemporary language use. It is composed of two sections: a “balanced” corpus of 216,606 tokens and a specialised corpus of 89,941 tokens with texts from the financial domain. The balanced corpus contains different types of Italian texts, namely newspaper articles and a number of different periodicals selected to cover a wide variety of topics (politics, economy, culture, science, health, sport, leisure, etc.). The balanced corpus covers a ten-year period from 1985 to 1995, while the specialised corpus includes articles published in 1994.

The morpho-syntactic annotation was carried out in the framework of the EU-funded PAROLE and ELSNET projects. The adopted morpho-syntactic tagset conforms to the EAGLES international standard.
The constituency annotation departs from other constituency-based annotation schemes, like the Penn Treebank, in a number of respects, which relate to:

- the peculiarity of Italian, which qualifies as a (relatively) free word order language;
- the distributed organization of syntactic annotation in ISST.

Because the functional level and the constituency level are kept separate in ISST, the constituency annotation can dispense with empty elements (such as traces) that would otherwise be needed to handle pro-drop, topicalisation or non-canonical order of constituents, which makes the annotation more intelligible. These syntactic phenomena are accounted for not in terms of empty categories and coindexation but at the functional level.

Functional annotation in ISST is carried out by marking relations between words which belong to major lexical classes only (e.g. non-auxiliary verbs, nouns and adjectives), independently of any previous identification of phrasal constituents. It is based on a version of the FAME annotation scheme, revised to make it suitable for annotating open-domain texts.

ISST@CoNLL was developed to build the Italian corpus for the CoNLL-2007 Shared Task. In particular, ISST@CoNLL was built on top of the morpho-syntactic annotation and syntactic dependency annotation layers. The conversion was carried out semi-automatically, in cooperation between ILC-CNR and the Dipartimento di Informatica of the University of Pisa. The conversion was in charge of:

- combining the information coming from the two annotation levels, functional and constituent;
- converting the ISST annotation for dependency relations into the CoNLL tabular format.

One of the main issues was that ISST dependency relations are expressed as binary relations between major lexical classes only. The information about grammatical words is encoded as features associated with the participants in the relation.
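For orientation, the CoNLL tabular format referred to above stores one token per line with ten tab-separated fields (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with HEAD set to 0 for a root token. The following is a minimal Python sketch of reading one sentence in that layout and checking that its HEAD values form a tree. The sample sentence, its tags and its relation labels are invented for illustration and are not taken from ISST@CoNLL; the single-root requirement is an assumption of this sketch rather than a constraint of the format itself.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One row of a CoNLL-X / CoNLL 2007 dependency file."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # ID of the governing token, 0 for the root
    deprel: str
    phead: str
    pdeprel: str


def parse_sentence(block: str) -> List[Token]:
    """Parse the lines of a single sentence (one token per line)."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                            cols[5], int(cols[6]), cols[7], cols[8], cols[9]))
    return tokens


def is_tree(tokens: List[Token]) -> bool:
    """True if there is exactly one root and every token reaches it."""
    heads = {t.id: t.head for t in tokens}
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False  # assumption of this sketch: one root per sentence
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            if node in seen or node not in heads:
                return False  # cycle, or head pointing outside the sentence
            seen.add(node)
            node = heads[node]
    return True


# Hypothetical example sentence; tags and labels are placeholders.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "Il", "il", "R", "RS", "_", "2", "det", "_", "_"],
    ["2", "gatto", "gatto", "S", "S", "_", "3", "subj", "_", "_"],
    ["3", "dorme", "dormire", "V", "V", "_", "0", "ROOT", "_", "_"],
    ["4", ".", ".", "F", "FS", "_", "3", "punc", "_", "_"],
])

tokens = parse_sentence(SAMPLE)
print(is_tree(tokens))
```

Because each row has a single HEAD field, a token cannot carry more than one governor in this layout, so any source annotation that assigns several heads to one token has to be resolved before it can be written out in this form.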
During the conversion process, the dependency relations involving grammatical words had to be reconstructed from the original ISST annotation, and the existing dependency relations had to be revised accordingly. Other conversion issues concerned:

- multi-headed tokens, which caused the dependency structure not to be a tree;
- empty tokens, representing omitted subjects due to the pro-drop property of Italian;
- identification of the sentence root;
- insertion of dependencies involving punctuation.

The conversion was carried out by means of several scripts, followed by manual post-processing with a graphical annotation tool. For these reasons, a fully automatic converter has not been implemented.

ISST@CoNLL is a subset of the balanced ISST corpus comprising 79,654 word tokens (65,016 non-punctuation tokens), for a total of 4,162 sentences.

REFERENCES

Lenci, A., S. Montemagni, V. Pirrelli and C. Soria. 1999. FAME: a Functional Annotation Metascheme for multi-modal and multi-lingual Parsing Evaluation, in Proceedings of the ACL Workshop on Computer-Mediated Language Assessment and Evaluation, Olsen M.B. (ed), ACL, pp. 39-46.

Lenci, A., S. Montemagni, V. Pirrelli and C. Soria. 2000. Where opposites meet. A Syntactic Meta-scheme for Corpus Annotation and Parsing Evaluation, in Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece.

Montemagni S., Barsotti F., Battista M., Calzolari N., Corazzari O., Zampolli A., Fanciulli F., Massetani M., Raffaelli R., Basili R., Pazienza M.T., Saracino D., Zanzotto F., Mana N., Pianesi F., Delmonte R. 2000. "The Italian Syntactic-Semantic Treebank: Architecture, Annotation, Tools and Evaluation", in Proceedings of the Workshop ‘Linguistically Interpreted Corpora’, within “The 18th International Conference on Computational Linguistics” [COLING-2000], Luxembourg, 5-6 August 2000, pp. 18-27.

Montemagni S., Barsotti F., Battista M., Calzolari N., Corazzari O., Lenci A., Zampolli A., Fanciulli F., Massetani M., Raffaelli R., Basili R., Pazienza M.T., Saracino D., Zanzotto F., Mana N., Pianesi F., Delmonte R. 2003. "The Syntactic-Semantic Treebank of Italian: an Overview", in Computational Linguistics in Pisa – Linguistica Computazionale a Pisa. Linguistica Computazionale, Special Issue, XVI-XVII. Istituti Editoriali e Poligrafici Internazionali, Pisa-Roma. Tomo I, pp. 461-492.

Montemagni, S. and M. Simi. 2007. The Italian dependency annotated corpus developed for the CoNLL-X Shared Task ISST-CoNLL. Technical Report. (http://www.ilc.cnr.it/viewpage.php/sez=ricerca/id=894/vers=ita)
