FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser

Violeta Seretan, Eric Wehrli, Luka Nerima, Gabriela Soare
LATL – Language Technology Laboratory
Faculté des Lettres, Département de Linguistique
{violeta.seretan, eric.wehrli, luka.nerima, gabriela.soare}@unige.ch

Romanian language

Vocabulary
• Latin origin (fundamental vocabulary)
• Slavic origin
• Neologisms: French, Italian, …
• Loanwords: Turkish, Greek, Hungarian, Albanian, ...

Morphology
• Case system inherited from Latin: nominative-accusative, genitive-dative, vocative
• Three grammatical genders: masculine, feminine, neuter
• Rich inflection of determiners, nouns, adjectives, and verbs, e.g., about 35 forms for a verb
• The definite article is enclitic, i.e., suffixed to nouns and adjectives:
  – casă/house – casa/house-the
  – mare/big – marea/big-the

Orthography
• Phonemic; Latin alphabet (since 1859)
• Diacritics: ă/ə, â/ɨ, î/ɨ; cedilla: ş/ʃ, ţ/ʦ

Syntax
• VSO language, relatively free word order

[Map: Europe - Romance languages]

Sample text
Prezentul regulament intră în vigoare în a douăzecea zi de la publicarea în Jurnalul Oficial al Uniunii Europene.
This Regulation shall enter into force on the twentieth day following that of its publication in the Official Journal of the European Union.
http://wt.jrc.it/lt/Acquis/

Fips: a multilingual parsing architecture (Wehrli, 2007)

Underlying theory
• Generative Grammar (Chomsky, 1995)
• Similarities:
  – Simpler Syntax (Culicover and Jackendoff, 2005)
  – Lexical Functional Grammar (Bresnan, 2001)

Output
• Rich sentence representation:
  – constituent structure
  – predicate-argument table
  – co-indexation chains
  – intra-sentential pronoun resolution

Implementation
• Left-to-right, bottom-up tabular parsing algorithm, relying on detailed lexical information
• Language-independent core + language-specific implementation
• Component Pascal, OOP paradigm, BlackBox IDE
• Supported languages: French, English, German, Spanish, Italian, Greek; others in progress

Extending Fips to Romanian: two main tasks

Lexicon construction
• List of headwords (DEX, 1998)
• Morphological generation: given a base word form, generates all its forms according to the appropriate inflection paradigm
• Manual and semi-automatic insertion
• Manual insertion for verbs (specific information: subcategorization, selectional features, thematic function, …)
• Current status:
  – simple entries: 60K lexemes / 380K words (10K proper nouns)
  – complex entries: multi-word expressions (compounds and collocations), e.g., de jur împrejurul “around”; problemă – a se pune “problem – to arise”

Grammar implementation
• Specifications (Soare, 2005)
• Customisation of FipsRomanian grammar for standard operations (syntactic transformations: relativization, interrogation, passivization, ...)
• Similarities and differences. Examples:
  – clitic system
  – wh-fronting
• Attachment rules: constraints on the main parser operation, Merge, which combines two adjacent structures into a larger structure
• Current status: about 100 rules specified; nearly half implemented and tested

FipsRomanian: Sample results
[Figure: Sample parse tree produced by Fips (labels: direct object, subject, predicate)]

Preliminary results

Parsing experiment
• Data: journalistic texts, 1.05M words
• Average sentence length: 26.9 tokens
• Full parses: 16.2% (FipsFrench, FipsEnglish: about 80%)
• Average length of partial parses: 5.3 tokens
• Unknown words: 6.5% (of which 39.2% proper nouns)
• Satisfactory lexical coverage
• Grammatical coverage needs to be improved (work in progress)

Screen captures
[Figures: parsing output, POS-tagging output, Fips interface, Lexicon interface]

Task-based evaluation
• Collocation extraction from parsed data (Seretan, 2008)
• Collocations are half-idioms (idioms of encoding, but not of decoding)
• Used by the parser and by the in-house rule-based machine translation system
• Precision for top 2000 results: 30.3% (precision for French data: 65.9%, top 500 results)
[Table: Sample collocations extracted]

Related work & Useful resources
• Data-driven dependency parser for Romanian based on the MaltParser; learns dependencies from manual annotations (Călăcean and Nivre, 2009). Problem: reduced treebank size and grammatical coverage (simple structures, no subordination, average sentence length only 9 words).
• Sketch Engine for Romanian: shallow parsing (POS patterns), http://www.sketchengine.co.uk/
• Dependency treebank construction, work in progress at the University of Iaşi, Romania
• Text processing webservices, RACAI – Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania. http://www.racai.ro/webservices/TextProcessing.aspx
• A repository of tools for Romanian: ConsILR - Consortium for the Romanian Language: Resources & Tools, research groups from Iaşi, Bucharest and Chişinău. http://consilr.info.uaic.ro/

References
Bresnan, J. 2001. Lexical Functional Syntax. Blackwell, Oxford.
Călăcean, M. and J. Nivre. 2009. A data-driven dependency parser for Romanian. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 7), pages 65–76, Groningen, Holland.
Chomsky, N. 1995. The Minimalist Program. MIT Press, Cambridge, Mass.
DEX. 1998. Dicţionarul explicativ al limbii române. Academia Română, Bucharest.
Seretan, V. 2008. Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
Soare, G. 2005. Romanian syntax. Technical report, University of Geneva.
Wehrli, E. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic.
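
Illustration: paradigm-based morphological generation

The morphological generation step listed under Lexicon construction (a base form plus an inflection paradigm yields all word forms) can be illustrated with a minimal Python sketch. It is not the Fips implementation, which is written in Component Pascal. The paradigm names (noun_fem_a, adj_e), the feature labels, and the Rule type are hypothetical, and each toy paradigm covers only a few forms, including the enclitic definite article from the examples casă/casa and mare/marea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    strip: str  # ending removed from the base form
    add: str    # ending appended in its place


# Hypothetical toy paradigms (assumption: names and feature labels are
# invented for this sketch; a real lexicon needs far more paradigms/forms).
PARADIGMS = {
    # Feminine nouns ending in -ă, e.g. casă "house"
    "noun_fem_a": {
        "sg.nom-acc.indef": Rule("", ""),
        "sg.nom-acc.def": Rule("ă", "a"),     # enclitic definite article
        "sg.gen-dat.indef": Rule("ă", "e"),
        "sg.gen-dat.def": Rule("ă", "ei"),
        "pl.nom-acc.indef": Rule("ă", "e"),
        "pl.nom-acc.def": Rule("ă", "ele"),
        "pl.gen-dat.def": Rule("ă", "elor"),
    },
    # Adjectives ending in -e, e.g. mare "big" (only two forms shown)
    "adj_e": {
        "sg.indef": Rule("", ""),
        "fem.sg.def": Rule("", "a"),          # enclitic definite article
    },
}


def generate(base: str, paradigm: str) -> dict[str, str]:
    """Return every form the named paradigm defines for the base form."""
    forms = {}
    for features, rule in PARADIGMS[paradigm].items():
        if not base.endswith(rule.strip):
            raise ValueError(f"{base!r} does not fit paradigm {paradigm!r}")
        # Cut the old ending by length so that an empty strip is a no-op.
        stem = base[: len(base) - len(rule.strip)]
        forms[features] = stem + rule.add
    return forms
```

For instance, generate("casă", "noun_fem_a")["sg.nom-acc.def"] yields casa and generate("mare", "adj_e")["fem.sg.def"] yields marea. Storing paradigms as data rather than code means a new lexeme only needs a base form and a paradigm name, which is one plausible way to support semi-automatic insertion.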