FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser
Violeta Seretan, Eric Wehrli, Luka Nerima, Gabriela Soare
LATL – Language Technology Laboratory
Romanian language
Extending Fips to Romanian: two main tasks
Vocabulary
• Latin origin (fundamental vocabulary)
• Slavic origin
• Neologisms: French, Italian, …
• Loanwords: Turkish, Greek, Hungarian, Albanian, ...
Morphology
• Case system inherited from Latin:
nominative-accusative, genitive-dative, vocative
• Three grammatical genders:
masculine, feminine, neuter
[Figure: Europe - Romance languages]
Sample text
Prezentul regulament intră în vigoare în a douăzecea zi de la publicarea în Jurnalul Oficial al Uniunii Europene.
http://wt.jrc.it/lt/Acquis/
{violeta.seretan, eric.wehrli, luka.nerima, gabriela.soare}@unige.ch
This Regulation shall enter into force on the twentieth day following that of its publication in the Official Journal of the European Union.
• Rich inflection of determiners, nouns,
adjectives, and verbs
e.g., about 35 forms for a verb
• The definite article is enclitic, i.e., suffixed to
nouns and adjectives:
casă/house – casa/house-the
mare/big – marea/big-the
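The two examples above can be reproduced with a minimal sketch. It assumes feminine singular nominative-accusative forms only and covers just three endings; the function name `add_definite_article` and the rule set are illustrative assumptions, not part of Fips.

```python
def add_definite_article(word: str) -> str:
    """Attach the enclitic definite article to a feminine singular base form.

    Illustrative sketch only: it ignores masculine and neuter words,
    plurals, and genitive-dative forms.
    """
    if word.endswith("ă"):
        return word[:-1] + "a"   # casă -> casa
    if word.endswith("ie"):
        return word[:-1] + "a"   # familie -> familia
    if word.endswith("e"):
        return word + "a"        # mare -> marea
    raise ValueError(f"no rule for {word!r} in this sketch")


for base in ("casă", "mare"):
    print(base, "->", add_definite_article(base))
```

A full treatment would select the article form from the word's inflection paradigm rather than from its ending alone.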
Orthography
• phonemic; Latin alphabet (since 1859)
• Diacritics: ă/ə, â/ɨ, î/ɨ; cedilla: ş/ʃ, ţ/ʦ
Syntax
Lexicon construction
Grammar implementation
• list of headwords (DEX, 1998)
• morphological generation: given a base word
form, generates all its forms according to the
appropriate inflection paradigm
• Specifications (Soare, 2005)
• Customisation of FipsRomanian grammar for
standard operations (syntactic
transformations: relativization, interrogation,
passivization, ...)
• Similarities and differences. Examples:
– clitic system
• manual and semi-automatic insertion
• manual insertion for verbs (specific information:
subcategorization, selectional features, thematic
function, …)
• Current status:
– simple entries:
60K lexemes / 380K words
(10K proper nouns)
– complex entries: multi-word expressions
(compounds and collocations):
de jur împrejurul “around”
problemă – a se pune “problem – to arise”
• VSO language, relatively free word order
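The morphological generation step listed under Lexicon construction (a base form plus an inflection paradigm yields all word forms) can be sketched as follows. The `Paradigm` structure, the feature labels, and the single paradigm `FEM_A` are assumptions made for illustration; the actual Fips lexicon format is not described here.

```python
from typing import Dict, NamedTuple


class Paradigm(NamedTuple):
    strip: str                # ending removed from the base form to get the stem
    endings: Dict[str, str]   # feature label -> ending appended to the stem


# Hypothetical paradigm for feminine nouns in -ă such as "casă" (house).
FEM_A = Paradigm(
    strip="ă",
    endings={
        "sg.nom-acc.indef": "ă",
        "sg.gen-dat.indef": "e",
        "sg.nom-acc.def": "a",
        "sg.gen-dat.def": "ei",
        "pl.nom-acc.indef": "e",
        "pl.nom-acc.def": "ele",
        "pl.gen-dat.def": "elor",
    },
)


def generate_forms(base: str, paradigm: Paradigm) -> Dict[str, str]:
    """Return every inflected form of `base` licensed by `paradigm`."""
    if not base.endswith(paradigm.strip):
        raise ValueError(f"{base!r} does not fit this paradigm")
    stem = base[: len(base) - len(paradigm.strip)]
    return {feature: stem + ending for feature, ending in paradigm.endings.items()}


for feature, form in generate_forms("casă", FEM_A).items():
    print(f"{feature:18} {form}")
```

Keeping paradigms as data means a new headword only needs a base form and a paradigm label, which is what makes semi-automatic insertion practical.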
Fips: a multilingual parsing architecture (Wehrli, 2007)
Underlying theory
Output
• Generative Grammar (Chomsky, 1995)
Similarities:
• Simpler Syntax (Culicover and Jackendoff, 2005)
• Lexical Functional Grammar (Bresnan, 2001)
• Rich sentence representation:
– constituent structure
– predicate-argument table
– co-indexation chains
– intra-sentential pronoun resolution
– wh-fronting
• Attachment rules: constraints on the main
parser operation, Merge, which combines
two adjacent structures into a larger structure
• Current status: about 100 rules specified;
nearly half implemented and tested
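The role of attachment rules as constraints on Merge can be illustrated with a minimal sketch. The classes `Structure` and `AttachmentRule`, the category labels, and the two rules are hypothetical; real Fips rules carry much richer conditions (agreement, subcategorization, and so on).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Structure:
    category: str                          # e.g. "D", "N", "DP"
    start: int                             # index of the first token covered
    end: int                               # index just past the last token covered
    children: Tuple["Structure", ...] = ()


@dataclass(frozen=True)
class AttachmentRule:
    left: str     # category required of the left structure
    right: str    # category required of the right structure
    result: str   # category of the merged structure


# Hypothetical rule set, for illustration only.
RULES = (
    AttachmentRule("D", "N", "DP"),
    AttachmentRule("V", "DP", "VP"),
)


def merge(left: Structure, right: Structure,
          rules: Tuple[AttachmentRule, ...] = RULES) -> Optional[Structure]:
    """Combine two adjacent structures if some attachment rule licenses it."""
    if left.end != right.start:            # Merge only applies to adjacent spans
        return None
    for rule in rules:
        if (rule.left, rule.right) == (left.category, right.category):
            return Structure(rule.result, left.start, right.end, (left, right))
    return None


dp = merge(Structure("D", 1, 2), Structure("N", 2, 3))
vp = merge(Structure("V", 0, 1), dp)
print(vp.category, vp.start, vp.end)
```

In this sketch a failed Merge returns `None`, so a tabular parser can simply skip the pair and try other adjacent structures.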
FipsRomanian: Sample results
[Figure: sample parse tree produced by Fips (labels: subject, predicate, direct object)]
Implementation
• Left-to-right, bottom-up tabular parsing algorithm, relying on detailed lexical information
• Language-independent core + language-specific implementation
• Component Pascal, OOP paradigm, BlackBox IDE
• Supported languages: French, English, German, Spanish, Italian, Greek; others in progress
Preliminary results
Screen captures
Parsing experiment
• data: journalistic texts, 1.05M words
• average sentence length: 26.9 tokens
• 16.2% full parses (FipsFrench, FipsEnglish: about 80%)
• average length of partial parses: 5.3 tokens
• unknown words: 6.5% (of which 39.2% proper nouns)
• satisfactory lexical coverage
• grammatical coverage needs to be improved (work in progress!)
parsing output
Task-based evaluation
• Collocation extraction from parsed data (Seretan, 2008)
• Collocations are “half idioms”: idioms of encoding, but not of decoding
• Used by parser and in-house rule-based machine translation
system
• Precision for top 2000 results: 30.3%
Sample collocations extracted
(Precision for French data: 65.9%, top 500 results)
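The precision figures above are computed over the top-ranked extraction results. A minimal sketch of that measure follows; the function name `precision_at_k` and the toy judgement list are assumptions for illustration and do not reproduce the evaluation data.

```python
from typing import Sequence


def precision_at_k(judgements: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked candidates judged to be true collocations.

    `judgements[i]` is True if the candidate at rank i (best first) was
    annotated as a collocation.
    """
    if k <= 0 or k > len(judgements):
        raise ValueError("k must be between 1 and the number of candidates")
    return sum(judgements[:k]) / k


# Toy annotation of four ranked candidates (not real data).
toy = [True, False, True, True]
print(precision_at_k(toy, 2))
print(precision_at_k(toy, 4))
```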
Related work & Useful resources
• Data-driven dependency parser for Romanian based on MaltParser, which learns dependencies
from manual annotations (Călăcean and Nivre, 2009). Limitations: small treebank and limited
grammatical coverage (simple structures, no subordination, average sentence length of only 9
words).
• Sketch Engine for Romanian: shallow parsing (POS patterns), http://www.sketchengine.co.uk/
• Dependency treebank construction, work in progress at the University of Iaşi, Romania
• Text processing web services, RACAI – Research Institute for Artificial Intelligence, Romanian
Academy, Bucharest, Romania. http://www.racai.ro/webservices/TextProcessing.aspx
• A repository of tools for Romanian: ConsILR - Consortium for the Romanian Language:
Resources & Tools, research groups from Iaşi, Bucharest and Chişinău http://consilr.info.uaic.ro/
Faculté des Lettres, Département de Linguistique
POS-tagging output
Fips interface
Lexicon interface
References
Bresnan, J. 2001. Lexical Functional Syntax. Blackwell, Oxford.
Chomsky, N. 1995. The Minimalist Program. MIT Press, Cambridge, Mass.
Călăcean, M. and J. Nivre. 2009. A data-driven dependency parser for Romanian. In
Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 7),
pages 65–76, Groningen, The Netherlands.
DEX. 1998. Dicţionarul explicativ al limbii române. Academia Română, Bucharest.
Seretan, V. 2008. Collocation extraction based on syntactic parsing. Ph.D. thesis, University of
Geneva.
Soare, G. 2005. Romanian syntax. Technical report, University of Geneva.
Wehrli, E. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep
Linguistic Processing, pages 120–127, Prague, Czech Republic.