Grammar Development Platform
Miriam Butt
October 2002

Grammar Development
What is a Grammar Development Platform good for?
• Information Retrieval/Extraction
• Machine Translation (MT)

[Figure: XLE pipeline with the components Parser, MT, and Generator, relating English "Anna sees the man." (English c-str and f-str) to German "Anna sieht den Mann." (German f-str).]

A Sample Development Platform: XLE (Xerox Linguistic Environment)
• Main Developer: John Maxwell (PARC)
• Software (Shareware): Emacs, Tcl/Tk
• Platforms: Unix (Solaris), Linux, Mac OS X
• Linguistic Theory: LFG (Lexical-Functional Grammar), originally developed by Ronald M. Kaplan (PARC) and Joan Bresnan (Stanford)
• Parser: Bottom-Up, Left-to-Right
• Performance: Worst-case exponential, polynomial in practice (makes broad-coverage grammars feasible)

The ParGram Project
• Palo Alto Research Center (PARC): English Grammar
• IMS, University of Stuttgart: German Grammar
• Fuji Xerox: Japanese Grammar
• University of Bergen: Norwegian Grammar (Bokmål and Nynorsk)
• XRCE Grenoble: French Grammar
• UMIST: Urdu Grammar

ParGram
Possible Applications:
• Machine Translation (French, English)
• Tree Banking (English, German)
• Smart Text Annotation (German)
• Robust Parsing (English, German, French)
• Information Extraction (English)
• Teaching Tools (Urdu)

Grammar Components
Each Grammar Contains:
• Phrase Structure Rules (S → NP VP)
• Lexicon (verb stems and functional elements)
• Finite-State Morphological Analyzer
No Semantics

Phrase Structure Rules
The formulation used today goes back to Chomsky 1957.
Sample Set for English:
S → NP VP
VP → V NP
NP → D (ADJ) N
Why these kinds of rules?
• Natural Language is recursive and potentially infinite.
• Constituency, X-bar Theory

The syntax of natural languages is context-free.
Colorless green ideas sleep furiously.
However, we must also deal with context-sensitive information.
The monkey sleeps.
*The monkey sleep.
*The monkeys sleeps.
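The sample rules above can be tried out with a small recognizer. The following is a minimal sketch, not XLE's parser: it uses naive top-down expansion rather than the bottom-up strategy named earlier. The rule VP → V (needed for the intransitive "sleeps" examples) and the whole lexicon are assumptions added for illustration. It shows that a purely context-free grammar accepts the agreement-violating strings as well, which is why feature annotations are needed.

```python
# Sample phrase structure rules from the slides.
# VP -> V is an added assumption so that intransitive sentences parse;
# the optional (ADJ) in NP -> D (ADJ) N is spelled out as two expansions.
RULES = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V"]],
    "NP": [["D", "ADJ", "N"], ["D", "N"]],
}

# Hypothetical toy lexicon: word -> category only, no agreement features.
LEXICON = {
    "the": "D", "a": "D", "green": "ADJ",
    "monkey": "N", "monkeys": "N", "man": "N",
    "sleeps": "V", "sleep": "V", "sees": "V",
}


def derives(symbols, words):
    """Return True if the symbol sequence derives exactly the word sequence."""
    if not symbols:
        return not words
    first, rest = symbols[0], symbols[1:]
    if first in RULES:
        # Nonterminal: try every expansion in place of the first symbol.
        return any(derives(expansion + rest, words) for expansion in RULES[first])
    # Preterminal: the next word must belong to this category.
    return bool(words) and LEXICON.get(words[0]) == first and derives(rest, words[1:])


def accepts(sentence):
    """Check whether the toy grammar recognizes the sentence."""
    words = sentence.lower().rstrip(".").split()
    return derives(["S"], words)


for example in ["The monkey sleeps.", "The monkey sleep.", "The monkeys sleeps."]:
    print(example, accepts(example))
```

All three strings are accepted because the rules only look at categories (D, N, V), not at number. Word order is still enforced: `accepts("Sleeps the monkey.")` is rejected.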
Features and Unifications
Context-sensitivity can be achieved in many ways.
XLE and LFG (like many other theories/platforms) use phrase-structure annotation via attribute-value pairs.
S → NP              VP
    (↑ SUBJ) = ↓    (↑ SUBJ NUM) = (↓ NUM)
Features are checked via Unification.

The Ambiguity Problem
PP-Attachment
The girl saw the monkey with the telescope.
Categorial Ambiguity
Flying planes can be dangerous.
Time flies like an arrow.

Lexicons
Typically Contain:
• Category Information (Terminal Node in Tree)
• Context-Sensitive Featural Information
• Subcategorization Information
• Semantics (sometimes)

Ambiguity in Large Grammars
Ambiguity: a serious problem even in simple sentences
• PP-attachment (English)
• Subject/Object Ambiguities (German)
Within XLE, various techniques have been developed to cut down on the explosion of parses.
• Packed Representations
• Optimality Marking

Morphologies and Tokenizers
Beyond the Word: Writing and adding in Morphological Analysis and Tokenization

Parallel Analyses
Languages Differ on the Surface (c-structure)
English: Yassin was seen.
German: Yassin wurde gesehen.
Urdu: yassin dekha gaya
ParGram Goal: The same underlying f-structures for all languages (modulo lexical semantics).

The “Parallel” in ParGram
Analyses at the level of f-structure are held as parallel as possible across languages (crosslinguistic invariance).
• Theoretical Advantage: This models the idea of UG.
• Applicational Advantage: Machine translation is made easier.
Analyses at the level of c-structure are allowed to differ much more (variance across languages).

FST Morphological Analyzers
Kaplan and Butt (2002): this LFG morphology-syntax interface is natural:
surface form: calana ‘to drive’ (M.Sg)
    ↓ Seq (Sequence Relation)
drive+Verb+Inf+M+Sg
    ↓ L (Lexical Relation)
[VFORM inf] [GEND masc] [NUM sg]
    ↓ Sat (Satisfaction Relation)
f-structure (m-structure):
    PRED   ‘drive<Subj,Obj>’
    VFORM  inf
    GEND   masc
    NUM    sg
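The last two steps of this interface, and the feature checking by unification mentioned earlier, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Kaplan and Butt formalization and not XLE's implementation: the tag table `TAG_FEATURES`, the stem table `STEM_PRED`, and the function names are hypothetical, f-structures are modeled as plain nested dictionaries, and structure sharing (reentrancy) is ignored.

```python
# Hypothetical lexical relation: morphological tag -> feature description.
TAG_FEATURES = {
    "Inf": {"VFORM": "inf"},
    "M": {"GEND": "masc"},
    "F": {"GEND": "fem"},
    "Sg": {"NUM": "sg"},
    "Pl": {"NUM": "pl"},
}

# Hypothetical stem lexicon: stem -> PRED value.
STEM_PRED = {"drive": "drive<Subj,Obj>"}


def unify(f, g):
    """Merge two feature structures; return None if any value clashes."""
    result = dict(f)
    for attr, value in g.items():
        if attr not in result:
            result[attr] = value
        elif isinstance(result[attr], dict) and isinstance(value, dict):
            sub = unify(result[attr], value)
            if sub is None:
                return None
            result[attr] = sub
        elif result[attr] != value:
            return None  # conflicting atomic values
    return result


def analyze(tagged):
    """Build an f-structure from analyzer output such as 'drive+Verb+Inf+M+Sg'."""
    stem, *tags = tagged.split("+")
    descriptions = [{"PRED": STEM_PRED[stem]}]
    # Category tags such as 'Verb' contribute no features in this sketch.
    descriptions += [TAG_FEATURES[t] for t in tags if t in TAG_FEATURES]
    fstructure = {}
    for description in descriptions:
        fstructure = unify(fstructure, description)
        if fstructure is None:
            raise ValueError(f"inconsistent features in {tagged!r}")
    return fstructure


print(analyze("drive+Verb+Inf+M+Sg"))
# Agreement check in the spirit of (↑ SUBJ NUM) = (↓ NUM):
print(unify({"SUBJ": {"NUM": "sg"}}, {"SUBJ": {"NUM": "pl"}}))  # clash
```

In this sketch a number clash makes unification fail, which is how an annotated grammar could rule out strings like "The monkeys sleeps" that the bare phrase structure rules admit.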