Grammar Development Platform
Miriam Butt
October 2002
Grammar Development
What is a Grammar Development Platform good for?
• Information Retrieval/Extraction
• Machine Translation (MT)
[Figure: machine translation with XLE. German: Anna sieht den Mann. → Parser → German f-str → MT → English c-str and f-str → Generator → English: Anna sees the man.]
A Sample Development Platform
XLE (Xerox Linguistic Environment)
• Main Developer: John Maxwell (PARC)
• Software (Shareware): Emacs, Tcl/Tk
• Platforms: Unix (Solaris), Linux, MacOsX
• Linguistic Theory: LFG (Lexical-Functional Grammar), originally developed by Ronald M. Kaplan (PARC) and Joan Bresnan (Stanford)
• Parser: Bottom-Up, Left-to-Right
• Performance: Worst-case exponential, polynomial in practice (which makes broad-coverage grammars feasible)
The ParGram Project
• Palo Alto Research Center (PARC): English Grammar
• IMS, University of Stuttgart: German Grammar
• Fuji Xerox: Japanese Grammar
• University of Bergen: Norwegian (Bokmål and Nynorsk)
• XRCE Grenoble: French Grammar
• UMIST: Urdu Grammar
ParGram
Possible Applications:
• Machine Translation (French, English)
• Tree Banking (English, German)
• Smart Text Annotation (German)
• Robust Parsing (English, German, French)
• Information Extraction (English)
• Teaching Tools (Urdu)
Grammar Components
Each Grammar Contains:
• Phrase Structure Rules (S → NP VP)
• Lexicon (verb stems and functional elements)
• Finite-State Morphological Analyzer
No Semantics
Phrase Structure Rules
The formulation used today goes back to Chomsky (1957).
Sample Set for English:
S NP VP
VP V NP
NP D (ADJ) N
Why these kinds of rules?
• Natural Language is recursive and potentially infinite.
• Constituency, X-bar Theory
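As a rough illustration of how the sample rules license sentences, the following minimal sketch encodes them as a Python dictionary and checks whether a word string can be derived from S. The toy lexicon and the function names `ends` and `recognize` are assumptions made for this example; they are not part of XLE or of the original slides.

```python
# Sample rules from the slide; the optional (ADJ) is expanded into two alternatives.
RULES = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["D", "N"], ["D", "ADJ", "N"]],
}

# Hypothetical toy lexicon (an assumption for this sketch).
LEXICON = {
    "the": "D", "a": "D",
    "small": "ADJ",
    "girl": "N", "monkey": "N",
    "sees": "V",
}


def ends(symbol, words, start):
    """Yield every position where `symbol` can end if it starts at `start`."""
    if symbol not in RULES:
        # Preterminal category: it must match exactly one word.
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield start + 1
        return
    for rhs in RULES[symbol]:
        positions = {start}
        for child in rhs:
            # Advance every candidate position through the next daughter.
            positions = {e for p in positions for e in ends(child, words, p)}
        yield from positions


def recognize(sentence):
    """Return True if the whole sentence can be derived from S."""
    words = sentence.lower().split()
    return len(words) in set(ends("S", words, 0))


print(recognize("the girl sees a small monkey"))
print(recognize("the girl sees"))
```

This simple top-down strategy only terminates because none of the three rules is left-recursive; it is not the bottom-up strategy that XLE uses.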
Phrase Structure Rules
The syntax of natural languages is context-free.
Colorless green ideas sleep furiously.
However, we must also deal with context-sensitive
information.
The monkey sleeps.
*The monkey sleep.
*The monkeys sleeps.
Features and Unifications
Context-Sensitivity can be achieved in many ways.
XLE and LFG (like many other theories/platforms) use
phrase-structure annotation via attribute-value pairs.
S → NP: (↑ SUBJ) = ↓
    VP: (↑ SUBJ NUM) = (↓ NUM)
Features are checked via Unification.
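To make the agreement check concrete, here is a minimal sketch of unification over attribute-value structures, represented as nested Python dictionaries. The representation, the toy lexical entries, and the names `unify` and `agree` are assumptions for illustration only. Treating "sleep" as strictly plural is a deliberate simplification, and none of this reflects how XLE implements unification internally.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None  # conflicting values somewhere below this key
                result[key] = merged
            else:
                result[key] = value
        return result
    # Atomic values unify only if they are identical.
    return a if a == b else None


# Hypothetical toy entries (an assumption; "sleep" is simplified to plural only).
LEXICON = {
    "monkey": {"PRED": "monkey", "NUM": "sg"},
    "monkeys": {"PRED": "monkey", "NUM": "pl"},
    "sleeps": {"PRED": "sleep<SUBJ>", "SUBJ": {"NUM": "sg"}},
    "sleep": {"PRED": "sleep<SUBJ>", "SUBJ": {"NUM": "pl"}},
}


def agree(noun, verb):
    """Unify the noun's features into the verb's SUBJ slot."""
    return unify({"SUBJ": LEXICON[noun]}, LEXICON[verb])


print(agree("monkey", "sleeps"))   # a combined structure
print(agree("monkey", "sleep"))    # None: sg clashes with pl
print(agree("monkeys", "sleeps"))  # None: pl clashes with sg
```

The two ill-formed strings fail because the subject's NUM value and the value the verb requires for its subject cannot be unified.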
The Ambiguity Problem
PP-Attachment
The girl saw the monkey with the telescope.
Categorial Ambiguity
Flying planes can be dangerous.
Time flies like an arrow.
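The PP-attachment example above can be reproduced with a small chart-style parser that enumerates every tree. The binary grammar and lexicon below are assumptions chosen for this sketch (they are not taken from any ParGram grammar), and `parses` is a hypothetical helper name.

```python
from functools import lru_cache

# Hypothetical binary grammar: a PP may attach to a VP or to an NP.
RULES = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "NP": [("D", "N"), ("NP", "PP")],
    "PP": [("P", "NP")],
}

# Hypothetical toy lexicon.
LEXICON = {
    "the": "D", "girl": "N", "monkey": "N", "telescope": "N",
    "saw": "V", "with": "P",
}


def parses(sentence):
    """Return all parse trees for the sentence as nested tuples."""
    words = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)
    def build(symbol, i, j):
        trees = []
        # Lexical case: a single word of the right category.
        if j - i == 1 and LEXICON.get(words[i]) == symbol:
            trees.append((symbol, words[i]))
        # Binary case: try every split point between i and j.
        for left, right in RULES.get(symbol, ()):
            for k in range(i + 1, j):
                for left_tree in build(left, i, k):
                    for right_tree in build(right, k, j):
                        trees.append((symbol, left_tree, right_tree))
        return tuple(trees)

    return build("S", 0, len(words))


for tree in parses("the girl saw the monkey with the telescope"):
    print(tree)
```

Under these assumptions the sentence receives two trees: one in which the PP modifies the verb phrase (the girl uses the telescope) and one in which it modifies "the monkey".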
Lexicons
Typically Contain:
• Category Information (Terminal Node in Tree)
• Context Sensitive Featural Information
• Subcategorization Information
• Semantics (sometimes)
Ambiguity in Large Grammars
Ambiguity: a serious problem even in simple sentences
• PP-attachment (English)
• Subject/Object Ambiguities (German)
Within XLE, various techniques have been developed to curb
the explosion of parses:
• Packed Representations
• Optimality Marking
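The following sketch illustrates how quickly parses multiply and why sharing sub-analyses helps. It counts the analyses of a sentence with a memoized chart instead of building each tree. The grammar, lexicon, and function names are assumptions for this example, and the counting chart is only a loose analogy for the idea of packing; it is not XLE's packed representation.

```python
from functools import lru_cache

# Hypothetical binary grammar with PP attachment to VP or NP.
RULES = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "NP": [("D", "N"), ("NP", "PP")],
    "PP": [("P", "NP")],
}

# Hypothetical toy lexicon.
LEXICON = {
    "the": "D", "girl": "N", "monkey": "N", "telescope": "N",
    "saw": "V", "with": "P",
}


def count_parses(sentence):
    """Count parse trees without enumerating them."""
    words = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        total = 0
        if j - i == 1 and LEXICON.get(words[i]) == symbol:
            total += 1
        for left, right in RULES.get(symbol, ()):
            for k in range(i + 1, j):
                left_count = count(left, i, k)
                if left_count:
                    # Each shared sub-analysis is counted once and reused.
                    total += left_count * count(right, k, j)
        return total

    return count("S", 0, len(words))


def growth(max_pps):
    """Parse counts for a sentence followed by 0..max_pps prepositional phrases."""
    base = "the girl saw the monkey"
    return [count_parses(base + " with the telescope" * n)
            for n in range(max_pps + 1)]


print(growth(3))  # [1, 2, 5, 14]
```

With this toy grammar the number of analyses grows from 1 to 14 as three PPs are added, while the chart stores each (category, span) result only once.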
Morphologies and Tokenizers
Beyond the Word: Writing and Integrating
Morphological Analysis and Tokenization
Parallel Analyses
Languages Differ on the Surface (c-structure)
English:
Yassin was seen.
German:
Yassin wurde gesehen.
Urdu:
yassin dekha gaya
ParGram Goal: The same underlying f-structures
for all languages (modulo lexical semantics).
The “Parallel” in ParGram
Analyses at the level of f-structure are held as parallel as
possible across languages (crosslinguistic invariance).
• Theoretical Advantage: This models the idea of Universal Grammar (UG).
• Practical Advantage: Machine translation is made easier.
Analyses at the level of c-structure are allowed to differ
much more (variance across languages).
FST Morphological Analyzers
Kaplan and Butt (2002): this LFG morphology-syntax interface is
natural:
• Surface form: calana ‘to drive’ (M.Sg)
• Sequence Relation (Seq): drive+Verb+Inf+M+Sg
• Lexical Relation (L): [VFORM inf], [NUM sg], [GEND masc]
• Satisfaction Relation (Sat): f-structure (m-structure)
    PRED   ‘drive<Subj,Obj>’
    VFORM  inf
    GEND   masc
    NUM    sg
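The mapping from the analyzer's tag sequence to f-structure features can be sketched as a simple table lookup. The tables and the function name `analysis_to_features` below are assumptions for illustration. A real finite-state morphology would produce the tag string itself, and this sketch covers only the step from tags to attribute-value pairs.

```python
# Hypothetical tag-to-feature table, based on the tags shown in the example above.
TAG_FEATURES = {
    "Inf": ("VFORM", "inf"),
    "M": ("GEND", "masc"),
    "Sg": ("NUM", "sg"),
}

# Hypothetical stem lexicon supplying the PRED value.
STEM_PREDS = {"drive": "drive<Subj,Obj>"}

# Tags that name a category rather than contribute a feature.
CATEGORY_TAGS = {"Verb"}


def analysis_to_features(analysis):
    """Turn a tag string such as 'drive+Verb+Inf+M+Sg' into a feature dictionary."""
    stem, *tags = analysis.split("+")
    features = {"PRED": STEM_PREDS[stem]}
    for tag in tags:
        if tag in CATEGORY_TAGS:
            continue
        attribute, value = TAG_FEATURES[tag]
        features[attribute] = value
    return features


print(analysis_to_features("drive+Verb+Inf+M+Sg"))
```

Unknown stems or tags raise a `KeyError` in this sketch; a fuller version would need a policy for unanalyzed forms.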