Language and Speech Technology: Parsing Jan Odijk January 2011 LOT Winter School 2011 1 Overview • Grammars & Grammar Types • Parsing – Naïve Parsing – Earley Parser – Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN 2 Overview • Grammars & Grammar Types • Parsing – Naïve Parsing – Earley Parser – Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN 3 Grammars • Grammar G = (VT, VN, P, S) where – VT terminal vocabulary – VN nonterminal vocabulary – P set of rules α→β (lhs → rhs) • α Є VN+ • β Є (VN U VT)* – S Є VN (start symbol) 4 Grammars • Example Grammar G = (VT, VN, P, S) with – VT = {the, a, garden, book, in,} – VN = {NP, Det, N, P, PP} – P = {PP→P NP, NP→Det N, Det→the, Det→a, N→garden, N→book, P→in } – S = PP 5 Example Derivation • • • • • • PP P NP in NP in Det N in the N in the garden (start symbol) (PP →P NP) (P → in) (NP →Det N) (Det → the) ( N → garden) 6 Grammar Types • Finite State Grammars (Type 3) – – – – A → aA, A → a. A Є VN, a Є VT Too weak to deal with natural language in toto Efficient processing techniques Often used for applications where partial analyses of natural language are sufficient – Often used for morphology / phonology 7 Grammar Types • Context-Free Grammars (CFG, Type 2) – A → β. A Є VN – To weak to deal with natural language • Surely for strong generative adequacy • Also for weak generative adequacy – Reasonably efficient processing techniques – Generally taken as a basis for dealing with natural language, extended with other techniques 8 Grammar Types • Context-Sensitive Grammars (Type 1) – α→β, |α| <= |β| – Usually not considered in the context of NLP • Type-0 grammars – No restrictions – Usually not considered except in combination with CFG 9 Overview • Grammars & Grammar Types • Parsing – Naïve Parsing – Earley Parser – Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN 10 Parsing • Parsing – Is an algorithm • It must finish! – For assigning syntactic structures • Ambiguity! – To a sequence of terminal symbols – In accordance with a given grammar – (If possible, efficient) 11 Parsing for CFGs • Focus here on – Parser for CFGs – for natural language – More specifically: Earley parser • Why? – Most NLP systems with a grammar use a parser for CFG as a basis – Basic techniques will also recur in parsers for 12 different grammar types Overview • Grammars & Grammar Types • Parsing – Naïve Parsing – Earley Parser – Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN 13 Naïve Parsing • see handout • Problems for naïve parsing – A lot of re-parsing of subtrees – Bottom-up • Wastes time and space on trees that cannot lead to S – Top-down • Wastes time and space on trees that cannot match input string 14 Naïve parsing • Top-down – Recursion problem • Can be solved for right-recursion by matching with input tokens, but • Problem with left recursion remains: – NP → NP PP • Ambiguity – Temporary ambiguity – Real ambiguity 15 Naïve parsing • Naïve Parsing Complexity – Time needed to parse is exponential: – cn (c a constant, length input tokens) – (in the worst case) • Takes too much time • Is not practically feasible 16 Overview • Grammars & Grammar Types • Parsing – Naïve Parsing – Earley Parser – Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN 17 Earley Parser • Top-down approach but – Predictor avoids wasting time and space on irrelevant trees – Does not build actual structures, but stores enough information to reconstruct structures – Uses dynamic programming technique to avoid recomputation of subtrees – Avoids problems with left recursion – Makes complexity cubic: n3 18 Earley Parser • Number positions in input string (0 .. N) • 0 book 1 that 2 flight 3 • Notation [i,j] stands for the string from position i to position j – [0,1] = “book” – [1,3] = “that flight” – [2,2]= “” 19 Earley Parser • Dotted Rules – is a grammar rule + indication of progress – ie. Which elements of the rhs have been seen yet and which ones not yet – Indicated by a dot (we use an asterisk) • Example – S → Aux NP * VP – Aux and NP have been dealt with but VP not yet 20 Earley Parser • Input: – Sequence of N words (words[1..N]), and – grammar • Output: – a Store = (agenda, chart) • (sometimes chart = N+1 chart entries: chart[0 .. N]) 21 Earley Parser • Agenda, chart: sets of states • A state consists of – Dotted rule – Span relative to the input: [i,j] – Previous states: list of state identifiers • And gets a unique identifier • Example – S11: VP → V’ * NP; [0,1]; [S8] 22 Earley Parser • State – Is complete • iff dot is the last element in the dotted rule • E.g. state with VP → Verb NP * is complete • NextCat (state) – Only applies if state is not complete – Is the category immediately following the dot – VP → Verb * NP : NextCat(state)= NP 23 Earley Parser • 3 operations on states, – Predictor • Predicts which categories to expect – Scanner • if a terminal category C is expected, and a word of category C is encountered in this position, – Consumes the word and shifts the dot – Completer • Applies to a complete state s, and modifies all states 24 that gave rise to this state Earley Parser • Predictor – – – – Applies to an incomplete state ( A → α * B β, [i,j], _) B is a nonterminal For each (B → γ) in grammar • Make a new state s = (B → * γ, [j,j], []) • enqueue(s , store) – Enqueue (s,ce) = add s to ce unless ce already contains s 25 Earley Parser • Scanner – Applies to an incomplete state – ( A → α * b β, [i,j], _) – b is a terminal • Make a new state s = (b → words[j] * , [j,j+1], []) • enqueue(s , store) 26 Earley Parser • Completer – Applies to an complete state – ( B → γ *, [j,k], L1) – For each (A → α * B β, [i,j], L2) in chart[j] • Make new state s = (A → α B * β, [i,k], L2 ++ L1) • enqueue(s , store) 27 Earley Parser • Store = (agenda, chart) • Apply operations on states in the agenda until the agenda is empty • When applying an operation to a state s in the agenda – Move the state s from the agenda into the chart – Add the resulting states of the operation to the agenda 28 Earley Parser • Initial store = ([Г → *S], emptychart) – Where Г is a ‘fresh’ nonterminal start symbol • Input sentence accepted – Iff there is a state (Г → S *, [0,N], LS) in the chart and the agenda is empty • Parse tree(s) can be reconstructed via the list of earlier states (LS) 29 Overview • Grammars & Grammar Types • Parsing – Naïve Parsing – Earley Parser – Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN 30 Overview • Grammars & Grammar Types • Parsing – Naïve Parsing – Earley Parser – Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN 31 Earley Parser Extensions • Replace elements of V by feature sets (attribute-value matrices, AVMs) – Harmless if finitely valued – E.g. instead of NP [cat=N, bar=max, case=Nom] – Usually other relation than ‘=‘ used for comparison • E.g. ‘is compatible with’, ‘unifies with’, ‘subsumes’ 32 Earley Parser Extensions • Replace rhs of rules by regular expressions over V (or AVMs) • E.g. VP → V NP? (AP | PP)* abbreviates • VP → V, VP → V NP, VP → V APorPP, VP → V NP APorPP, • APorPP → AP APorPP, APorPP → PP APorPP, APorPP → AP, APorPP → PP • Where APorPP is a ‘fresh’ virtual nonterminal • Virtual : is discarded when constructing the trees 33 Earley Parser Extensions • My grammatical formalism has no PS rules! • But only ‘lexical projection’ of syntactic selection properties (subcategorization list) • E.g. buy: [cat=V, subcat = [_ NP PP, _ NP]] • create PS rules on the fly – If buy occurs in the input tokens, create rules • VP → buy NP PP and VP → buy NP – From the lexical entry – And use these rules to parse 34 Earley Parser Extensions • My grammar contains ε-rules: – NP → ε – Where ε stands for the empty string – (i.e. NP matches the empty string in the input token list) • Earley parser can deal with these! • But extensive use creates many ambiguities! 35 Earley Parser Extensions • My grammar contains empty categories – Independent • PRO as subject of non-finite verbs – PRO buying books is fun • pro as subject of finite verbs in pro-drop languages – pro no hablo Español • Pro as subject of imperatives – pro schaam je! • Epsilon rules can be used or represent this at other level 36 Earley Parser Extensions • My grammar contains empty categories – Dependent • trace of wh-movement – What did you buy t • Trace of Verb movement (e.g V2 in Dutch, German, Aux movement in English – Hij belt hem op t – Did you t buy a book? – Epsilon rules are not sufficient 37 Earley Parser Extensions • Other types (levels) of representation • LFG: (c-structure, f-structure) • HPSG: DAGs (special type of AVMs) • (constituent structure, semantic representation) • Use CFG as backbone grammar – Which accepts a superset of the language – For each rule specify how to construct other level of representation – Extend Earley parser to deal with this 38 Earley Parser Extensions • Other types (levels) of representation • f-structure, DAGs, semantic representations are not finitely valued • Thus it will affect efficiency • But allows dealing with e.g. – Non-context-free aspects of a language – Unbounded dependencies (e.g. by ‘gap-threading’) 39 Earley Parser in Practice • Parsers for natural language yield – Many many parse trees for an input sentence • • • • Many more than you can imagine (thousands) Even for relatively short, simple sentences They are all syntactically correct But make no sense semantically 40 Earley Parser in Practice • Additional constraining is required – To reduce the temporary ambiguities – To come up with the ‘best’ parse • Can be done by semantic constraints – But only feasible for very small domains • Is most often done using probabilities – Rule probabilities derived from frequencies in treebanks 41 Parsers: Some Examples • Dutch: Alpino parser • Stanford parsers – English, Arabic, Chinese • English: ACL Overview 42 Overview • Grammars & Grammar Types • Parsing – Naïve Parsing – Earley Parser – Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN 43 Parsers & CLARIN • Parser allows one to automatically analyze large text corpora • Resulting in treebanks • Can be used for linguistic research – But with care!! • Example: Lassy Demo (Dutch) – Simple search interface to LASSY-small Treebank 44 – Use an SVG compatible browser (e.g. Firefox) Parsers & CLARIN • Example of linguistic research using a treebank: • Van Eynde 2009: A treebank-driven investigation of predicative complements in Dutch 45 Thanks for your attention! 46