PARSING: Analyzing Linguistic Units

Task, formal mechanism, and resulting representation:
• Morphology: analyze words into morphemes. Mechanism: context-dependency rules, FST composition. Result: morphological structure.
• Phonology: analyze words into phonemes. Mechanism: context-dependency rules, FST composition. Result: phonemic structure.
• Syntax: analyze sentences for the syntactic relations between words. Mechanism: grammars (CFGs, PDAs) and parsing algorithms (top-down, bottom-up, Earley, CKY). Result: parse tree / derivation tree.

Why should we parse a sentence?
• to detect relations among words
• to normalize surface syntactic variations
• invaluable for a number of NLP applications

Some Concepts
• Grammar: a generative device that prescribes a set of valid strings.
• Parser: a device that uncovers the sequence of grammar rules that might have generated the input sentence.
  – Input: grammar, sentence. Output: parse tree / derivation tree.
• Recognizer: a device that returns "yes" if the input string could be generated by the grammar.
  – Input: grammar, sentence. Output: boolean.

Searching for a Parse
A grammar plus a rewrite procedure encodes:
• all strings generated by the grammar: L(G)
• all parse trees for each generated string s: T(G) = ∪_s T_s(G)
Given an input sentence I, the set of its parse trees is T_I(G). Parsing is searching for T_I(G) ⊆ T(G). Ideally, the parser finds the appropriate parse for the sentence.

CFG for a Fragment of English
Rules:
  S → NP VP       S → Aux NP VP     S → VP
  NP → Det Nom    NP → PropN
  Nom → N         Nom → N Nom       Nom → Nom PP
  VP → V          VP → V NP
  PP → Prep NP
Lexicon:
  Det → that | this | a
  N → book | flight | meal | money
  V → book | include | prefer
  Aux → does
  Prep → from | to | on
  PropN → Houston | TWA
(Slides: example bottom-up and top-down parses of "Book that flight", with V → Book, Det → that, N → flight.)

Top-down / Bottom-up Parsing
Top-down (recursive-descent parser):
• Starts from: S (the goal).
• Algorithm: (a) pick non-terminals (in parallel); (b) pick rules from the grammar to expand the non-terminals.
• Termination: success when the leaves of a tree match the input; failure when there are no more non-terminals to expand in any of the trees.
• Pro: goal-driven, starts with "S". Con: constructs trees that may not match the input.
Bottom-up (shift-reduce parser):
• Starts from: the words (the input).
• Algorithm: (a) match a sequence of input symbols against the RHS of some rule; (b) replace that sequence by the LHS of the matching rule.
• Termination: success when "S" is reached; failure when no more rewrites are possible.
• Pro: constrained by the input string. Con: constructs constituents that may not lead to the goal "S".

• Control strategy: how should the search space be explored?
  – Pursue all parses in parallel, or backtrack, or ...?
  – Which rule to apply next?
  – Which node to expand next?
• Look at how top-down and bottom-up parsing work on the board for "Book that flight".

Top-down, Depth-First, Left-to-Right Parser
Systematic, incremental expansion of the search space (in contrast to a parallel parser).
• Start state: (• S, 0)
• End state: (•, n), where n is the length of the input to be parsed
• Next-state rules:
  – (• w_{j+1} β, j) ⇒ (• β, j+1)   (match the next input word)
  – (• B β, j) ⇒ (• γ β, j) if B → γ   (B is the left-most non-terminal)
• Agenda: a data structure that keeps track of the states to be expanded. If the agenda is a stack, the expansion is depth-first.
(See Fig. 10.7 for the full algorithm.)
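To make the agenda-based search concrete, here is a minimal Python sketch of a top-down, depth-first, left-to-right recognizer over the fragment grammar above. It is an illustrative sketch rather than the algorithm of Fig. 10.7 itself: the names (GRAMMAR, recognize) are invented, and the left-recursive rule Nom → Nom PP is deliberately omitted so that the depth-first search terminates (see the discussion of left recursion below).

```python
# A minimal top-down, depth-first, left-to-right recognizer using an agenda
# (stack) of states (•beta, j), as described above. Recognition only: it
# answers yes/no and does not build parse trees.
# NOTE: the left-recursive rule Nom -> Nom PP from the fragment grammar is
# omitted here, otherwise the depth-first search would never terminate
# (see "Limitation of Top-down Parsing: Left Recursion" below).

GRAMMAR = {  # non-terminal -> list of possible right-hand sides
    "S":    [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP":   [["Det", "Nom"], ["PropN"]],
    "Nom":  [["N"], ["N", "Nom"]],
    "VP":   [["V"], ["V", "NP"]],
    "PP":   [["Prep", "NP"]],
    "Det":  [["that"], ["this"], ["a"]],
    "N":    [["book"], ["flight"], ["meal"], ["money"]],
    "V":    [["book"], ["include"], ["prefer"]],
    "Aux":  [["does"]],
    "Prep": [["from"], ["to"], ["on"]],
    "PropN": [["Houston"], ["TWA"]],
}

def recognize(words):
    agenda = [(("S",), 0)]                        # start state (•S, 0)
    while agenda:
        symbols, j = agenda.pop()                 # stack => depth-first expansion
        if not symbols:
            if j == len(words):                   # end state (•, n)
                return True
            continue                              # derived a string shorter than the input
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:                      # expand the left-most non-terminal
            for rhs in GRAMMAR[first]:
                agenda.append((tuple(rhs) + rest, j))
        elif j < len(words) and first == words[j]:
            agenda.append((rest, j + 1))          # match a terminal, consume one word
        # otherwise: dead end, the state is simply dropped (backtracking via the stack)

print(recognize("book that flight".split()))                  # True
print(recognize("does this flight include a meal".split()))   # True
print(recognize("that flight book".split()))                  # False
```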
CFG Left Corners
• Can we help top-down parsers with some bottom-up information?
  – Unnecessary states are created if there are many B → γ rules.
  – If after successive expansions B ⇒* w δ, and w does not match the input, then the whole series of expansions is wasted.
• The leftmost symbol derivable from B needs to match the input: look ahead to the left corner of the tree.
• B is a left corner of A if A ⇒* B γ.
• Build a table with the left corners of all non-terminals in the grammar and consult it before applying a rule.

Category   Left corners
S          Det, PropN, Aux, V
NP         Det, PropN
Nom        N
VP         V

• At a given point in state expansion (• B β, j): pick the rule B → C γ only if a left corner of C matches the input w_{j+1}.

Limitation of Top-down Parsing: Left Recursion
• Depth-first search will never terminate if the grammar is left recursive (e.g. NP → NP PP), i.e. if some non-terminal A can derive a string beginning with itself (A ⇒* A β).
• Solutions:
  – Rewrite the grammar into a weakly equivalent one that is not left-recursive, e.g. replace NP → NP PP, NP → Nom by NP → Nom NP', NP' → PP NP', NP' → ε. This may make the rules unnatural.
  – Fix the depth of search explicitly.

Other book-keeping needed in top-down parsing:
• memoization for reusing previously parsed substrings
• a packed representation for parse ambiguity

Dynamic Programming for Parsing
Memoization:
• Create a table of solutions to sub-problems (e.g. subtrees) as the parse proceeds.
• Look up subtrees for each constituent rather than re-parsing.
• Since all parses are implicitly stored, all are available for later disambiguation.
Examples: the Cocke-Younger-Kasami (CYK) (1960), Graham-Harrison-Ruzzo (GHR) (1980) and Earley (1970) algorithms.

Earley Parser: an O(n^3) Parser
• A top-down parser with bottom-up information.
• State: (i, A → α • β, j)
  – i is the position in the string where A begins
  – j is the position in the string that has been parsed so far
• The state encodes a top-down prediction, S ⇒* w_1 … w_i A γ, together with the bottom-up completion of the material before the dot, α ⇒* w_{i+1} … w_j.

Earley Parser Data Structure
An (n+1)-cell array called the chart.
• For each word position, the chart contains the set of states representing all partial parse trees generated to date.
  – E.g. chart[0] contains all partial parse trees generated at the beginning of the sentence.
• Chart entries represent three types of constituents:
  – predicted constituents (top-down predictions)
  – in-progress constituents (we're in the midst of finding them)
  – completed constituents (we've found them)
• Progress in the parse is represented by dotted rules; the position of • indicates the type of constituent. With positions 0 Book 1 that 2 flight 3:
  – (0, S → • VP, 0)         predicting a VP
  – (1, NP → Det • Nom, 2)   finding an NP
  – (0, VP → V NP •, 3)      found a VP

Earley Parser: Parse Success
• The final answer is found by looking at the last entry in the chart.
• If an entry there resembles (0, S •, n), the input was parsed successfully.
• But note that the chart also contains a record of all possible parses of the input string, given the grammar, not just the successful one(s). Why is this useful?

Earley Parsing Steps
• Start state: (0, S' → • S, 0)
• End state: (0, S •, n), a completed S spanning the whole input, where n is the input size
• Next-state rules:
  – Scanner (read input): (i, A → α • w_{j+1} β, j) ⇒ (i, A → α w_{j+1} • β, j+1)
  – Predictor (add top-down predictions): (i, A → α • B β, j) ⇒ (j, B → • γ, j) for each rule B → γ (B is the left-most non-terminal to the right of the dot)
  – Completer (move the dot to the right when a new constituent is found): (i, B → α • A β, k) and (k, A → γ •, j) ⇒ (i, B → α A • β, j)
• No backtracking and no states are removed: the complete history of the parse is kept. Why is this useful?
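As a concrete companion to the Scanner/Predictor/Completer rules, here is a minimal Earley recognizer sketch in Python over the same fragment grammar; part-of-speech rules are treated as ordinary rules with terminal right-hand sides. The names (GRAMMAR, earley_recognize) are illustrative, and the sketch only recognizes: it does not keep the back-pointers needed to retrieve parses.

```python
# A minimal Earley recognizer implementing the Scanner / Predictor / Completer
# rules above. States are tuples (lhs, rhs, dot, start) stored in chart[j].

GRAMMAR = {
    "S":    [("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)],
    "NP":   [("Det", "Nom"), ("PropN",)],
    "Nom":  [("N",), ("N", "Nom"), ("Nom", "PP")],   # left recursion is fine for Earley
    "VP":   [("V",), ("V", "NP")],
    "PP":   [("Prep", "NP")],
    "Det":  [("that",), ("this",), ("a",)],
    "N":    [("book",), ("flight",), ("meal",), ("money",)],
    "V":    [("book",), ("include",), ("prefer",)],
    "Aux":  [("does",)],
    "Prep": [("from",), ("to",), ("on",)],
    "PropN": [("Houston",), ("TWA",)],
}

def earley_recognize(words):
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    chart[0].add(("S'", ("S",), 0, 0))                 # dummy start state (0, S' -> .S, 0)
    for j in range(n + 1):
        agenda = list(chart[j])
        while agenda:
            lhs, rhs, dot, start = agenda.pop()
            if dot < len(rhs) and rhs[dot] in GRAMMAR:            # Predictor
                for gamma in GRAMMAR[rhs[dot]]:                   # add (j, B -> .gamma, j)
                    new = (rhs[dot], gamma, 0, j)
                    if new not in chart[j]:
                        chart[j].add(new)
                        agenda.append(new)
            elif dot < len(rhs):                                  # Scanner
                if j < n and rhs[dot] == words[j]:                # dot moves over the word;
                    chart[j + 1].add((lhs, rhs, dot + 1, start))  # state goes to the NEXT cell
            else:                                                 # Completer
                for w_lhs, w_rhs, w_dot, w_start in list(chart[start]):
                    if w_dot < len(w_rhs) and w_rhs[w_dot] == lhs:    # a state waiting for lhs
                        new = (w_lhs, w_rhs, w_dot + 1, w_start)
                        if new not in chart[j]:
                            chart[j].add(new)
                            agenda.append(new)
    return any(lhs == "S'" and dot == len(rhs) and start == 0
               for lhs, rhs, dot, start in chart[n])

print(earley_recognize("book that flight".split()))                 # True
print(earley_recognize("does this flight include a meal".split()))  # True
print(earley_recognize("book flight that".split()))                 # False
```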
Earley Parser Steps (Scanner, Predictor, Completer)

Scanner
• When it applies: when a terminal is to the right of the dot, e.g. (0, VP → • V NP, 0).
• Which chart cell is affected: the new state is added to the next cell.
• What is added: the state with the dot moved over the terminal, e.g. (0, VP → V • NP, 1).

Predictor
• When it applies: when a non-terminal is to the right of the dot, e.g. (0, S → • VP, 0).
• Which chart cell is affected: new states are added to the current cell.
• What is added: one new state for each expansion of the non-terminal in the grammar, e.g. (0, VP → • V, 0) and (0, VP → • V NP, 0).

Completer
• When it applies: when the dot reaches the end of a rule, e.g. (1, NP → Det Nom •, 3).
• Which chart cell is affected: new states are added to the current cell.
• What is added: one state for each rule "waiting" for the completed constituent, e.g. (0, VP → V • NP, 1) becomes (0, VP → V NP •, 3).

Book that flight (Chart[0])
Seed the chart with top-down predictions for S from the grammar (the CFG for the fragment of English given earlier):
  γ → • S            [0,0]   dummy start state
  S → • NP VP        [0,0]   Predictor
  S → • Aux NP VP    [0,0]   Predictor
  S → • VP           [0,0]   Predictor
  NP → • Det Nom     [0,0]   Predictor
  NP → • PropN       [0,0]   Predictor
  VP → • V           [0,0]   Predictor
  VP → • V NP        [0,0]   Predictor

Chart[1]
  V → book •         [0,1]   Scanner
  VP → V •           [0,1]   Completer
  VP → V • NP        [0,1]   Completer
  S → VP •           [0,1]   Completer
  NP → • Det Nom     [1,1]   Predictor
  NP → • PropN       [1,1]   Predictor
V → book is passed to the Completer, which finds two states in Chart[0] whose left corner is V and adds them to Chart[1], moving their dots to the right.

Retrieving the Parses
• Augment the Completer to add pointers to the prior states it advances, as a field in the current state (i.e., which states combined to arrive here?).
• Read the pointers back from the final state.
• What if the final cell does not contain the final state? That is error handling. Is it a total loss? No:
  – the chart contains every constituent and combination of constituents possible for the input, given the grammar
  – this is useful for partial (shallow) parsing, as used in information extraction.

Alternative Control Strategies
• Change the Earley top-down strategy to bottom-up, or ...
• Change to a best-first strategy based on the probabilities of constituents:
  – compute and store the probabilities of constituents in the chart as you parse
  – then, instead of expanding states in a fixed order, let the probabilities control the order of expansion.

Probabilistic and Lexicalized Parsing

Probabilistic CFGs
Weighted CFGs:
• Attach weights to the rules of the CFG.
• Compute the weights of derivations.
• Use the weights to pick the preferred parses.
  – Utility: pruning and ordering the search space, disambiguation, a language model for ASR.
• Parsing with weighted grammars (like weighted FAs): T* = arg max_T W(T, S).
• Probabilistic CFGs are one form of weighted CFGs.

Probability Model
• Rule probability:
  – Attach probabilities to grammar rules; the expansions for a given non-terminal sum to 1.
      R1: VP → V          .55
      R2: VP → V NP       .40
      R3: VP → V NP NP    .05
  – Estimate the probabilities from annotated corpora, e.g. P(R1) = count(R1) / count(VP).
• Derivation probability:
  – A derivation is a set of rules T = {R1 … Rn}.
  – Probability of a derivation: P(T) = ∏_{i=1}^{n} P(Ri)
  – Most probable parse: T* = arg max_T P(T)
  – Probability of a sentence: P(S) = Σ_T P(T, S), summing over all possible derivations of the sentence.
  – Note the independence assumption: a rule's probability does not change based on where in the derivation it is expanded.
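To make the derivation probability concrete, here is a small numeric sketch. The three VP probabilities (.55, .40, .05) are the R1-R3 values above; every other probability in the table is invented purely for illustration and is not estimated from any corpus.

```python
# A small numeric sketch of the derivation probability P(T) = product of P(R_i).

RULE_PROB = {
    ("VP",  ("V",)):            0.55,   # R1 (from the slide)
    ("VP",  ("V", "NP")):       0.40,   # R2 (from the slide)
    ("VP",  ("V", "NP", "NP")): 0.05,   # R3 (from the slide)
    # hypothetical values, for illustration only:
    ("S",   ("VP",)):           0.10,
    ("NP",  ("Det", "Nom")):    0.60,
    ("Nom", ("N",)):            0.70,
    ("V",   ("book",)):         0.30,
    ("Det", ("that",)):         0.25,
    ("N",   ("flight",)):       0.20,
}

def derivation_probability(rules):
    """P(T) for a derivation T given as the list of rules it uses."""
    p = 1.0
    for rule in rules:
        p *= RULE_PROB[rule]
    return p

# One derivation of "book that flight": S -> VP, VP -> V NP, ...
T = [("S", ("VP",)), ("VP", ("V", "NP")), ("V", ("book",)),
     ("NP", ("Det", "Nom")), ("Det", ("that",)),
     ("Nom", ("N",)), ("N", ("flight",))]

print(derivation_probability(T))   # 0.10*0.40*0.30*0.60*0.25*0.70*0.20 = 2.52e-4
```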
Structural Ambiguity
Grammar:
  S → NP VP      VP → V NP      NP → NP PP      VP → VP PP      PP → P NP
  NP → John | Mary | Denver     V → called      P → from
Sentence: "John called Mary from Denver"
The PP "from Denver" can attach either to the NP "Mary" (NP → NP PP: John called [Mary from Denver]) or to the VP "called Mary" (VP → VP PP: John [called Mary] [from Denver]), giving two parse trees.

Cocke-Younger-Kasami (CKY) Parser
A bottom-up parser with top-down filtering.
• Start state(s): (A, i, i+1) for each rule A → w_{i+1}
• End state: (S, 0, n), where n is the input size
• Next-state rule: (B, i, k) and (C, k, j) ⇒ (A, i, j) if A → B C

Example: "John called Mary from Denver"
• Base case (A → w): the lexical rules fill the cells spanning single words: NP (John), V (called), NP (Mary), P (from), NP (Denver).
• Recursive cases (A → B C): adjacent spans are combined bottom-up:
  – VP → V NP over "called Mary" and PP → P NP over "from Denver"
  – NP → NP PP over "Mary from Denver"
  – two VPs over "called Mary from Denver": VP1 → V NP (using the larger NP) and VP2 → VP PP
  – S → NP VP over the whole sentence, once for each VP, so both attachments are recovered.

Probabilistic CKY
• Assign probabilities to constituents as they are completed and placed in the table.
• Computing the probability (for a split point k):
    P(A, i, j) = Σ_{A → B C} P(A → B C, i, j)
    P(A → B C, i, j) = P(B, i, k) · P(C, k, j) · P(A → B C)
• Since we are interested in max P(S, 0, n):
  – keep the maximum probability for each constituent
  – maintain back-pointers to recover the parse.

Problems with PCFGs
The probability model we are using is based only on the rules in the derivation.
• Lexical insensitivity:
  – It doesn't use the words in any real way.
  – But structural disambiguation is lexically driven: PP attachment often depends on the verb, its object, and the preposition.
    "I ate pickles with a fork." / "I ate pickles with relish."
• Context insensitivity of the derivation:
  – It doesn't take into account where in the derivation a rule is used.
  – Pronouns are more often subjects than objects: "She hates Mary." / "Mary hates her."
• Solution: lexicalization, i.e. add lexical information to each rule.

An Example of Lexical Information: Heads
• Make use of the notion of the head of a phrase:
  – the head of an NP is its noun
  – the head of a VP is its main verb
  – the head of a PP is its preposition
• Each LHS of a rule in the PCFG carries a lexical item; each RHS non-terminal carries a lexical item; one of the RHS lexical items is shared with the LHS.
• If R is the number of binary branching rules in the CFG and Σ is the vocabulary, the lexicalized CFG has O(2·|Σ|·|R|) binary rules and O(|Σ|·|R|) unary rules.
(Slides: "Example (correct parse)", "Attribute grammar", "Example (less preferred)" figures illustrating head-annotated trees.)

Computing Lexicalized Rule Probabilities
• We started with rule probabilities, e.g. for VP → V NP PP:
  – P(rule | VP), e.g. the count of this rule divided by the number of VPs in a treebank.
• Now we want lexicalized probabilities, e.g. for VP(dumped) → V(dumped) NP(sacks) PP(in):
  – P(rule | VP, dumped is the verb, sacks is the head of the NP, in is the head of the PP)
  – Such events are not likely to have significant counts in any treebank.

Another Example
Consider the VPs "ate spaghetti with gusto" and "ate spaghetti with marinara".
• In "ate spaghetti with gusto" the PP(with) attaches to VP(ate), while in "ate spaghetti with marinara" it attaches to NP(spaghetti).
• The relevant dependency is not between mother and child in a single rule.
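The sparse-counts problem can be seen in a toy relative-frequency estimator. The sketch below assumes a hypothetical, hand-made list of head-annotated rule occurrences (it is not real treebank data); in practice the counts would come from a treebank and would need smoothing or back-off to be usable.

```python
# A toy relative-frequency estimator for lexicalized rule probabilities,
# illustrating why the counts get sparse.

from collections import Counter

# Each occurrence: (parent category, parent head word, head-annotated RHS).
# Invented for illustration only.
observed_rules = [
    ("VP", "dumped", ("V(dumped)", "NP(sacks)", "PP(in)")),
    ("VP", "dumped", ("V(dumped)", "NP(sacks)")),
    ("VP", "ate",    ("V(ate)", "NP(spaghetti)", "PP(with)")),
    ("VP", "ate",    ("V(ate)", "NP(spaghetti)")),
    ("VP", "ate",    ("V(ate)", "NP(spaghetti)")),
]

rule_counts   = Counter(observed_rules)                        # count(parent, head, rhs)
parent_counts = Counter((p, h) for p, h, _ in observed_rules)  # count(parent, head)

def lexicalized_rule_prob(parent, head, rhs):
    """P(parent -> rhs | parent, head), estimated by relative frequency."""
    if parent_counts[(parent, head)] == 0:
        return 0.0        # the sparse-data problem: this (parent, head) was never seen
    return rule_counts[(parent, head, rhs)] / parent_counts[(parent, head)]

print(lexicalized_rule_prob("VP", "ate", ("V(ate)", "NP(spaghetti)")))               # 2/3
print(lexicalized_rule_prob("VP", "dumped", ("V(dumped)", "NP(sacks)", "PP(in)")))   # 1/2
print(lexicalized_rule_prob("VP", "gave", ("V(gave)", "NP(books)")))                 # 0.0, unseen head
```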
Log-linear Models for Parsing
• Why restrict the conditioning to the elements of a rule?
  – Use an even larger context: the word sequence, word types, sub-tree context, etc.
• In general, compute P(y|x), where the features f_i(x, y) test properties of the context and λ_i is the weight of feature f_i:

    P(y|x) = exp(Σ_i λ_i f_i(x, y)) / Σ_{y'∈Y} exp(Σ_i λ_i f_i(x, y'))

• Use these as scores in the CKY algorithm to find the best-scoring parse. (A small numeric sketch of this computation is given after the summary.)

Supertagging: Almost Parsing
(Slide figure: the sentence "Poachers now control the underground trade" with, for each word, its set of candidate supertags, i.e. elementary tree fragments rooted in S, NP, VP, etc.; the point of the title is that choosing the right supertag for each word already does most of the work of parsing.)

Summary
• Parsing context-free grammars:
  – top-down and bottom-up parsers
  – mixed approaches (CKY and Earley parsers)
• Preferences over parses using probabilities:
  – parsing with PCFGs and the probabilistic CKY algorithm
• Enriching the probability model:
  – lexicalization
  – log-linear models for parsing
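To make the log-linear formula above concrete, here is a small numeric sketch. Everything in it is invented for illustration: the two candidate analyses, the three feature functions and the weights are made up; a real model would use many thousands of features over word sequences, word types and sub-tree context, and the resulting scores would be plugged into CKY.

```python
# A small numeric sketch of the log-linear computation
#   P(y|x) = exp(sum_i lambda_i * f_i(x, y)) / sum_{y' in Y} exp(sum_i lambda_i * f_i(x, y')).

import math

CANDIDATES = ["NP-attachment", "VP-attachment"]   # hypothetical y values for one x

def features(x, y):
    """f_i(x, y): binary tests of properties of the context (illustrative only)."""
    return {
        "verb=ate & prep=with & VP-attach": float("ate" in x and y == "VP-attachment"),
        "noun=spaghetti & prep=with & NP-attach": float("spaghetti" in x and y == "NP-attachment"),
        "NP-attach": float(y == "NP-attachment"),
    }

WEIGHTS = {   # the lambda_i, invented for illustration
    "verb=ate & prep=with & VP-attach": 1.5,
    "noun=spaghetti & prep=with & NP-attach": 0.8,
    "NP-attach": -0.3,
}

def score(x, y):
    return sum(WEIGHTS[name] * value for name, value in features(x, y).items())

def prob(x, y):
    z = sum(math.exp(score(x, y2)) for y2 in CANDIDATES)   # normalization over Y
    return math.exp(score(x, y)) / z

x = "ate spaghetti with gusto"
for y in CANDIDATES:
    print(y, round(prob(x, y), 3))   # VP-attachment gets the higher probability here
```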