Supporting Material: T0264P23_2

PARSING
Patrick Blackburn and Kristina Striegnitz
Version 1.2.4 (20020829)

8.1 Introduction

Last week we discussed bottom-up parsing/recognition, and implemented a simple bottom-up recognizer called bureg/1. Unfortunately, bureg/1 was very inefficient, and its inefficiency had two sources. First, the implementation, which relied on heavy use of append/3 to split lists up into all possible sequences of three sublists, was highly inefficient. Second, the algorithm was naive: it did not make use of a chart to record what had already been discovered about the syntactic structure.

This week we discuss top-down parsing, and we will take care to remove the implementational inefficiency: we won't use append/3, but will work with difference lists instead. As we shall see, this makes a huge difference to the performance. Whereas bureg/1 is unusable for anything except very small sentences and grammars, today's parser and recognizer will easily handle examples that bureg/1 finds difficult. However, we won't remove the deeper algorithmic inefficiency: today's top-down algorithms are still naive, in that they don't make use of a chart.

8.2 Top Down Parsing

As we have seen, in bottom-up parsing/recognition we start at the most concrete level (the level of words) and try to show that the input string has the abstract structure we are interested in (this usually means showing that it is a sentence). So we use our CFG rules right-to-left.

In top-down parsing/recognition we do the reverse. We start at the most abstract level (the level of sentences) and work down to the most concrete level (the level of words). So, given an input string, we start out by assuming that it is a sentence, and then try to prove that it really is one by using the rules left-to-right. That works as follows: if we want to prove that the input is of category s and we have the rule s ---> [np,vp], then we will try next to prove that the input string consists of a noun phrase followed by a verb phrase. If we furthermore have the rule np ---> [det,n], we try to prove that the input string consists of a determiner followed by a noun and a verb phrase. That is, we use the rules in a left-to-right fashion to expand the categories that we want to recognize until we have reached categories that match the preterminal symbols corresponding to the words of the input sentence.

Of course there are lots of choices still to be made. Do we scan the input string from right-to-left, from left-to-right, or zig-zagging out from the middle? In what order should we scan the rules? More interestingly, do we use depth-first or breadth-first search? In what follows we'll assume that we scan the input left-to-right (that is, the way we read) and the rules from top to bottom (that is, the way Prolog reads). But we'll look at both depth-first and breadth-first search.

8.2.1 With Depth First Search

Depth-first search means that whenever there is more than one rule that could be applied at one point, we explore one possibility and only look at the others when this one fails. Let's look at an example. Here's part of the grammar ourEng.pl, which we introduced last week:

   s  ---> [np,vp].
   np ---> [pn].
   vp ---> [iv].
   vp ---> [tv,np].

   lex(vincent,pn).
   lex(mia,pn).
   lex(died,iv).
   lex(loved,tv).
   lex(shot,tv).

The sentence ``Mia loved vincent'' is admitted by this grammar. Let's see how a top-down parser using depth-first search would go about showing this. The following table shows the steps a top-down depth-first parser would make.

   State   Categories to recognize   Input still to be covered
   1       s                         mia loved vincent
   2       np vp                     mia loved vincent
   3       pn vp                     mia loved vincent
   4       vp                        loved vincent
   5       iv                        loved vincent       (fails: loved is not an iv)
   4'      vp                        loved vincent       (backtrack)
   5'      tv np                     loved vincent
   6'      np                        vincent
   7'      pn                        vincent
   8'                                                    (success: nothing left)
The second column gives the categories the parser tries to recognize in each step, and the third column the string that still has to be covered by these categories. It should be clear why this approach is called top-down: we clearly work from the abstract to the concrete, and we make use of the CFG rules left-to-right.

And why was this an example of depth-first search? Because when we were faced with a choice, we selected one alternative and worked out its consequences. If the choice turned out to be wrong, we backtracked. For example, above we were faced with a choice of which way to try to build a VP: using an intransitive verb or a transitive verb. We first tried to do so using an intransitive verb (at state 4), but this didn't work out (state 5), so we backtracked and tried a transitive analysis (state 4'). This eventually worked out.

8.2.2 With Breadth First Search

Let's look at the same example with breadth-first search. The big difference between breadth-first and depth-first search is that in breadth-first search we carry out all possible choices at once, instead of just picking one. It is useful to imagine that we are working with a big bag containing all the possibilities we should look at, so in what follows I have used set-theoretic braces to indicate this bag. When we start parsing, the bag contains just one item (each item pairs the categories still to be recognized with the input still to be covered):

   1   { (s, mia loved vincent) }
   2   { (np vp, mia loved vincent) }
   3   { (pn vp, mia loved vincent) }
   4   { (vp, loved vincent) }
   5   { (iv, loved vincent), (tv np, loved vincent) }
   6   { (np, vincent) }
   7   { (pn, vincent) }
   8   { (, ) }                        (success: nothing left)

The crucial difference occurs at state 5. There we try both ways of building VPs at once. At the next step, the intransitive analysis is discarded, but the transitive analysis remains in the bag, and eventually succeeds.

The advantage of breadth-first search is that it prevents us from zeroing in on one choice that may turn out to be completely wrong; this often happens with depth-first search, which then causes a lot of backtracking. Its disadvantage is that we need to keep track of all the choices, and if the bag gets big (and it may get very big) we pay a computational price. So which is better? There is no general answer. With some grammars breadth-first search works better, with others depth-first.

8.3 Top Down Recognition in Prolog

It is easy to implement a top-down depth-first recognizer in Prolog, for this is the strategy Prolog itself uses in its search. Actually, it's not hard to implement a top-down breadth-first recognizer in Prolog either, though I'm not going to discuss how to do that. The implementation will be far better than that used in the naive bottom-up recognizer that we discussed last week. This is not because top-down algorithms are better than bottom-up ones, but simply because we are not going to use append/3. Instead we'll use difference lists.

Here's the main predicate, recognize_topdown/3. Note the operator declaration (we want to use the ---> notation we introduced last week).

   ?- op(700,xfx,--->).

   recognize_topdown(Category,[Word|Reststring],Reststring) :-
      lex(Word,Category).

   recognize_topdown(Category,String,Reststring) :-
      Category ---> RHS,
      matches(RHS,String,Reststring).

Here Category is the category we want to recognize (s, np, vp, and so on). The second and third arguments are a difference list representation of the string we are working with (read this as: the second argument starts with a string of category Category, leaving behind Reststring, the third argument). The first clause deals with the case that Category is a preterminal that matches the category of the next word on the input string. That is: we've got a match and can remove that word from the string that is to be recognized. The second clause deals with phrase structure rules.
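To see the difference-list reading of these arguments in action, consider what the first clause does on its own. The following is only a hypothetical session, assuming the lexical entries shown earlier (for instance lex(mia,pn)) have been consulted:

   ?- recognize_topdown(pn,[mia,loved,vincent],Rest).

   Rest = [loved,vincent]
   yes

The word mia is recognized as being of the preterminal category pn, and everything after it is handed back in Rest, ready to be matched against whatever categories remain.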
Note that this second clause uses the CFG rules left-to-right: Category will be instantiated with something, so we look for rules with Category as a left-hand side, and then we try to match the right-hand side of these rules (that is, RHS) with the string. Now for matches/3, the predicate which does all the work:

   matches([],String,String).

   matches([Category|Categories],String,RestString) :-
      recognize_topdown(Category,String,String1),
      matches(Categories,String1,RestString).

The first clause handles an empty list of symbols to be recognized: the string is returned unchanged. The second clause lets us match a non-empty list against the difference list. This works as follows. We want to see if String begins with strings belonging to the categories [Category|Categories], leaving behind RestString. So we see if String starts with a substring of category Category (the first item on the list). Then we recursively call matches/3 to see whether what's left over (String1) starts with substrings belonging to the categories Categories, leaving behind RestString. This is classic difference list code.

Finally, we can wrap this up in a driver predicate:

   recognize_topdown(String) :-
      recognize_topdown(s,String,[]).

Now we're ready to play. We shall make use of the ourEng.pl grammar that we worked with last week. We used this same grammar with our bottom-up recognizer bottomup_recognizer.pl, and we saw that it was very easy to grind bottomup_recognizer.pl into the dust. For example, the following are all sentences admitted by the ourEng.pl grammar:

   jules believed the robber who shot the robber fell
   jules believed the robber who shot the robber who shot marsellus fell

The bottom-up recognizer takes a long time on these examples, but the top-down program handles them without problems. The following sentence is not admitted by the grammar, because the last word is spelled wrong (felll instead of fell):

   jules believed the robber who shot marsellus felll

Unfortunately it takes bottomup_recognizer.pl a long time to find that out, and hence to reject the sentence. The top-down program is far better.

8.4 Top Down Parsing in Prolog

It is easy to turn this recognizer into a parser, and (unlike with bottomup_recognizer.pl) it's actually worth doing this, because it is efficient on small grammars. As is so often the case in Prolog, moving from a recognizer to a parser is simply a matter of adding additional arguments to record the structure we find along the way. Here's the code. The ideas involved should be familiar by now. Read what is going on in the fourth argument position declaratively:

   ?- op(700,xfx,--->).

   parse_topdown(Category,[Word|Reststring],Reststring,[Category,Word]) :-
      lex(Word,Category).

   parse_topdown(Category,String,Reststring,[Category|Subtrees]) :-
      Category ---> RHS,
      matches(RHS,String,Reststring,Subtrees).

   matches([],String,String,[]).

   matches([Category|Categories],String,RestString,[Subtree|Subtrees]) :-
      parse_topdown(Category,String,String1,Subtree),
      matches(Categories,String1,RestString,Subtrees).

And here's the new driver that we need:

   parse_topdown(String,Parse) :-
      parse_topdown(s,String,[],Parse).

Time to play. Here's a simple example:

   ?- parse_topdown([vincent,fell],Parse).

   Parse = [s,[np,[pn,vincent]],[vp,[iv,fell]]]
   yes

And another one:

   ?- parse_topdown([vincent,shot,marsellus],Parse).

   Parse = [s,[np,[pn,vincent]],[vp,[tv,shot],[np,[pn,marsellus]]]]
   yes

And here's a much harder one:

   ?- parse_topdown([jules,believed,the,robber,who,shot,the,robber,who,shot,the,
                     robber,who,shot,marsellus,fell],Parse).
   Parse = [s,[np,[pn,jules]],[vp,[sv,believed],[s,[np,[det,the],
      [nbar,[n,robber],[rel,[wh,who],[vp,[tv,shot],[np,[det,the],
      [nbar,[n,robber],[rel,[wh,who],[vp,[tv,shot],[np,[det,the],
      [nbar,[n,robber],[rel,[wh,who],[vp,[tv,shot],
      [np,[pn,marsellus]]]]]]]]]]]]]],[vp,[iv,fell]]]]]
   yes

As this last example shows, we really need a pretty-print output!
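Here is one way such an output routine could look. This is only a minimal sketch, not part of ourEng.pl or the code above; the predicate names pptree/1, pptree/2 and pptree_list/2, and the two-space indentation scheme, are choices made purely for illustration:

   % pptree(+Tree): pretty-print a parse tree of the kind built by
   % parse_topdown/4, i.e. a nested list whose first element is a category.
   pptree(Tree) :-
      pptree(Tree,0).

   % Preterminal node: a two-element list [Category,Word], e.g. [pn,vincent].
   pptree([Category,Word],Column) :-
      atomic(Word), !,
      tab(Column), write(Category), write(' '), write(Word), nl.

   % Phrasal node: a category followed by a list of subtrees.
   pptree([Category|Subtrees],Column) :-
      tab(Column), write(Category), nl,
      NewColumn is Column + 2,
      pptree_list(Subtrees,NewColumn).

   % Print each subtree in turn, all at the same indentation level.
   pptree_list([],_).
   pptree_list([Tree|Trees],Column) :-
      pptree(Tree,Column),
      pptree_list(Trees,Column).

A query such as ?- parse_topdown([vincent,shot,marsellus],Parse), pptree(Parse). would then display one category per line, with each subtree indented two spaces under its mother node.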