%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  Extracts from the book "Natural Language Processing in LISP",
%  published by Addison-Wesley.
%  Copyright (c) 1989, Gerald Gazdar & Christopher Mellish.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Words, rules and structures
Tree diagrams are not, given the limitations of present computers, a
convenient data structure to use for NLP. The simplest way to handle
trees is to manipulate them as lists where the first element is the
root category and subsequent elements are the daughter subtrees in
order of occurrence. Thus the tree above in LISP would be:
(S (NP MediCenter) (VP (V employed) (NP nurses)))
The general principle is that a tree is represented by a list whose
first element is the top label and whose subsequent elements are the
immediate subtrees in order. These elements are then themselves
lists, representing the subtrees according to the same convention. In
this scheme, a leaf node of a tree (here, an English word) is simply
represented by itself. This kind of list might appear in a
'prettyprinted' format as follows:
(S
  (NP MediCenter)
  (VP
    (V employed)
    (NP nurses)))
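Because a tree in this format is just nested lists, useful operations on trees fall out of the ordinary list functions. As a small illustration (this is not part of the book's libraries; the name tree_leaves is invented here), the following sketch collects the words at the leaves of such a tree:
(defun tree_leaves (tree)
  ;; a leaf node is represented by a bare atom
  (if (atom tree)
      (list tree)
      ;; otherwise skip the root label (the car) and gather the
      ;; leaves of each daughter subtree in order
      (mapcan #'tree_leaves (cdr tree))))
Applied to the tree above, (tree_leaves '(S (NP MediCenter) (VP (V employed) (NP nurses)))) returns the word list (MEDICENTER EMPLOYED NURSES).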
Grammars in LISP and random generation
In LISP programs, we will represent a grammar by a list of structures,
one for each rule, and store the current grammar in a global variable
called rules. Each rule is represented by a list of items. The first
represents the left hand side of the rule, and the rest represent the
items on the right hand side in sequence. In rules, English words are
represented by the corresponding LISP atoms and grammatical categories
are put inside lists to distinguish them from words. In grammars like
Grammar1, where the only relevant feature for categories is cat, we
need not explicitly mention the feature names in rules, and so, in
writing rules for this kind of grammar, we will specify the category
that has cat S simply by (S), for instance. Note that, although we
will adopt this convention for rules, we will still build parse trees
as described above, i.e. with bare cat values like S, rather than
lists like (S), as labels. Here is our representation of Grammar1 in
LISP (also in lib cfpsgram):
(setq rules
  '(((S) (NP) (VP))
    ((VP) (V))
    ((VP) (V) (NP))
    ((V) died)
    ((V) employed)
    ((NP) nurses)
    ((NP) patients)
    ((NP) Medicenter)
    ((NP) Dr Chan)))
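Under this encoding, the left hand side and right hand side of a rule can be picked out with the ordinary list functions, which is how the programs below will use them. For example:
(car '((S) (NP) (VP)))    ; the LHS category:       (S)
(cdr '((S) (NP) (VP)))    ; the RHS items in order:  ((NP) (VP))
(cdr '((NP) Dr Chan))     ; a two-word RHS:          (DR CHAN)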
Notice that we have included the lexicon in the set of rules without
making any explicit distinction. Thus, for instance, we have
interpreted the lexical entry:
Word employed:
<cat> = V.
simply as another way to specify the rule
V -> employed
One advantage of this approach is that it allows us to deal with 'Dr
Chan' as a sequence of two words in a natural way. There are a number
of reasons why, for efficiency, it might make sense to treat lexical
entries differently from other rules, however. We will not make these
optimisations here as they make the algorithms harder to follow.
One important way in which computers can be of use to formal linguists
is by allowing them to experiment with different grammars and show
them implications of the various choices they make which might be hard
to work out by hand. A useful tool in this context is a program that
can randomly generate sentences that are judged legal by a grammar.
Notice that it would not be so useful to have a random generator built
to work only with a specific grammar (in the way that the ATN program
developed in Chapter 3 was specific to a particular analysis of
English). For linguists would then have to get involved with
programming whenever they wanted to change a rule in the grammar.
Instead we want to have a program that can accept any grammar (in some
formalism), written in a way not dissimilar to how a linguist would
write one, and randomly produce legal sentences. In our LISP program
(lib randgen), we will deal with context-free phrase structure
grammars, represented by list structures as above.
Here is the top level function of our random generator. The function
generate takes a description of the kind of phrase to be generated
(e.g. (S)) and produces as its result a list of atoms (e.g. (Dr Chan
died)).
(defun generate (description)
  (if (atom description)
      (list description)
      (let ((rs (matching_rules description)))
        (if (null rs)
            (error "Cannot generate description")
            (generate_all (cdr (oneof rs)))))))
generate will be called recursively with descriptions of smaller and
smaller phrases until it is called with a single atom as the
description. In this case the atom itself (packaged in a list) is
returned as the result. If the description is not a single atom, it
must be a list representing a syntactic category; so, to generate from
it, the program must look for rules about what a phrase of this kind
can consist of. matching_rules returns the list of all such rules,
and one is then chosen at random using the function oneof. oneof
simply takes a list and returns a random element, using the LISP
function random to generate a number indicating the position of the
element to be chosen:
(defun oneof (list)
  ;; randomly returns one element of the given list
  (nth (random (length list)) list))
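For example (this call is not from the book's text, and the result naturally differs from run to run):
(oneof '(((V) died) ((V) employed)))
;; =>  ((V) DIED) or ((V) EMPLOYED), each with probability 1/2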
The RHS of the rule chosen is extracted (the LHS, which is the same as
the original description, being ignored) and passed to generate_all.
Since the RHS of a rule is just a list of the kind of items that
generate can work from, generate_all simply calls generate on each
element, concatenating the results together:
(defun generate_all (body)
  (if (null body)
      '()
      (append
        (generate (car body))
        (generate_all (cdr body)))))
matching_rules has to work through the grammar rules, collecting
together in a list all the ones that start with the right category.
It uses a function unify to compare the description it is generating
from with the LHS of the rule. In a more general situation, where the
description and LHS may specify values of several features, unify will
have to be more complicated, as we will see. For our current
purposes, however, it suffices for unify to test whether the two lists
are equal.
(defun matching_rules (description)
  (let ((results ()))
    (dolist (rule rules results)
      (if (unify description (car rule))
          (setq results (cons rule results))))))

(defun unify (x y)
  (equal x y))
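With these definitions loaded and rules set to the Grammar1 list given earlier, a session might go as follows (the output is random, so these particular results are only examples of sentences the grammar allows):
(generate '(S))    ; =>  (MEDICENTER EMPLOYED NURSES), say
(generate '(S))    ; =>  (DR CHAN DIED)
(generate '(NP))   ; =>  (PATIENTS)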
This simple program works well with small grammars but would have to
be improved to work efficiently if we had hundreds of rules. One
source of inefficiency is the search through the whole list of rules
carried out by matching_rules every time a category has to be
generated from. If we grouped rules with the same category on the
'left hand side' together, perhaps as follows:
(setq rules
  '(((S)
     ((NP) (VP)))
    ((VP)
     ((V))
     ((V) (NP)))
    ((NP)
     (Dr Chan)
     (Medicenter)
     (nurses)
     (patients))
    ((V)
     (died)
     (employed))))
then matching_rules would only have to work through a list of groups,
rather than a list of individual rules. Alternatively, rules could be
stored in a LISP hash table or on the property lists of the LHS
categories, for rapid retrieval, given the left hand category.
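The hash table alternative might look like the following sketch (this is not the book's code; the names rule_table and index_rules are invented here, and the direct gethash lookup is only adequate while unify is simple equality):
(defvar rule_table (make-hash-table :test #'equal))

(defun index_rules (rules)
  ;; file each rule under its LHS category
  (clrhash rule_table)
  (dolist (rule rules rule_table)
    (push rule (gethash (car rule) rule_table))))

(defun matching_rules (description)
  ;; a single lookup replaces the scan of the whole grammar
  (gethash description rule_table))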
If you have just written a large phrase structure grammar, this
generation program might be useful to help you detect places where the
grammar is not strict enough in its characterisation of legality.
Getting an example of an illegal sentence that can be generated is,
however, not necessarily very useful. What one more often wants to
know is how the grammar managed to generate such an object. For this
reason, it is interesting to see whether we can generate the parse
trees of random sentences allowed by the grammar. In fact, our
program can be altered straightforwardly to do this, and the result is
called lib randtree. In this modified program, generate returns a
single parse tree (represented as a list in the way we showed above),
and generate_all returns a list of parse trees. Here are the new
definitions (the other definitions do not need to be changed):
(defun generate (description)
  (if (atom description)
      description
      (let ((rs (matching_rules description)))
        (if (null rs)
            (error "Cannot generate from description ~S" description)
            (cons
              (category description)
              (generate_all (cdr (oneof rs))))))))
(defun generate_all (body)
  (if (null body)
      '()
      (cons
        (generate (car body))
        (generate_all (cdr body)))))

(defun category (description)
  (car description))
Notice how, now that generate returns a single item (parse tree),
rather than a sequence of items (atoms), the call to append in
generate_all is replaced by a call to cons.
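With the Grammar1 rules, a call to the new generate might now return a parse tree such as the following (the particular tree again varies from run to run):
(generate '(S))
;; =>  (S (NP DR CHAN) (VP (V DIED)))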
Encoding feature specifications in LISP
Collections of feature names and associated values are easily
represented in LISP. We will represent them using association lists,
with feature names acting as keys. For instance, the following list:
((person 3) (number sing))
represents something with person and number features. The values for
these are respectively 3 and sing. As before, we can represent a
grammar rule by a list consisting of the LHS category followed by the
RHS categories, these categories now being more complex than the
single-element lists used previously. The main complication is that a
PATR rule can state that two features must have the same value,
without specifying in advance what that value is to be. In our
representation, where two features are required to share a value (by a
PATR '=' statement), the two values will be denoted by occurrences of
the same variable symbol. Variable symbols need to be distinguished
from normal atomic values in some way, but apart from this only one
aspect of their form is of importance. Where the same variable symbol
occurs several times, the appropriate values have to be the same;
where two different variable symbols appear, the values do not have to
be the same. We have chosen to represent a variable symbol by an atom
beginning with a '_', because this is easy to type, yet easily
distinguished from the feature names and values that we are likely to
use. A PATR rule can then be translated into the LISP notation by
collecting up all the information provided about each given category
into a single list, using multiple occurrences of the same variable to
indicate where a value must be shared. For instance, the Grammar4
rule:
Rule {single complement verbs}
VP -> V X:
   <V arg1> = <X cat>
   <VP slash> = <X slash>.
could be translated into the following in LISP:
(((cat VP) (slash _y))
 ((cat V) (arg1 _x))
 ((cat _x) (slash _y)))
where the variable symbol _y is used to indicate that the slash of the
VP must be the same as the slash of the second subphrase and the
symbol _x is used to indicate that the cat of the second subphrase
must be the same as the arg1 of the verb.
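Since each category in this representation is just an association list, the value of a particular feature can be read off with the standard function assoc, for example:
(cadr (assoc 'arg1 '((cat V) (arg1 _x))))      ; =>  _X
(cadr (assoc 'slash '((cat VP) (slash _y))))   ; =>  _Y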
The LISP notation makes it relatively easy to see how a single rule
using features corresponds to a number of context-free phrase
structure rules. With the above rule, a possible CF-PSG rule can be
derived first of all by filling out the representation to mention
explicitly all possible features for all phrases. Thus if cat, slash
and arg1 are the only possible features then the above rule is 'filled
out' to:
(((cat VP) (slash _y) (arg1 _v1))
 ((cat V) (arg1 _x) (slash _v2))
 ((cat _x) (slash _y) (arg1 _v3)))
where the extra variables _v1, _v2 and _v3 appear once each. It is
now possible to consider a completely specified instance of this rule
by choosing a possible concrete value for each variable, for instance:
_y  = NP
_x  = NP
_v1 = 0
_v2 = 0
_v3 = 0
This assignment of values to variables gives the following rule:
(((cat VP) (slash NP) (arg1 0))
 ((cat V) (arg1 NP) (slash 0))
 ((cat NP) (slash NP) (arg1 0)))
All we need now is a convention for translating each completely
specified feature bundle into an atomic symbol. For instance, we
could take the cat, slash, and arg1 feature values in order, separated
by '_'s. We would then have the following context-free rule:
((VP_NP_0)
 (V_0_NP)
 (NP_NP_0))
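This final translation step is easily programmed; here is a possible sketch (not one of the book's functions; the name category_to_symbol is invented here, and it assumes each category lists its feature-value pairs in the fixed order cat, slash, arg1 used above):
(defun category_to_symbol (category)
  ;; join the value of each feature-value pair with '_' and
  ;; intern the result as a single atomic symbol
  (intern (format nil "~{~A~^_~}" (mapcar #'cadr category))))

(category_to_symbol '((cat VP) (slash NP) (arg1 0)))   ; =>  VP_NP_0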
Corresponding to different variable assignments, we will, of course,
get different context-free phrase structure rules out at the end. But
as long as we are consistent about our conventions (for instance in
translating at the end from feature bundles to symbols) the resulting
grammar will be exactly equivalent to the original one.
The top level of a LISP program to do this translation might look like
the following function, which prints out all the CF-PSG rules that
correspond to a given rule:
(defun expand_rule (rule)
  (let ((newrule (fillout_rule rule)))
    (subst_values varlist newrule)))
fillout_rule produces a new rule where each category consists of a
list of bare values for each feature in turn, obtained by adding new
variables (like _v1 in the above) for unmentioned features. It also
creates in the global variable varlist a list with an entry for each
variable symbol and the list of possible values it can have, for
instance:
((_x NP S PP 0)
(_v1 NP 0 N V PP))
How is it to know what possible features there are and what possible
values they can have? The possible features, together with their
possible values, will have to be declared in a global variable:
(defvar features
  '((cat S NP VP PP AP P N V 0)
    (slash S NP VP PP AP P N V 0)
    (subcat S NP VP PP AP P N V 0)
    (empty yes no)))
subst_values is in charge of enumerating every possible concrete
instance of the rule obtainable by substituting one of the possible
values for each of the variables in the varlist.
(defun subst_values (varlist rule)
  (if (null varlist)
      (print rule)
      (dolist (val (cdar varlist))
        (subst_values (cdr varlist)
                      (subst_value (caar varlist) val rule)))))
The function uses recursion to carry out the enumeration of all the
possibilities. Each call of subst_values is in charge of trying out
all possible values for the variable that occurs first in the list
(the caar of varlist). For each such value (an element of the cdar of
varlist), it calls subst_value to substitute that value for the
variable, and then calls subst_values
again to deal with the rest of the variables. When the recursion
bottoms out, the actual value of the rule (with values for all the
variables) is printed out. subst_value is easily defined:
(defun subst_value (name val thing)
  (if (equal name thing)
      val
      (if (consp thing)
          (cons
            (subst_value name val (car thing))
            (subst_value name val (cdr thing)))
          thing)))
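For instance, substituting NP for every occurrence of the variable _x in a category gives:
(subst_value '_x 'NP '((cat _x) (arg1 _x)))
;; =>  ((CAT NP) (ARG1 NP))
and subst_values then tries every such substitution in turn: with a varlist of two variables, each having two possible values, it would print four fully instantiated rules.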
In Chapter 7 we will look in more detail at the semantics of feature
structures and develop LISP programs for more interesting
applications.