3 - Syntax: How to Put It
Grammar, Efficiency and Recovering Unknown Words
We will now take a closer look at the NLP way of analyzing texts.
Any communication or speech act is built from seven distinct
processes: intention, generation, synthesis, perception, analysis,
disambiguation and incorporation. Of these, the first three take place
in the speaker or sender, and the last four happen in the hearer's
mind. In this context, only generation, analysis and disambiguation are
of interest to us.
During generation, the speaker chooses words or symbols appropriate to
what he wants to convey to the hearer.
Analysis means that the perceived string is processed by the hearer in
order to extract its possible meanings. This consists of both syntactic
interpretation (also called parsing) and semantic interpretation, which
takes into account the words' individual meanings as well as their
meaning in the current situation. The result of the analysis of a
syntactically correct sentence is something equivalent to a parse
tree (words connected to phrases).
Disambiguation, finally, picks out the meaning that has most likely
been intended by the sender, as some syntactically correct constructs
allow for more than one semantic interpretation. You cannot know
exactly what the sender wanted to express without having direct
access to his knowledge.
On this page: 3.1 - The Grammar of Formal Languages | 3.2 - Parsing and Efficiency
3.1 - The Grammar of Formal Languages
Certain rules apply to the structure of a message (a series of symbols with a
special meaning).
Of course, the message must be formulated in a language common to both the
sender and the hearer; this can be either a formal (invented) language or a
natural language like English, German or Chinese.
Natural languages can be partly represented by special formal languages. The
basic parts of such a language are terminal symbols - symbols (tokens) that are
final, i.e. words in natural languages. These terminal symbols form phrases -
parts of speech that stand for certain grammatical categories like nouns, verbs
etc. Noun phrases, for example, describe nouns in detail - "the red herring",
"the one I saw" etc., while verb phrases express action, behavior or state - "is
dead", "reads quickly" and so on. We will refer to noun phrases by NP and to
verb phrases by VP. These and other categories all combine to create a complete
sentence S. These groups are called nonterminal symbols. In this context a
language is a set of strings composed from terminal symbols according to a
series of rules, the grammar.
The terminal symbols (the words) must be present in a lexicon - a list,
subdivided into sections for nouns, verbs, adjectives etc., that includes all
words allowed in this language.
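To make this concrete, here is a minimal sketch in Python of such a lexicon. The name LEXICON and the word lists (including the deliberately ambiguous entries "chinese" and "signs", which will matter later) are assumptions made for illustration, not part of the original text:

# A toy lexicon: one section per word category, listing the words
# allowed in this language. Ambiguous words appear in more than one
# section (e.g. "signs" can be a noun or a verb).
LEXICON = {
    "DET":  ["the"],
    "NOUN": ["horse", "chinese", "signs"],
    "VERB": ["grazes", "signs", "are"],
    "ADJ":  ["chinese", "manifold"],
}

def categories_of(word):
    """Return every category under which a word is listed."""
    return [cat for cat, words in LEXICON.items() if word.lower() in words]

print(categories_of("signs"))   # ['NOUN', 'VERB']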
The grammar itself provides the framework for building sentences from phrases
which in turn are built from terminal symbols.
The individual grammar entries are called rewrite rules because they replace
part of one string to obtain another. They take a form similar to this
sample rule
S → NP VP
which expresses the fact that to build a full-fledged sentence S you need a
noun phrase NP and a verb phrase VP (or vice versa: you can combine an NP
and a VP to obtain an S). Of course there are analogous rewrite rules for NP,
VP and all other phrase types. There may also be several rewrite rules with
the same left-hand side symbol; e.g. in addition to the one above there might
exist a rule
S → NP VP CON NP VP,
where CON stands for a conjunction. This rule could also be expressed as
S → S CON S,
as any occurrence of S can be replaced by NP VP according to our first rule.
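Under the same illustrative assumptions as the lexicon sketch above, the rewrite rules can be written as (left-hand side, right-hand side) pairs; the name GRAMMAR and the exact rule set are assumptions, not the source's:

# Rewrite rules as (left-hand side, right-hand side) pairs. Several rules
# may share the same left-hand side symbol.
GRAMMAR = [
    ("S",  ["NP", "VP"]),
    ("S",  ["S", "CON", "S"]),   # S → S CON S
    ("NP", ["DET", "NOUN"]),
    ("VP", ["VERB"]),
]

# Applying S → NP VP to both occurrences of S in S → S CON S gives
# S → NP VP CON NP VP, which is why the two formulations are equivalent.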
3.2 - Parsing and Efficiency
Supposing that the perception work is done perfectly by the machine, we have to
analyze what the string means and how it can be used.
Effectively, we have to do the reverse job of generation during the analysis.
This delinearization of the input string is called parsing - finding out which
tokens represent which phrase and reconstructing the generation path that led
to the perceived sentence.
Because of the recursive structure of the rewrite rules, a hierarchical
structure would be appropriate - thus a tree called parse tree is used in which
the leaves, inner nodes and links represent tokens, phrases of any kind, and
applied rewrite rules, respectively.
As an example, consider the parse tree for the sentence "The horse grazes."
Six different rules were applied to build it:
1. S → NP VP
2. NP → DET NOUN
3. VP → VERB
4. DET → "the"
5. NOUN → "horse"
6. VERB → "grazes"
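The resulting tree can be written down directly as nested (label, children) pairs. This small Python sketch (the names parse_tree and leaves are illustrative assumptions) shows how the leaves, inner nodes and links map onto such a structure:

# The parse tree for "The horse grazes." as nested (label, children)
# pairs: inner nodes are phrases, leaves are the tokens themselves.
parse_tree = (
    "S", [
        ("NP", [("DET", ["the"]), ("NOUN", ["horse"])]),
        ("VP", [("VERB", ["grazes"])]),
    ],
)

def leaves(node):
    """Read the tokens back off the tree, left to right."""
    label, children = node
    if all(isinstance(child, str) for child in children):
        return children
    return [token for child in children for token in leaves(child)]

print(" ".join(leaves(parse_tree)))   # the horse grazes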
How are multiple legal interpretations to be handled?
For now, we will simply assume that the number of ambiguous phrases is
small enough to allow multiple parse trees to be stored in parallel. This
restriction is important because of the influence of the ambiguities: each
one leads to exponential growth in the number of possible successive
trees.
3.2.1 - A Simple Parser
A very intuitive version of an algorithm to recover the grammatical structure
from the linear word string would be the following: iterating over all symbols
in the string, the algorithm searches the right-hand sides of all rules for
possible matching subsequences, replaces the matched symbols with the left-hand
side of the applicable rule as a node and appends the substituted symbols as its
children. The string in this algorithm is also called a parse forest. For our
horse example the parse trace would look like this:
forest                     matched subsequence    applied rule
The horse grazes.          "The"                  DET → "the"
DET horse grazes.          "horse"                NOUN → "horse"
DET NOUN grazes.           DET NOUN               NP → DET NOUN
NP grazes.                 "grazes"               VERB → "grazes"
NP VERB                    VERB                   VP → VERB
NP VP                      NP VP                  S → NP VP
S
This yields the parse tree given above for the sentence. However, this
method can easily be shown to be too simple for even slightly more
sophisticated grammars. Consider a phrase like
"The Chinese signs are manifold."
The algorithm from above might produce the following trace (assuming, of
course, that the lexicon has been extended accordingly):
forest                            matched subsequence    applied rule
The Chinese signs are manifold.   "The"                  DET → "the"
DET Chinese signs are manifold.   "Chinese"              NOUN → "Chinese"
DET NOUN signs are manifold.      DET NOUN               NP → DET NOUN
NP signs are manifold.            "signs"                VERB → "signs"
NP VERB are manifold.             VERB                   VP → VERB
NP VP are manifold.               NP VP                  S → NP VP
S are manifold.
Obviously, this is not a valid parse, because you cannot append "are manifold"
to a complete sentence in a grammatically correct way. This is a situation
where the parser has committed an error and has to go back (backtrack) in
order to undo this and try another possible parse.
But more dramatic examples are easy to formulate.
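To make the problem tangible, here is a minimal Python sketch of the greedy substitution strategy described above, with the lexicon entries written as ordinary rewrite rules. The rule list, the function name parse_greedily and the two extra phrase rules NP → DET ADJ NOUN and VP → VERB ADJ (added only so that a correct parse of the second sentence exists at all) are illustrative assumptions, not part of the original text:

# Greedy substitution parsing: always apply the first rule whose
# right-hand side occurs in the forest, and never undo a replacement.
RULES = [
    ("DET",  ["the"]),
    ("NOUN", ["horse"]), ("NOUN", ["chinese"]), ("NOUN", ["signs"]),
    ("VERB", ["grazes"]), ("VERB", ["signs"]), ("VERB", ["are"]),
    ("ADJ",  ["chinese"]), ("ADJ", ["manifold"]),
    ("NP",   ["DET", "NOUN"]), ("NP", ["DET", "ADJ", "NOUN"]),  # second rule assumed
    ("VP",   ["VERB"]), ("VP", ["VERB", "ADJ"]),                # second rule assumed
    ("S",    ["NP", "VP"]),
]

def parse_greedily(sentence):
    forest = sentence.lower().split()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            for i in range(len(forest) - len(rhs) + 1):
                if forest[i:i + len(rhs)] == rhs:
                    forest[i:i + len(rhs)] = [lhs]   # replace, never undone
                    print(" ".join(forest))          # one line of the trace
                    changed = True
                    break
            if changed:
                break
    return forest

parse_greedily("The horse grazes")               # reduces all the way to ['S']
parse_greedily("The Chinese signs are manifold")
# gets stuck before reaching a single S: the point where a real parser
# would have to backtrack and try the other readings of "Chinese"/"signs".

A real parser would either backtrack out of such dead ends or keep the alternative parse forests in parallel, as discussed above.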
Natural Language Processing | Project of Multimedia Systems | EECS 579 | update: 22/12/2000
Daniele Quercia