A NEW WAY OF PARSING NATURAL LANGUAGE
Gîsca Veronica,
Institute of Mathematics and Computer Science of the Academy of Sciences of Moldova,
Republic of Moldova, Chișinău, 5 Academiei Street
Abstract – In this paper we show that the Earley algorithm can be used to build a system in which
application developers can specify syntactic and semantic descriptions of natural languages using
the notation of attribute grammars. Attribute grammars are a formalization of the augmented
context-free grammars characteristic of most current natural language processing systems. In
contrast with purely syntactic grammars, attribute grammars are capable of describing features of
natural language.
Keywords – natural language processing, parser, context-free grammars, attribute grammars, Earley
parser.
I. Introduction
Parsing can be done in three stages. The first stage is token generation, or lexical analysis, in
which the input character stream is split into meaningful symbols defined by a grammar of regular
expressions. The next stage is parsing, or syntactic analysis, which involves checking that the tokens
form an allowable expression. This is usually done with reference to a context-free grammar that
recursively defines the components which can make up an expression and the order in which they must
appear. The final stage is semantic parsing, or semantic analysis, which involves working out the
implications of the expression just validated and taking the appropriate action. Attribute grammars can
also be used to define these actions.
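As a brief illustration of the first, lexical stage, the following Python sketch splits an input string into tagged tokens using regular expressions; the token categories and word lists are placeholders chosen for the example, not the lexicon used by the system described here. The syntactic and semantic stages are handled by the attributed Earley parser presented in the later sections.

import re

# Illustrative lexical stage: the categories and word lists are placeholders,
# not the lexicon of the system described in this paper.
TOKEN_PATTERNS = [
    ("art", r"the|a|an"),
    ("n",   r"dog|dogs|cat|cats"),
    ("v",   r"sees|see|runs|run"),
]

def tokenize(text):
    # Split the character stream into (category, word) tokens.
    tokens = []
    for word in text.lower().split():
        for category, pattern in TOKEN_PATTERNS:
            if re.fullmatch(pattern, word):
                tokens.append((category, word))
                break
    return tokens

print(tokenize("The dog sees a cat"))
# [('art', 'the'), ('n', 'dog'), ('v', 'sees'), ('art', 'a'), ('n', 'cat')]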
Attribute grammars were developed by D. Knuth as a tool for the specification of formal languages,
as an extension of context-free grammars. Semantic rules and attributes are added to context-free
grammars, augmenting their expressive capabilities.
In this paper we formalize the analysis of natural language in an equivalent attribute-grammar form,
and we extend Earley's parsing algorithm so that it can evaluate attributes according to the semantic
rules during parsing.
The paper is organized as follows: Section 2 gives a brief description of the top-down parsing
strategy. Section 3 presents preliminary elements, including the notion of an attribute grammar. Section
4 describes the extension of the Earley algorithm to attribute grammars. Finally, a conclusion is
drawn.
II. Top-down parsing
The task of the parser is to determine whether and how a given sentence can be derived from the
grammar. This can be done in two ways: top-down parsing and bottom-up parsing.
A top-down process satisfies a fundamental requirement for effective software development.
Components of a top-down process can be easily added or edited without affecting the rest of the
system [1].
In this way, the top-down parser can be constructed piecewise. In our top-down system, a new
component consisting of syntactic rules and the related semantic rules can be developed and tested
separately and then added as a new alternative.
In comparison, a bottom-up parsing technique requires maintaining an action table, which needs
global change whenever a new rule is added. In order to accommodate the semantic rules associated
with a new syntax rule, the developer needs to make extensive changes to the whole parser.
III. Attribute Grammars
Attribute grammars (AG) were originally devised for building compilable parsers for formal
languages. In an attribute grammar, semantic rules are added to the syntax rules of a context-free
grammar to describe the meaning of the sentences of a context-free language. AGs are an extension of
context-free grammars in which attributes are associated with grammar symbols and semantic rules
define the values of the attributes.
Thus, certain aspects of natural language, such as concord (agreement) between words, subcategories,
etc., can be easily modeled.
In an attribute grammar, a set of attributes is attached to each symbol. The attribute values are
calculated according to rules attached to the grammar productions, called semantic rules. A semantic
rule defines the computation of either an attribute of the symbol on the left-hand side of a production
– in which case the attribute is called synthesized – or an attribute of a symbol on the right-hand
side of the production – in which case the attribute is called inherited [2], [3].
Thus, an attribute grammar is built from a context-free grammar (CFG) G and is defined as follows.
An AG is a 3-tuple AG = (G, A, VS), where G is a context-free grammar, A is a finite set of
attributes, and VS is a finite set of semantic rules.
A context-free grammar (CFG) is a 4-tuple
G = (VT, VN, P, S),
where
VN – the alphabet of nonterminal symbols;
VT – the alphabet of terminal symbols;
P – the set of productions of the form A → α, where α ∈ (VT ∪ VN)*;
S – the axiom (start symbol).
VT = {v (verb), n (noun), adj (adjective), pron (pronoun), num (numeral), adv (adverb), art (article),
pp (preposition), interj (interjection), conj (conjunction)};
VN = {NP (noun phrase), VP (verb phrase), ADJP (adjectival phrase), PP (prepositional phrase),
ADVP (adverbial phrase)} [4].
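As a minimal sketch, the 4-tuple G = (VT, VN, P, S) could be represented in Python as below. The handful of productions shown (e.g. S → NP VP, NP → art n) are illustrative examples consistent with the symbol sets above, not the full grammar of the system, and the axiom S is added to the listed phrase categories.

# Terminal and nonterminal alphabets from the definition above.
VT = {"v", "n", "adj", "pron", "num", "adv", "art", "pp", "interj", "conj"}
VN = {"S", "NP", "VP", "ADJP", "PP", "ADVP"}   # the axiom S added to the listed phrase types

# A few illustrative productions A -> alpha with alpha in (VT ∪ VN)*.
P = [
    ("S",  ("NP", "VP")),
    ("NP", ("art", "n")),
    ("NP", ("n",)),
    ("VP", ("v", "NP")),
    ("VP", ("v",)),
]

S = "S"            # axiom (start symbol)
G = (VT, VN, P, S)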
To define the attribute set, the computational linguistic resources developed at the Institute of
Mathematics and Computer Science of the Academy of Sciences of Moldova were used. They
contain a database with word-level linguistic information. The lexicon gives information about the
morphological categories of the parts of speech and their possible syntactic functions.
An attribute a associated with a symbol A is denoted A.a. Attribute evaluation rules of the form
A.a → f(X.b, Y.c, …) are listed following each production rule [5].
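For illustration, semantic rules of this form can be attached to a production as plain functions over the attributes of its right-hand-side symbols. The 'number' attribute and the agreement condition below are hypothetical, chosen only to show the A.a → f(X.b, Y.c, …) shape, not rules taken from the actual lexicon.

# Hypothetical production NP -> art n with one synthesized rule and one condition.
# 'attrs' maps each right-hand-side symbol to its attribute dictionary.

def np_number(attrs):
    # Synthesized rule: NP.number = f(n.number)
    return attrs["n"]["number"]

def np_agreement(attrs):
    # Condition on attributes: the article and the noun must agree in number.
    return attrs["art"]["number"] == attrs["n"]["number"]

NP_RULE = {
    "lhs": "NP",
    "rhs": ("art", "n"),
    "synthesized": {"number": np_number},
    "conditions": [np_agreement],
}

# Example evaluation for "a dog": both attributes are singular, so the condition holds.
attrs = {"art": {"number": "sg"}, "n": {"number": "sg"}}
assert np_agreement(attrs) and np_number(attrs) == "sg"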
Synthesized and inherited semantic rules associated with a parser are mapped onto its starting and
ending positions, respectively, in the parser's result set. This facilitates the overall syntax-directed
evaluation and allows semantic rules to be expressed in terms of potentially unevaluated attributes
from the environment of the current parser, its predecessor, its successors, and its sibling parsers.
IV. The Earley algorithm
The syntactic analysis, or parsing, of a string of words may be seen as a process of searching for
a derivation tree. This may be achieved by starting from S and searching for a tree with the words
of the given phrase as leaves (top-down parsing). An efficient parsing algorithm is based on
dynamic programming: each time we analyze a phrase or a string of words, we store the
results so that we do not have to reanalyze it later. One algorithm that performs this operation is
the Earley algorithm, which runs in polynomial time and uses polynomial space. We
developed an extension of the Earley algorithm using the formalism of attribute grammars.
The Earley algorithm is a top-down dynamic programming algorithm. We use Earley's dot notation:
given a production X → xy, the notation X → x•y represents a condition in which x has already been
parsed and y is expected.
For every input position (which represents a position between tokens), the parser generates an
ordered state set. Each state is a tuple (X → x•y, i), consisting of:
• the production currently being matched (X → xy);
• the current position in that production (represented by the dot);
• the position i in the input at which the matching of this production began: the origin position [6].
The state set at input position k is called S(k). The parser is seeded with S(0), consisting of only the
top-level rule. A state is denoted by i: (X(0) → X(1) … • X(j) … X(n), k), where i is the index of the
state, k refers to the state set from which this production was expanded (its origin), and • marks the
current position in the production. A list of attributes and their values is stored for each symbol in
the state. Given the next input symbol, the parser iteratively performs one of the following three
operations for each state in the current state set: prediction, scanning, and completion [6].
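One possible Python representation of such a state keeps the dotted production, the origin position, and the attribute values collected so far; the field names below are our own, chosen for the sketch, not taken from the paper.

from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    lhs: str           # X(0), the left-hand side of the dotted production
    rhs: tuple         # (X(1), ..., X(n)), the right-hand side
    dot: int           # position of the bullet within the right-hand side
    origin: int        # index k of the state set where this production was predicted
    attrs: tuple = ()  # attribute values collected so far, one entry per parsed symbol

    def next_symbol(self):
        # Symbol immediately after the dot, or None if the state is complete.
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None

    def is_complete(self):
        return self.dot == len(self.rhs)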
Prediction: when the symbol after the dot, X(j), is a nonterminal, evaluate its set of inherited
attributes. For every state in S(k) of the form (X(0) → X(1) … • X(j) … X(n), i), where n is the length
of the right-hand side, add (X(j) → • X(l) …, k) to S(k) for every production X(j) → X(l) … in the
grammar with X(j) on the left-hand side.
Scanning: when X(j) is a terminal and matches the next input symbol b(k+1), copy the current state to
set S(k+1), moving the dot to the right of X(j) and assigning to X(j) all the attributes that come with
b(k+1).
Completion: if the dot is in the rightmost position, check all conditions on the attributes in the
state. If the conditions are satisfied, evaluate all synthesized attributes. Then, for each state in the
origin set whose dot immediately precedes the nonterminal A = X(0), copy that state to set S(k), assign
the attributes to A, and move the dot to the right of A.
1. For each production S → x, create the item (S → •x, 0, 0).
2. For j = 0 to n (n is the length of the input sentence):
3. For each item of the form (X → x•Yy, i, j), apply the Predictor operation while new items are
created.
4. For each item of the form (Y → z•, i, j), apply the Completer operation while new items are created.
5. For each item of the form (X → x•wy, i, j), apply the Scanner operation.
6. If we find an item of the form (S → x•, 0, n), the sentence is accepted.
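The following Python sketch is one possible reading of the attributed Earley recognizer outlined above, under simplifying assumptions: a toy grammar and lexicon, and a single 'number' attribute propagated as the only semantic information. It is an illustration of the technique, not the implementation developed at the Institute.

# Minimal attributed Earley recognizer (illustrative grammar, lexicon, attributes).
GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("art", "n"), ("n",)],
    "VP": [("v", "NP"), ("v",)],
}

LEXICON = {                      # word -> (terminal category, number attribute)
    "the": ("art", "any"), "a": ("art", "sg"),
    "dog": ("n", "sg"), "dogs": ("n", "pl"), "cat": ("n", "sg"),
    "sees": ("v", "sg"), "see": ("v", "pl"),
}

def agree(values):
    # Attribute condition: all values agree, ignoring the unrestricted value 'any'.
    fixed = [v for v in values if v != "any"]
    return len(set(fixed)) <= 1

def synthesize(values):
    # Synthesized attribute of a completed nonterminal.
    fixed = [v for v in values if v != "any"]
    return fixed[0] if fixed else "any"

def recognize(words, start="S"):
    tokens = [LEXICON[w] for w in words]
    n = len(tokens)
    # chart[k] = state set S(k); a state is (lhs, rhs, dot, origin, attrs),
    # where attrs collects one attribute value per symbol parsed so far.
    chart = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, rhs, 0, 0, ()))

    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, origin, attrs = agenda.pop()
            nxt = rhs[dot] if dot < len(rhs) else None

            if nxt in GRAMMAR:                              # Prediction
                for prod in GRAMMAR[nxt]:
                    new = (nxt, prod, 0, k, ())
                    if new not in chart[k]:
                        chart[k].add(new)
                        agenda.append(new)

            elif nxt is not None:                           # Scanning
                if k < n and tokens[k][0] == nxt:
                    chart[k + 1].add((lhs, rhs, dot + 1, origin,
                                      attrs + (tokens[k][1],)))

            else:                                           # Completion
                if not agree(attrs):                        # attribute condition fails
                    continue
                value = synthesize(attrs)
                for p_lhs, p_rhs, p_dot, p_origin, p_attrs in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        new = (p_lhs, p_rhs, p_dot + 1, p_origin, p_attrs + (value,))
                        if new not in chart[k]:
                            chart[k].add(new)
                            agenda.append(new)

    # Accept if a completed start item spanning the whole input satisfies agreement.
    return any(l == start and d == len(r) and o == 0 and agree(a)
               for l, r, d, o, a in chart[n])

print(recognize("the dog sees a cat".split()))   # True
print(recognize("the dogs sees a cat".split()))  # False: number agreement fails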
After implementing the Earley algorithm using our proposed method, it was seen that the
algorithm can easily generate the parse tree for a sentence if the sentence structure satisfies the
grammar rules.
Conclusion
This paper presents the Earley algorithm extended with attribute grammars. Specifically, semantic rules
and attributes were added to context-free grammars, augmenting their expressive capabilities and
thereby offering many advantages in the domains of language disambiguation, analysis, and
translation.
References
1. D. Jurafsky. "Speech and Language Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition." 2006.
2. D. E. Knuth. "Semantics of context-free languages." Mathematical Systems Theory, vol. 2, no. 2,
pp. 127-145, 1968.
3. G. Grigoraș. "Proiectarea compilatoarelor." Editura Universității A. I. Cuza din Iași, 2007, 123 pp.
4. F. Hristea. "Introducere în procesarea limbajului natural cu aplicații în Prolog." Editura
Universității din București, 2000, 309 pp.
5. D. E. Knuth. "Semantics of Context-Free Languages." Mathematical Systems Theory, vol. 2, no. 2,
pp. 127-145, 1968.
6. J. Earley. "An efficient context-free parsing algorithm." Pennsylvania, 1989, 145 pp.