A NEW WAY OF PARSING NATURAL LANGUAGE

Gîsca Veronica, Institute of Mathematics and Computer Science of the Academy of Sciences of Moldova, Republic of Moldova, Chișinău, 5 Academiei Street

Abstract – In this paper we show that the Earley algorithm can be used to build a system in which application developers specify syntactic and semantic descriptions of natural languages using the notation of attribute grammars. Attribute grammars are a formalization of the augmented context-free grammars characteristic of most current natural language processing systems. In contrast with purely syntactic grammars, attribute grammars are capable of describing features of natural language.

Keywords – natural language processing, parser, context-free grammars, attribute grammars, Earley parser.

I. Introduction

Parsing can be done in three stages. The first stage is token generation, or lexical analysis, in which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. The next stage is parsing, or syntactic analysis, which checks that the tokens form an allowable expression. This is usually done with reference to a context-free grammar that recursively defines the components that can make up an expression and the order in which they must appear. The final stage is semantic parsing, or semantic analysis, which works out the implications of the expression just validated and takes the appropriate action. Attribute grammars can also be used to define these actions.

Attribute grammars were developed by D. Knuth as a tool for the specification of formal languages, as an extension of context-free grammars. Semantic rules and attributes are added to context-free grammars, augmenting their expressive power.

In this paper we formalize the analysis of natural language in an equivalent attribute-grammar form and we extend Earley's parsing algorithm so that it can simultaneously evaluate attributes according to the semantic rules. The paper is organized as follows: Section 2 gives a brief description of the top-down parsing strategy. Section 3 presents preliminary elements, including the notion of attribute grammar. Section 4 describes the extension of the Earley algorithm to attribute grammars. Finally, a conclusion is drawn.

II. A top-down parsing

The task of the parser is to determine whether and how a given sentence can be derived from the grammar. This can be done in two ways: top-down parsing and bottom-up parsing. A top-down process satisfies a fundamental requirement for effective software development: its components can easily be added or edited without affecting the rest of the system [1]. In this way, the top-down parser can be constructed piecewise. In our top-down system, a new component consisting of syntactic rules and the related semantic rules can be developed and tested separately and then added as a new alternative. In comparison, a bottom-up parsing technique requires maintaining an action table, which needs global changes whenever a new rule is added. In order to accommodate the semantic rules associated with a new syntax rule, the developer would have to make extensive changes to the whole parser.

III. Attribute Grammars

Attribute grammars (AG) were originally introduced for building compilable parsers for formal languages. In an attribute grammar, the syntax rules of a context-free grammar are augmented with semantic rules that describe the meaning of the sentences of the context-free language. AG are an extension of context-free grammars in which attributes are associated with grammar symbols and semantic rules define the values of the attributes. Thus, certain aspects of natural language, such as agreement between words, subcategories, etc., can easily be modelled. In an attribute grammar, a set of attributes is attached to each symbol. The attribute values are computed according to the rules attached to grammar productions, called semantic rules. A semantic rule defines the computation either of an attribute of the symbol on the left-hand side of a production, in which case the attribute is called synthesized, or of an attribute of a symbol on the right-hand side of the production, in which case the attribute is called inherited [2], [3].

An AG is thus built on top of a context-free grammar (CFG) G and is defined as a 3-tuple AG = (G, A, VS), where G is a context-free grammar, A is a finite set of attributes and VS is a finite set of semantic rules. A context-free grammar is a 4-tuple G = (VT, VN, P, S), where VT is the terminal alphabet, VN is the nonterminal alphabet, P is the set of productions of the form A → α with α ∈ (VT ∪ VN)*, and S is the axiom. Here VT = {v (verb), n (noun), adj (adjective), pron (pronoun), num (numeral), adv (adverb), art (article), pp (preposition), interj (interjection), conj (conjunction)} and VN = {NP (noun phrase), VP (verb phrase), ADJP (adjectival phrase), PP (prepositional phrase), ADVP (adverb phrase)} [4].

To define the set of attributes, the computational linguistic resources developed at the Institute of Mathematics and Computer Science of the Academy of Sciences of Moldova were used. They contain a database with word-level linguistic information. The lexicon gives information about the morphological categories of words and their possible syntactic functions. An attribute a associated with a symbol A is denoted A.a. Attribute evaluation rules of the form A.a = f(X.b, Y.c, …) are listed after each production rule [5]. Synthesized and inherited semantic rules associated with a parser are mapped onto the starting and ending positions, respectively, in the parser's result set. This facilitates the overall syntax-directed evaluation and allows semantic rules to be expressed in terms of potentially unevaluated attributes from the environment of the current parser, its predecessor, successors, and sibling parsers.
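As an illustration only (our own sketch, not part of the described system), the following Python fragment shows how one production of such a grammar, NP → art n, could be encoded together with a semantic rule that checks number agreement between article and noun and synthesizes NP.number; the names Symbol, Production and np_rule are hypothetical.

# Illustrative sketch (not the paper's implementation): the production
# NP -> art n of the attribute grammar, with a semantic rule that checks
# number agreement and synthesizes NP.number from the noun.

from dataclasses import dataclass, field

@dataclass
class Symbol:
    name: str                                   # e.g. "NP", "art", "n"
    attrs: dict = field(default_factory=dict)   # attribute values, e.g. {"number": "sg"}

@dataclass
class Production:
    lhs: str             # left-hand side nonterminal
    rhs: list            # names of the right-hand side symbols
    rule: object = None  # semantic rule: RHS symbols -> synthesized attributes of LHS, or None

# NP -> art n : NP.number = n.number, provided that art.number == n.number
def np_rule(rhs_symbols):
    art, n = rhs_symbols
    if art.attrs.get("number") != n.attrs.get("number"):
        return None                          # agreement condition violated
    return {"number": n.attrs["number"]}     # synthesized attribute NP.number

np_production = Production(lhs="NP", rhs=["art", "n"], rule=np_rule)

# "the dogs": art(number=pl), n(number=pl)
tokens = [Symbol("art", {"number": "pl"}), Symbol("n", {"number": "pl"})]
print(np_production.rule(tokens))            # -> {'number': 'pl'}

In the system described in this paper the attribute values and conditions would come from the lexical database mentioned above rather than being hard-coded as in this sketch.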
IV. Earley algorithm

The syntactic analysis, or parsing, of a string of words may be seen as a process of searching for a derivation tree. This may be achieved either by starting from S and searching for a tree whose leaves are the words of the given phrase (top-down parsing), or by starting from the words and building the tree upwards towards S (bottom-up parsing). An efficient parsing algorithm can be based on dynamic programming: each time we analyse a phrase or a string of words, we store the result so that it does not have to be reanalysed later. One algorithm that performs this kind of bookkeeping is the Earley algorithm, which runs in polynomial time and uses polynomial space. We developed an Earley algorithm using the formalism of attribute grammars.

The Earley algorithm is a top-down dynamic programming algorithm. We use Earley's dot notation: given a production X → xy, the notation X → x•y represents a state in which x has already been parsed and y is expected. For every input position (which represents a position between tokens), the parser generates an ordered state set. Each state is a tuple (X → x•y, i) consisting of:
• the production currently being matched (X → xy);
• the current position in that production (represented by the dot);
• the position i in the input at which the matching of this production began: the origin position [6].

The state set at input position k is called S(k). The parser is seeded with S(0), consisting of only the top-level rule. A state is denoted by

i: X(0) → X(1) … • X(j) … X(n), k

where i is the index of the state set, k refers to the state set from which this production was expanded, and • marks the current position in the production. A list of attributes and their values is stored for each symbol in the state. Given the next input symbol, the parsing algorithm iteratively performs one of the following three operations for each state in the current state set: prediction, scanning, and completion [6].

Prediction: when X(j) is a nonterminal, evaluate its set of inherited attributes. For every state in S(k) of the form (X(0) → X(1) … • X(j) … X(n), j) (where n is the number of symbols on the right-hand side), add (X(j) → • X(l) …, k) to S(k) for every production in the grammar with X(j) on the left-hand side (X(j) → X(l) …).

Scanning: when X(j) is a terminal and matches the next input symbol b(k+1), copy the current state to S(k+1), moving the dot to the right of X(j) and assigning to X(j) all the attributes that come with b.

Completion: if the dot is in the rightmost position, check all conditions on the attributes in the state. If the conditions are satisfied, evaluate all synthesized attributes. Then, for each state in set k whose dot stands immediately before A, where A = X(0), copy that state to set i, assign the computed attributes to A, and move the dot to the right of A.

The overall algorithm is the following:
1. For each production S → x, create the item (S → •x, 0, 0).
2. For j = 0 to n (where n is the length of the input sentence):
3. For each item of the form (X → x•Yy, i, j), apply the Predictor operation as long as new items are created.
4. For each item of the form (Y → z•, i, j), apply the Completer operation as long as new items are created.
5. For each item of the form (X → x•wy, i, j), apply the Scanner operation.
6. If an item of the form (S → x•, 0, n) is found, the input is accepted.

After implementing the Earley algorithm with the proposed method, we have seen that it can easily generate the parse tree of a sentence whenever the sentence structure satisfies the grammar rules.
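To show how the chart, the state tuples and the three operations fit together, here is a minimal Earley recogniser over the part-of-speech alphabet VT of Section III. It is our own sketch under simplifying assumptions: the toy grammar and all identifiers are illustrative, and attribute evaluation is only indicated by comments at the points where Prediction, Scanning and Completion would compute inherited and synthesized values.

# Minimal Earley recogniser sketch (our illustration, not the authors' code).
# An item is (lhs, rhs, dot, origin); chart[k] plays the role of the state set S(k).

GRAMMAR = {                                  # toy grammar, for illustration only
    "S":  [["NP", "VP"]],
    "NP": [["art", "n"]],
    "VP": [["v", "NP"], ["v"]],
}

def earley_recognise(tokens, grammar, start="S"):
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:               # seed S(0) with the top-level rule
        chart[0].add((start, tuple(rhs), 0, 0))

    for k in range(n + 1):
        changed = True
        while changed:                       # repeat until no new items appear
            changed = False
            for (lhs, rhs, dot, origin) in list(chart[k]):
                if dot < len(rhs) and rhs[dot] in grammar:
                    # Prediction: expand the nonterminal after the dot
                    # (inherited attributes would be evaluated here).
                    for prod in grammar[rhs[dot]]:
                        item = (rhs[dot], tuple(prod), 0, k)
                        if item not in chart[k]:
                            chart[k].add(item)
                            changed = True
                elif dot == len(rhs):
                    # Completion: the production is fully matched
                    # (conditions checked, synthesized attributes computed here).
                    for (l2, r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[k]:
                                chart[k].add(item)
                                changed = True
        if k < n:
            for (lhs, rhs, dot, origin) in chart[k]:
                # Scanning: a terminal after the dot matches the next token
                # (the token's attributes would be copied to the symbol here).
                if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == tokens[k]:
                    chart[k + 1].add((lhs, rhs, dot + 1, origin))

    # accept if some start production is complete in S(n) with origin 0
    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[n])

# part-of-speech tags as input, e.g. for "the dog sees a cat"
print(earley_recognise(["art", "n", "v", "art", "n"], GRAMMAR))   # -> True

Extending this recogniser to the attribute grammar of Section III amounts to storing, with each item, the attribute values of its symbols and executing the corresponding semantic rules at the marked points, exactly as described in the Prediction, Scanning and Completion steps above.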
Conclusion

This paper presents the Earley algorithm extended with attribute grammars. Specifically, semantic rules and attributes were added to context-free grammars, augmenting their expressive power and therefore offering many advantages in the domains of language disambiguation, analysis and translation.

References
1. D. Jurafsky. "Speech and Language Processing: An Introduction to Speech Recognition, Computational Linguistics and Natural Language Processing." 2006.
2. D. E. Knuth. "Semantics of Context-Free Languages." Mathematical Systems Theory, vol. 2, no. 2, pp. 127-145, 1968.
3. G. Grigoraș. "Proiectarea compilatoarelor." Editura Universității A. I. Cuza din Iași, 2007, pp. 123.
4. F. Hristea. "Introducere în procesarea limbajului natural cu aplicații în Prolog." Editura Universității din București, 2000, pp. 309.
5. D. E. Knuth. "Semantics of Context-Free Languages." Mathematical Systems Theory, vol. 2, no. 2, pp. 127-145, 1968.
6. J. Earley. "An Efficient Context-Free Parsing Algorithm." Pennsylvania, 1989, pp. 145.