CIS 324: Language Design and Implementation

Syntax Analysis

1. Context-Free Grammars (CFG)

A context-free grammar is a finite set of variables (nonterminals), each of which represents a language. The languages represented by the nonterminals are described recursively in terms of each other and of primitive symbols called terminals. The rules relating the nonterminals are called productions. A production states that the language associated with a given nonterminal contains strings that are formed by concatenating strings from the languages of certain other nonterminals, possibly along with some terminals.

A context-free grammar has four components:
- terminals: basic symbols from which sentences are formed;
- nonterminals: syntactic variables that denote sets of strings and define sentences by imposing hierarchical structure;
- productions: rules that specify how terminals and nonterminals are combined into sentences;
- a start symbol.

A context-free grammar is formally represented by the tuple:

CFG = ( N, T, P, s )

where:
- N and T are finite sets of nonterminals and terminals (N and T are assumed to be disjoint);
- P is a finite set of productions (each production has the form A → α, where A is a nonterminal and α is a string of grammar symbols);
- s is the start symbol.

Notational conventions:
- terminals are usually denoted by:
  - lower-case letters early in the alphabet: a, b, c;
  - operator symbols: +, -, *, etc.;
  - punctuation symbols: (, ), {, }, ; etc.;
  - digits: 0, 1, 2, ..., 9;
  - boldface strings: if, else, etc.;
- nonterminals are usually denoted by:
  - upper-case letters early in the alphabet: A, B, C;
  - the letter S, representing the start symbol;
  - lower-case italic names: expr, stmt, etc.;
- grammar symbols, that is, either terminals or nonterminals, are represented by upper-case letters late in the alphabet: X, Y, Z;
- strings of terminals only are represented by lower-case letters late in the alphabet: u, v, w, ..., z;
- productions are represented in the following way: A → α1, A → α2 etc.
- alternatives in productions are represented: A → α1 | α2 etc.

2. Derivations

Derivations explain how a grammar defines a language: αAβ ⇒ αγβ is a derivation step if A → γ is a production and α and β are strings of grammar symbols.

If α1 ⇒ α2 ⇒ ... ⇒ αn then α1 derives αn.

The symbol ⇒ means derives in one step, ⇒* means derives in zero or more steps, and ⇒+ means derives in one or more steps.

Axioms:
1. α ⇒* α for any string α
2. if α ⇒* β and β ⇒* γ then α ⇒* γ

A grammar G defines a language L(G): a string of terminals w is in L(G) if and only if S ⇒+ w. Such a string w is called a sentence of G.

If S ⇒* α, where α may contain nonterminals, then α is a sentential form of G. A sentence is a sentential form without nonterminals.

Example (the steps use the productions E → - E, E → ( E ), E → E + E and E → id):

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id )
E ⇒+ - ( id + E ) ⇒ - ( id + id )

2.1 Leftmost Derivations

Derivations in which only the leftmost nonterminal in any sentential form is replaced at each step are called leftmost, written α ⇒lm β.

If α = wAγ and β = wδγ then α ⇒lm β, where A → δ is a production, w is a string of terminals, and γ is a string of grammar symbols.

To emphasize the fact that α derives β by a leftmost derivation we may write α ⇒lm* β.

If S ⇒lm* α then α is a left-sentential form of the grammar.

2.2 Rightmost Derivations

Derivations in which only the rightmost nonterminal in any sentential form is replaced at each step are called rightmost, written α ⇒rm β.

If α = γAw and β = γδw then α ⇒rm β, where A → δ is a production, w is a string of terminals, and γ is a string of grammar symbols.

To emphasize the fact that α derives β by a rightmost derivation we may write α ⇒rm* β.

If S ⇒rm* α then α is a right-sentential form of the grammar.

3. Parse Trees

A parse tree is a graphical representation of a derivation that abstracts away the order in which the nonterminals are replaced. Each internal node of the parse tree is labeled by a nonterminal, and its children are labeled, from left to right, by the symbols on the right-hand side of the production by which that nonterminal is replaced in the derivation.

[Figure: parse tree with root E, internal nodes labeled E, T and F, the operators + and *, and three const leaves with the values 3, 1 and 2.]

4. Elimination of Left Recursion

The left-recursive pair of productions

A → Aα | β

can be replaced by the non-left-recursive productions

A → βA'
A' → αA' | ε

without changing the set of strings derivable from A.

The left-recursive productions

A → Aα1 | Aα2 | Aα3 | ... | Aαm | β1 | β2 | β3 | ... | βn

can be replaced by the non-left-recursive productions

A → β1A' | β2A' | ... | βnA'
A' → α1A' | α2A' | ... | αmA' | ε

Algorithm for Elimination of Left Recursion

Initialize: Arrange the nonterminals in some order A1, A2, ..., An

Repeat:
for i := 1 to n do
    for j := 1 to i - 1 do
        replace any production of the form Ai → Ajγ
        by the productions Ai → δ1γ | δ2γ | ... | δkγ
        where Aj → δ1 | δ2 | ... | δk are the current Aj-productions
    end
    eliminate the immediate left recursion among the Ai-productions
    ( A → βA' and A' → αA' | ε )
end

Example:

S → Aa | b
A → Ac | Sd | ε

for i = 1 nothing happens, because S has no immediately left-recursive production
for i = 2 we substitute the S-productions into A → Sd and obtain

A → Ac | Aad | bd | ε

after eliminating the immediate left recursion:

S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
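
The algorithm above can be sketched in Python as follows. This is a minimal sketch and not part of the original notes: the dictionary representation of the grammar, the function names eliminate_immediate and eliminate_left_recursion, and the use of an empty tuple for ε are assumptions made for illustration. It also assumes that the primed name (for example A') is not already used in the grammar and that the grammar has no production of the form A → A.

```python
# Assumed representation: a grammar maps each nonterminal to a list of
# production bodies; each body is a tuple of symbols; () stands for epsilon.
EPS = ()


def eliminate_immediate(nt, bodies):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn
    as A -> b1 A' | ... | bn A' and A' -> a1 A' | ... | am A' | epsilon."""
    # The alpha parts: bodies that start with A, with that leading A removed.
    alphas = [b[1:] for b in bodies if b and b[0] == nt]
    # The beta parts: bodies that do not start with A (epsilon included).
    betas = [b for b in bodies if not b or b[0] != nt]
    if not alphas:
        return {nt: list(bodies)}  # no immediate left recursion
    new_nt = nt + "'"  # hypothetical fresh name, assumed unused
    return {
        nt: [b + (new_nt,) for b in betas],
        new_nt: [a + (new_nt,) for a in alphas] + [EPS],
    }


def eliminate_left_recursion(grammar, order):
    """Apply the two nested loops of the algorithm for the given ordering
    A1, ..., An of the nonterminals. The input grammar is not modified."""
    g = {nt: list(bodies) for nt, bodies in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for body in g[ai]:
                if body and body[0] == aj:
                    # Ai -> Aj gamma becomes Ai -> delta gamma
                    # for every current production Aj -> delta.
                    expanded.extend(delta + body[1:] for delta in g[aj])
                else:
                    expanded.append(body)
            g[ai] = expanded
        g.update(eliminate_immediate(ai, g[ai]))
    return g


# The example from the notes: S -> Aa | b, A -> Ac | Sd | epsilon
grammar = {
    "S": [("A", "a"), ("b",)],
    "A": [("A", "c"), ("S", "d"), EPS],
}
result = eliminate_left_recursion(grammar, ["S", "A"])
for nt, bodies in result.items():
    alternatives = [" ".join(body) if body else "ε" for body in bodies]
    print(nt, "→", " | ".join(alternatives))
```

With the ordering S, A the sketch reproduces the result of the example: S → Aa | b, A → bdA' | A', A' → cA' | adA' | ε. A different ordering of the nonterminals generally yields a different, but equivalent, grammar.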