Coverage
• Programming Language Syntax: Syntax Specifications
• Stages in Translation: Processing Programs, Syntax Analysis, Semantic Analysis, Lexical Analyzer, Code Generation
• Regular Expressions, Finite Automata
• Grammar Types: Unrestricted, Context-Sensitive, Context-Free, Regular; BNF, EBNF
• Derivation: Parse Tree
• Grammar Issues: Ambiguous Grammars, Grammar Transformations
• Syntax Diagram, Recursive Descent Parsing, Shift-Reduce Parsing
• Concrete and Abstract Syntax
• LL Grammar and LR Grammar: SLR, LALR
• Programming the Scanner and Parser

Programming Language Syntax
• Syntax defines the structure of the language
• Syntax helps in:
  − Language design and language comprehension
  − Implementing (writing) the compiler, the software specification, and the language system as a whole
  − Verifying program correctness
• Definitions
  − Constructs: strings that belong to the language
  − Syntax: the form or structure of the expressions, statements, and program units as a whole
  − Semantics: what happens when a program segment executes; semantics gives the meaning of the statements, expressions, and program units
  − Pragmatics: the tools provided by the translator to help in debugging and in interacting with the operating system

Programming Language Syntax
• Lexeme: the lowest-level syntactic unit of a language (e.g., sum, begin)
• Token: a category of lexemes (e.g., identifiers)
• Every compiler needs recognizers for the syntax of its language
• Notations for expressions
  − Infix notation: the operator symbol appears between the operands
  − Prefix (Polish) notation: the operator symbol appears before the operands
  − Postfix (suffix, or reverse Polish) notation: the operator symbol appears after the operands
  − Mixfix notation: operations that do not fit the previous notations, such as if-then-else
• Associativity in expressions
  − Left-associative: subexpressions with the same operator, or with operators of the same precedence, are grouped from left to right
    • Example: +, -, * and /
  − Right-associative: subexpressions with the same operator, or with operators of the same precedence, are grouped from right to left
    • Example: the assignment symbol and exponentiation
• Expression trees and their evaluation
  − An expression can be represented as a tree whose root yields the result of the expression
  − A tree can be traversed in several ways:
    • In-order traversal: visit all nodes in the left subtree, then the root, and finally the nodes in the right subtree
    • Post-order traversal: visit all nodes in the left and right subtrees before visiting the root
    • Pre-order traversal: visit the root first, then the nodes of the left and right subtrees
    • Breadth-first (level-order) traversal: visit the tree level by level, finishing one level before moving to the next
    • Depth-first traversal: go down into each subtree before moving on to the next subtree; the order in which nodes are visited matches pre-order traversal
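The post-order traversal above is exactly the order in which an expression tree is evaluated: both operands are computed before the operator at the root is applied. A minimal C sketch of this idea (the node layout, helper names, and the (5 + 2) * 3 example are illustrative, not taken from the slides):

    #include <stdio.h>
    #include <stdlib.h>

    /* A node is either a leaf holding a value or an operator with two children. */
    typedef struct Node {
        char op;                /* '+', '-', '*', '/' for operators, 0 for leaves */
        int value;              /* used only when op == 0 */
        struct Node *left, *right;
    } Node;

    static Node *leaf(int v) {
        Node *n = malloc(sizeof *n);
        n->op = 0; n->value = v; n->left = n->right = NULL;
        return n;
    }

    static Node *binop(char op, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->op = op; n->value = 0; n->left = l; n->right = r;
        return n;
    }

    /* Post-order evaluation: evaluate the left subtree, then the right subtree,
       then apply the operator stored at the root. */
    static int eval(const Node *n) {
        if (n->op == 0) return n->value;
        int l = eval(n->left), r = eval(n->right);
        switch (n->op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            default:  return l / r;
        }
    }

    int main(void) {
        Node *t = binop('*', binop('+', leaf(5), leaf(2)), leaf(3));  /* (5 + 2) * 3 */
        printf("%d\n", eval(t));   /* prints 21 */
        return 0;
    }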
Programming Language Syntax
• Evaluation of expressions
  − Applicative-order evaluation (strict or eager evaluation): evaluation proceeds bottom-up, starting from the leaves and moving towards the root
  − Normal-order evaluation: an expression is evaluated only when it is needed in the computation of the result
    • Example: the call Addition(5+2) with the definition Addition(Y) { int Y; Y = Y + 2; }
    • Here Y is replaced by the expression 5+2 instead of the addition being done first
  − Lazy evaluation (delayed evaluation): evaluation is postponed until the value is really needed
    • Frequently used in functional languages
  − Block-order evaluation: evaluation of an expression that contains a declaration
    • Example: in Pascal, a function can contain a block expression that includes a variable declaration
• Evaluation of expressions (continued)
  − Short-circuit evaluation: when evaluating Boolean (logical) expressions, the result can often be determined after evaluating the expression only partially
    • AND (X AND Y): if both X and Y are "1", the result is "1"; otherwise the result is "0"
    • OR (X OR Y): if either or both of X and Y are "1", the result is "1"; otherwise the result is "0"
    • XOR (X XOR Y): if exactly one of X and Y is "1", the result is "1"; otherwise the result is "0"
    • NOT (X): if X is "1", the result is "0"; if X is "0", the result is "1"
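In C the && and || operators are short-circuited in exactly this way, and programs routinely rely on it for guard expressions. A small illustrative fragment (the variable names are invented for the example):

    #include <stdio.h>

    int main(void) {
        int *p = NULL;

        /* && stops after the first operand: p != NULL is false (0),
           so *p > 0 is never evaluated and no null dereference occurs. */
        if (p != NULL && *p > 0)
            printf("positive\n");
        else
            printf("null or non-positive\n");

        int x = 0;
        /* || stops after the first true operand: x == 0 is true (1),
           so 10 / x is never executed. */
        int result = (x == 0) || (10 / x > 1);
        printf("result = %d\n", result);   /* prints 1 */
        return 0;
    }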
Compilation Process
• The overall pipeline: source program, scanner (producing tokens), parser (producing a parse tree), semantic analysis (producing an abstract syntax tree), intermediate code generation (producing intermediate code), optional optimization, and finally code generation producing machine code
• The scanner and the parser together form the syntax analysis phase; the symbol table is consulted throughout
• Syntax analysis has a low-level part and a high-level part
  − Low-level part (scanner or lexical analyzer):
    • Mostly implemented using finite automata
    • Input symbols are scanned and grouped into meaningful units called tokens
    • Tokens are formed by the principle of longest substring (maximum match), using a lookahead pointer
  − High-level part (parser or syntax analyzer):
    • Described using Backus-Naur Form (BNF) or a context-free grammar
    • Tokens are grouped into syntactic units such as expressions, statements and declarations, and checked to see whether they conform to the grammatical rules of the language
• Identification of reserved words: use a lookup table (symbol table)
• if statement: "if" "(" "y" "<" "5" ")" …
  − y is a variable, < is an operator, and so on
  − Tokens are classified as keywords, operators, identifiers, literals, etc.
• Parser
  − The parser should find all syntax errors and produce the parse tree
  − Parsing algorithms:
    • Top-down: recursive descent (a coded implementation) and LL parsing (a table-driven implementation)
    • Bottom-up: LR parsing
• Why separate syntax analysis into a scanner and a parser?
  − Simplicity: separating them makes the parser simpler
  − Efficiency: the separation makes it possible to optimize the lexical analyzer
  − Portability: even though parts of the lexical analyzer may not be portable, the parser can always be made portable
• Semantic analysis (contextual analysis) is required to make sure that data types match
  − It works in synchronization with syntax analysis
  − Contextual analysis answers questions such as:
    • Has the variable been declared before it is used?
    • Does the declared type match the type at the point of use?
    • Has the variable been initialized before it is used?
    • Is the reference to the array within the bounds of the array?
    • …
• Code generation
  − Converts the program into executable machine code
  − Stages: intermediate code generation and code generation

Regular Expressions
• A regular expression is used to represent the information required by the lexical analyzer
• Regular expression definitions: the strings of a language L(E) defined over the alphabet of the language are described by a regular expression E
  − Alternation: if a and b are regular expressions, then (a+b) is also a regular expression
  − Concatenation (sequencing): if a and b are regular expressions, then (a.b) is also a regular expression
  − Kleene closure: if a is a regular expression, then a* denotes zero or more repetitions of a
  − Positive closure: if a is a regular expression, then a+ denotes one or more repetitions of a
  − Empty: the empty expression contains no strings
  − Atom: an atom denotes an expression containing exactly one string
• Regular expressions to match integers and floating-point numbers
  − To match a digit: [0-9]
  − To match one or more digits: [0-9]+
  − To support both signed and unsigned integers: -?[0-9]+
    • -? indicates the presence or absence of a minus sign
  − Floating-point representation, with the decimal part before the dot: ([0-9]*\.[0-9]+)
  − Exponent part: the character "e" (lower- or uppercase), optionally followed by a + or - sign, followed by an integer: ([eE][-+]?[0-9]+)?
    • The question mark at the end indicates that the exponent part is optional
  − Putting it together: -?(([0-9]+)|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?)
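This pattern can be tried out directly with the POSIX regular-expression API. A small C sketch (the grouping follows the slide's combined pattern, so an exponent is only accepted after the decimal form; the test strings are invented for the example):

    #include <stdio.h>
    #include <regex.h>

    int main(void) {
        /* -? (integer | decimal with optional exponent), anchored to the whole string */
        const char *pattern = "^-?(([0-9]+)|([0-9]*\\.[0-9]+)([eE][-+]?[0-9]+)?)$";
        const char *tests[] = { "42", "-7", "3.14", "2.5e-3", "abc", "1e5" };
        regex_t re;

        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
            fprintf(stderr, "bad pattern\n");
            return 1;
        }
        for (int i = 0; i < 6; i++)
            printf("%-8s %s\n", tests[i],
                   regexec(&re, tests[i], 0, NULL, 0) == 0 ? "matches" : "no match");
        regfree(&re);
        return 0;
    }

With this grouping "1e5" is rejected, because the exponent is attached only to the decimal alternative.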
Finite Automata
• Finite automata are computing devices that accept (recognize) the language described by a given regular expression
• Definitions
  − Alphabet (Σ): a finite, non-empty set of symbols. Symbols are written with lower-case letters and are atoms that cannot be subdivided further. Example: Σ = {a, b, c}
  − String (word): a sequence of symbols drawn from a single alphabet
    • Given the alphabet Σ = {a, b, c}, some of the strings that can be formed are a, abc, aa, abcabcabc
  − Empty string (ε): the string consisting of zero symbols; the empty string can be included in a language over any alphabet
  − Size of a string: the number of symbols in the string
    • |ab| = 2, |b| = 1, |ε| = 0
  − Concatenation of strings: strings can be combined to form new strings
    • If S1 = abc and S2 = def, then S1S2 = abcdef and S2S1 = defabc
    • Concatenating the empty string: S1ε = εS1 = abc = S1
    • The empty string is the identity element for string concatenation
  − Language (L): a (possibly infinite) set of strings over a given alphabet
    • Σ = {a, b, c}, L = {a^n b^n c^n | n ≥ 0}
    • In this example the numbers of a's, b's and c's are the same
  − Powers of an alphabet:
    • Σ^n denotes the set of all strings of length n over Σ
    • For Σ = {a, b, c}:
      − Σ^0 = {ε}
      − Σ^1 = {a, b, c}
      − Σ^2 = {aa, bb, cc, ab, ba, ac, ca, bc, cb}
      − Σ^3 = {aaa, bbb, ccc, aab, bba, aac, cca, …}
  − Closures of an alphabet:
    • Kleene (reflexive-transitive) closure: zero or more concatenations of symbols
      − Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ … = {ε, a, b, c, aa, bb, cc, ab, …}
    • Positive (transitive) closure: one or more concatenations of symbols
      − Σ+ = Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ … = {a, b, c, aa, bb, cc, ab, …}
    • Any language defined over the alphabet is a subset of its Kleene closure: for every L, L ⊆ Σ*
  − Empty language: a language that contains no strings
    • L = {} is the empty language
    • L = {ε} is not empty: it contains one string, the empty string

Finite Automata
• Representation
  − Circle: state; arrow: transition; double circle: final state
  − States are labelled with numbers
  − Arrows are labelled with a transition symbol or with ε
  − (Figures 2.2 to 2.6 show the NFAs for ε, for a single token t, for the concatenation XY, for the alternation X|Y, and for the closure X*.)
• DFA (deterministic finite automaton) vs. NFA (non-deterministic finite automaton)
  − In a DFA, ε-transitions are not allowed, and from any state s there is at most one edge labelled a for each input symbol a
• Converting an NFA to a DFA
  − Find the ε-closure of s:
    • Add s itself to its ε-closure, i.e. ε-closure(s) = {s}
    • Reachability through ε-transitions: if a node t is in ε-closure(s) and there is an edge labelled ε from t to u, then u is also added to ε-closure(s) if it is not there already; continue until no more nodes can be added to ε-closure(s)
  − State transitions:
    • From the initial ε-closure, find the transitions on the various terminals present in the given regular expression
    • If a node t is in ε-closure(s) and there is an edge labelled a (non-ε) from t to u, then u belongs to the new state; from u, add all the nodes that can be reached by ε-transitions
  − A transition table is drawn with the states as rows and the inputs as columns
  − The transition table can then be optimized (minimized):
    • Partition the set of states into non-final and final states
    • Within the group of non-final states:
      − A state whose transition leads outside the group is separated from the group
      − If two states have the same transitions on all inputs, keep one of them and replace every entry for the other with the one that is kept
      − Check for a dead state: a dead state is a non-final state whose transitions all lead back to itself, regardless of the input
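The ε-closure computation above is just a worklist search along the ε-edges. A compact C sketch, using the ε-edges of the NFA for (m|n)*mnn from the example that follows (the array-based adjacency-list layout is an assumption made for the sketch):

    #include <stdio.h>

    #define MAX_STATES 11
    #define MAX_EPS    4

    /* eps[s] lists the states reachable from s by one epsilon-edge; -1 ends each list. */
    static int eps[MAX_STATES][MAX_EPS] = {
        /* 0 */ {1, 7, -1}, /* 1 */ {2, 4, -1}, /* 2 */ {-1}, /* 3 */ {6, -1},
        /* 4 */ {-1},       /* 5 */ {6, -1},    /* 6 */ {1, 7, -1}, /* 7 */ {-1},
        /* 8 */ {-1},       /* 9 */ {-1},       /* 10 */ {-1}
    };

    /* Add every state reachable from the seed set through epsilon-edges alone. */
    static void eps_closure(int in_set[MAX_STATES]) {
        int stack[MAX_STATES], top = 0;
        for (int s = 0; s < MAX_STATES; s++)
            if (in_set[s]) stack[top++] = s;      /* start from the seed states */
        while (top > 0) {
            int s = stack[--top];
            for (int i = 0; i < MAX_EPS && eps[s][i] >= 0; i++) {
                int t = eps[s][i];
                if (!in_set[t]) {                 /* newly reached state */
                    in_set[t] = 1;
                    stack[top++] = t;
                }
            }
        }
    }

    int main(void) {
        int set[MAX_STATES] = {0};
        set[0] = 1;                               /* seed: {0} */
        eps_closure(set);
        printf("eps-closure({0}) =");
        for (int s = 0; s < MAX_STATES; s++)
            if (set[s]) printf(" %d", s);
        printf("\n");                             /* expected: 0 1 2 4 7 */
        return 0;
    }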
Finite Automata - Example
• Transitions for (m | n)*mnn
  − (Figure: the NFA for (m | n)*mnn, with states 0 to 10; state 10 is the final state.)
• Find the ε-closures:
  − Starting from state 0, the ε-transitions reach 0, 1, 2, 4 and 7, so A = {0, 1, 2, 4, 7}
  − On input m from A we reach {3, 8}. From node 3 the ε-transitions reach 6, 7, 1, 2 and 4, while from node 8 no further ε-transition is possible
  − So ε-closure({3, 8}) gives B = {1, 2, 3, 4, 6, 7, 8}
  − The transition of n on set A gives C = {1, 2, 4, 5, 6, 7}
  − The transition of n on set B gives D = {1, 2, 4, 5, 6, 7, 9}
  − The transition of n on set D gives E = {1, 2, 4, 5, 6, 7, 10}
  − Applying the transition of m on set C gives B again, so we stop here: every further transition leads back to sets that have already been found
• Transition table
  − (The transition table lists, for each of the states A to E, its successor state on inputs m and n.)
  − Non-final states: A, B, C, D; final state: E
  − Minimizing over the non-final states:
    • On input m all of them go to B, so on that input they stay in one group
    • On input n, states A, B and C move to members of the group (ABCD), but D goes to E; so split (ABCD) into (ABC) and (D)
    • Within (ABC), on input n, states A and C go to C but B goes to D; so split into (AC) and (B)
    • Within (AC), both states have the same transitions, so keep only one of them (A)
    • Check for a dead state: in this example there is no dead state
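The finished table drops straight into code as a two-dimensional array. The sketch below simulates the minimized machine from the example (A stands for the merged pair (A, C)); the transitions out of E are not spelled out on the slide and are filled in from the same construction, so treat the full table as a reconstruction:

    #include <stdio.h>

    enum { A, B, D, E, NSTATES };          /* A also represents the merged state (A, C) */
    enum { M, N, NSYMS };                  /* input symbols m and n */

    /* delta[state][symbol]: minimized DFA for (m|n)*mnn */
    static const int delta[NSTATES][NSYMS] = {
        /* A */ { B, A },
        /* B */ { B, D },
        /* D */ { B, E },
        /* E */ { B, A },
    };

    /* Returns 1 if the string of 'm'/'n' characters is accepted. */
    static int accepts(const char *s) {
        int state = A;
        for (; *s; s++) {
            int sym = (*s == 'm') ? M : (*s == 'n') ? N : -1;
            if (sym < 0) return 0;         /* symbol outside the alphabet */
            state = delta[state][sym];
        }
        return state == E;                 /* E is the only accepting state */
    }

    int main(void) {
        const char *tests[] = { "mnn", "mmnn", "nmnmnn", "mnm", "nn" };
        for (int i = 0; i < 5; i++)
            printf("%-8s %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
        return 0;
    }

Only the strings ending in mnn are accepted, as expected.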
Grammar Types - Definitions
• Terminal symbols: atomic (non-divisible) symbols of the language
• Non-terminal symbols (variable symbols, syntactic categories, syntactic variables, abstractions): a single non-terminal may have more than one right-hand side (RHS) derivation, the alternatives being separated by a divisor (|)
• Start symbol (distinguished symbol): the basic category that is being defined
• Productions (rewriting rules): rules used to define the structure of the constructs; they specify how a non-terminal can be rewritten using terminal and non-terminal symbols. A rule has a left-hand side (LHS) that derives a right-hand side (RHS) made up of terminal and non-terminal symbols
• Grammar: a finite, non-empty set of rules
• Syntactic lists: lists are represented using recursion
  − <ident_list> → ident | ident, <ident_list>
• Derivation: the process of repeatedly applying the rules, starting from the start symbol, until no non-terminal symbols remain to be expanded

Grammar Types
• Unrestricted grammar
  − Also called recursively enumerable, phrase-structured, or Type 0 grammar
  − There is no restriction on the right-hand side of a production
  − At least one non-terminal symbol must be present on the left-hand side of a production
  − α → β, where α ∈ (V ∪ T)+ contains at least one variable and β ∈ (V ∪ T)*
  − V: the finite set of variable symbols; T: the finite set of terminal symbols
  − Example: S → ACaB; Ca → aaC
• Context-sensitive grammar
  − Also called Type 1 grammar
  − Requires that the right-hand side of a production have no fewer symbols than the left-hand side
  − Called context-sensitive because the replacement of a variable depends on what surrounds it
  − α1 A α2 → α1 β α2, where A ∈ V, α1, α2 ∈ (V ∪ T)*, and β ∈ (V ∪ T)+
  − Example: Things → b b Thing; Thing c → Other b c
• Context-free grammar
  − Also called Type 2 grammar
  − Developed by Noam Chomsky in the mid-1950s
  − The left-hand side of a production is a single variable symbol; the right-hand side is a combination of terminal and variable symbols
  − Productions take the form A → α, where A ∈ V and α ∈ (V ∪ T)*
  − Example: Fraction → Digit; Fraction → Digit Fraction
• Regular grammar
  − Also called restrictive grammar or Type 3 grammar
  − Each production is restricted to a single terminal, or a single terminal together with a single variable, on the right-hand side
  − Regular grammars are classified as right-linear or left-linear
  − Right-linear grammar: A → xB or A → x, where A, B ∈ V and x ∈ T
  − Left-linear grammar: A → Bx or A → x, where A, B ∈ V and x ∈ T
  − Regular expressions vs. context-free grammars:
    • Lexical rules are simple, so they do not need a notation as powerful as a context-free grammar
    • Recognizers for such token languages can be constructed directly from regular expressions

Grammar Types
• Backus-Naur Form (BNF)
  − Invented by John Backus to describe Algol 58
  − Described as a metalanguage: a language used to describe another language
  − Considered equivalent to context-free grammar
  − Abstractions are used to represent classes of syntactic structures; they act like non-terminal symbols
    • To represent a while statement: <while_stmt> → while ( <logic_expr> ) <stmt>
  − Reasons for using BNF to describe syntax:
    • BNF provides a clear and concise syntax description
    • The parser can be based directly on the BNF
    • Parsers based on BNF are easier to handle
• Extended BNF (EBNF)
  − BNF notation plus regular-expression operators
  − Several notations are in use:
    • Optional parts: marked with a subscript opt or enclosed in square brackets
      − <proc_call> → ident (<expr_list>)opt
      − <proc_call> → ident [ ( <expr_list> ) ]
    • Alternative parts:
      − The pipe (|) indicates an either-or choice
      − The choices are grouped with square brackets or parentheses
        • <term> → <term> [+ | -] const
        • <term> → <term> (+ | -) const
    • Repetitions (zero or more) are placed in braces ({ })
      − An asterisk after the braces indicates zero or more occurrences of the item
      − Since the braces themselves already indicate zero or more occurrences, the presence or absence of the asterisk means the same thing here
        • <ident> → letter {letter | digit}*
        • <ident> → letter {letter | digit}
• Differences between the BNF and EBNF notations
  − BNF:
    • <expr> → <expr> + <term> | <expr> - <term> | <term>
    • <term> → <term> * <factor> | <term> / <factor> | <factor>
  − EBNF:
    • <expr> → <term> {[+ | -] <term>}*
    • <term> → <factor> {[* | /] <factor>}*
  − The EBNF form uses the final replacement of <expr> by <term> and writes the right-hand side without any <expr> occurrence in it

Derivation
• Apply the grammar to the start symbol <program> and continue to expand until no non-terminal symbols are left on the right-hand side
• Methods of derivation
  − Leftmost derivation: the leftmost non-terminal in each sentential form is the one expanded
  − Parse tree (derivation tree)
    • A top-down parser keeps the start symbol as the root of the tree and then replaces each variable symbol with a string of terminal symbols
    • A bottom-up parser begins with the terminal symbols; these are matched against the right-hand side of a production and replaced with the corresponding variable symbol on the left-hand side of that production
    • Parse trees can be used to attach the semantics of a construct to its syntactic structure; this is called syntax-directed semantics

Derivation - Example
• Given the regular grammar S ::= aS | bS | a | b, check whether the grammar can derive strings of the form a^n b^n
  − For a^1 b^1: S ⇒ aS ⇒ ab
  − For a^2 b^2: S ⇒ aS ⇒ aaS ⇒ aabS ⇒ aabb
  − For a^3 b^3: S ⇒ aS ⇒ aaS ⇒ aaaS ⇒ aaabS ⇒ aaabbS ⇒ aaabbb
  − The required form can be derived with this regular grammar (although the grammar also generates many strings that are not of this form)
Grammar Issues
• Ambiguities in grammars
  − A grammar is ambiguous if it generates a sentential form that has two or more distinct parse trees
  − Example: the if statement with a dangling else
    • IfStatement → if ( Expression ) Statement
    • IfStatement → if ( Expression ) Statement else Statement
    • (Figure: a nested if with a single else has two distinct parse trees, one attaching the else to the inner if and one attaching it to the outer if.)

Grammar Transformations
• Left factorization
  − Used when the alternatives on the right-hand side of a rule begin with the same element
    • N → XY | XZ becomes N → X (Y | Z)
• Elimination of left recursion
  − Used when the first element of a right-hand side is the left-hand side non-terminal itself
    • N → X | NY can be rewritten as N → XY*
  − The derivation from NY can only terminate when N is eventually replaced by X; if N → X were never used, no Y would ever be produced
    • N ⇒ NY ⇒ NYY ⇒ XYY
• Substitution of non-terminal symbols
  − A non-terminal appearing on the right-hand side of a rule can be replaced using its own rule
    • N → X and M → a N can be changed to N → X and M → a X

Syntax Diagram
• Also called syntax charts or railroad diagrams
• Developed by Niklaus Wirth in 1970
• Used to visualize grammar rules in the form of diagrams
• Used to represent EBNF notation, not BNF notation
• Variables are drawn as rectangles and terminal symbols as circles (sometimes ovals)
• Each production rule is represented as a directed graph whose vertices are symbols

Recursive Descent Parsing
• There is a subprogram for each non-terminal in the grammar; it parses the sentences that can be generated by that non-terminal
• To proceed with the correct grammar rule, each terminal symbol in the right-hand side is matched against the next input token
  − If there is a match, parsing continues
  − Otherwise an error is reported, or other rules are tried
• If a non-terminal has more than one RHS, the one to apply is chosen as follows:
  − Choose the correct RHS based on the next token (the lookahead)
  − The next token is compared with the first token that can be generated by each RHS until a match is found
  − If there is no match, it is a syntax error
• Shift-reduce parsing: with a given grammar and a given input string, right-hand sides occurring in the input string are repeatedly reduced until the start symbol of the grammar is reached
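As a concrete illustration of this scheme, here is a hand-written recursive-descent parser (and evaluator) for the expression grammar E → E + T | T, T → T * F | F, F → (E) | id, written in the loop form obtained after removing the left recursion. Single-character tokens, with digits standing in for id, are simplifications made for the sketch:

    #include <stdio.h>
    #include <ctype.h>
    #include <stdlib.h>

    static const char *input;          /* tokens are single characters of this string */
    static char lookahead;

    static void advance(void) { lookahead = *input ? *input++ : '\0'; }

    static void error(const char *msg) {
        fprintf(stderr, "syntax error: %s (at '%c')\n", msg, lookahead);
        exit(1);
    }

    static void match(char expected) {
        if (lookahead == expected) advance();
        else error("unexpected token");
    }

    static int expr(void);             /* E -> T { '+' T } */
    static int term(void);             /* T -> F { '*' F } */
    static int factor(void);           /* F -> '(' E ')' | digit */

    static int factor(void) {
        if (lookahead == '(') {        /* F -> ( E ) */
            match('(');
            int v = expr();
            match(')');
            return v;
        }
        if (isdigit((unsigned char)lookahead)) {   /* F -> id (a digit here) */
            int v = lookahead - '0';
            advance();
            return v;
        }
        error("expected '(' or digit");
        return 0;
    }

    static int term(void) {
        int v = factor();
        while (lookahead == '*') { match('*'); v *= factor(); }
        return v;
    }

    static int expr(void) {
        int v = term();
        while (lookahead == '+') { match('+'); v += term(); }
        return v;
    }

    int main(void) {
        input = "(2+3)*4+5";
        advance();                     /* load the first token */
        int v = expr();
        if (lookahead != '\0') error("trailing input");
        printf("(2+3)*4+5 = %d\n", v); /* prints 25 */
        return 0;
    }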
Concrete and Abstract Syntax
• Concrete syntax
  − Defines the structure of all the parts of a program: arithmetic expressions, assignments, loops, functions, definitions, and so on
  − Context-free grammars, BNF, EBNF, etc. describe concrete syntax
    • Assignment → Identifier = Expression ;
    • Expression → Term | Expression + Term
• Abstract syntax
  − Generated by the parser and used to link the syntax and the semantics of a program
  − Unlike concrete syntax, abstract syntax records only the essential syntactic elements and does not describe how they are written
    • Statement = Assignment | Loop
    • Assignment = Variable target; Expression source
  − Ambiguity can occur in concrete syntax but not in abstract syntax

Symbol Table
• Identification tables
  − Also called symbol tables
  − A dictionary-like data structure that stores identifier names together with their attributes
• The organization of the identification table depends on the "block structure" used by the language
  − Monolithic block structure: e.g. BASIC, COBOL
  − Flat block structure: e.g. Fortran
  − Nested block structure, used in the modern "block-structured" programming languages (e.g. Algol, Pascal, C, C++, Scheme, Java, …)
• Monolithic block structure
  − A single block covers the entire program
  − Every identifier is visible throughout the entire program
  − The scope of each identifier is the whole program, and an identifier cannot be declared twice
• Flat block structure
  − The program is divided into several disjoint blocks
  − Declarations can be local or global
  − Identifiers can be redefined in another block
  − A local declaration takes priority over a global declaration
• Nested block structure
  − Blocks may be nested one within another
  − The scope of an identifier depends on the level of nesting
  − An identifier cannot be defined more than once at the same level within the same block

Symbol Table Structure
• Unordered list: the data is stored in an array or a linked list
• Ordered list
  − Entries in the list are kept in order
  − Searching is faster
  − Inserting data into the list is expensive
• Binary search tree: searching takes O(log n) time
• Hash table
  − The most commonly used option
  − The data can be accessed in constant time
  − Storing the data is not time consuming
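A minimal sketch of such a hash-based symbol table, using separate chaining and a single integer attribute per identifier (the layout, hash function and names are invented for the example):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 211                    /* a small prime number of buckets */

    typedef struct Entry {
        char *name;                         /* identifier (lexeme) */
        int   attr;                         /* e.g., a type code or token kind */
        struct Entry *next;                 /* chain for colliding entries */
    } Entry;

    static Entry *table[NBUCKETS];

    static unsigned hash(const char *s) {   /* simple multiplicative string hash */
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    static Entry *lookup(const char *name) {
        for (Entry *e = table[hash(name)]; e; e = e->next)
            if (strcmp(e->name, name) == 0) return e;
        return NULL;
    }

    static Entry *insert(const char *name, int attr) {
        Entry *e = lookup(name);
        if (e) return e;                    /* already present: reuse the entry */
        unsigned h = hash(name);
        e = malloc(sizeof *e);
        e->name = strdup(name);
        e->attr = attr;
        e->next = table[h];                 /* push onto the bucket's chain */
        table[h] = e;
        return e;
    }

    int main(void) {
        insert("count", 1);
        insert("if", 2);                    /* reserved words can be preloaded */
        Entry *e = lookup("count");
        printf("count -> %d, if is %s\n", e->attr,
               lookup("if") ? "reserved" : "unknown");
        return 0;
    }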
LL Grammar
• The first L in LL indicates a left-to-right scan of the input
• The second L indicates that a leftmost derivation is produced
• The first step towards an LL grammar is the elimination of common prefixes (left factoring); here α and the βi may stand for zero or more symbols
  − A rule of the form B → αβ1 | αβ2 | … | αβm | γ1 | γ2 | … | γn
  − is replaced with
    • B → αB1 | γ1 | γ2 | … | γn
    • B1 → β1 | β2 | … | βm
• Convert the grammar into an unambiguous one
  − Make sure the rules obey the precedence and associativity rules
  − Start from the terminals and work from high precedence to low precedence
  − Consider the grammar E → E + E | E * E | (E) | id
    • Select the terminal alternatives and give them a new name: Factor → (E) | id
    • The * operator has higher priority than the + operator, so E → E * E is handled next
      − To provide the link between E * E and Factor, use the pipe (|) operator; without this link the non-terminal would never reach a terminal
      − Give the new element the name "Term": Term → Term * Factor | Factor
    • Then handle E → E + E in the same way: Expression → Expression + Term | Term
    • The result: F → (E) | id;  T → T * F | F;  E → E + T | T
• Remove left recursion
  − If A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn, where no βi begins with A (here A is E, the αi are +T, and the βi are T),
  − replace it with
    • A → β1A' | β2A' | … | βnA'
    • A' → α1A' | α2A' | … | αmA' | ε
• Consider the resulting grammar: E → TE';  E' → +TE' | ε;  T → FT';  T' → *FT' | ε;  F → (E) | id
• FIRST and FOLLOW
  − FIRST:
    • If X is a terminal, then FIRST(X) is {X}
    • If X is a non-terminal and X → aα is a production (a a terminal), then add a to FIRST(X); if X → ε is a production, then add ε to FIRST(X)
    • If X → Y1Y2…Yk is a production, then for every i such that Y1, …, Yi-1 are all non-terminals and FIRST(Yj) contains ε for j = 1, …, i-1, add every non-ε symbol of FIRST(Yi) to FIRST(X). If ε is in FIRST(Yj) for all j = 1, …, k, then add ε to FIRST(X)
  − The third FIRST rule applies to cases like E → TE', where T → FT' and F → (E) | id; whatever is in FIRST(F) is therefore also in FIRST(T) and FIRST(E)
  − FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
  − FIRST(E') = {+, ε}
  − FIRST(T') = {*, ε}
  − FOLLOW (α and β stand for strings of grammar symbols and may be empty):
    • $ is in FOLLOW(S), where S is the start symbol
    • If there is a production A → αBβ with β ≠ ε, then everything in FIRST(β) except ε is in FOLLOW(B)
    • If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B)
  − When computing FOLLOW, apply the first rule to the whole grammar, then the second rule, then the third, and so on
  − Note: refer to the notes for a verbal explanation of the FIRST and FOLLOW rules
  − (Figure: the second FOLLOW rule, for A → αBβ, puts FIRST(β) minus ε into FOLLOW(B); the third rule, for A → αB or A → αBβ with ε in FIRST(β), puts FOLLOW(A) into FOLLOW(B).)
  − FOLLOW(E) = FOLLOW(E') = {), $}
  − FOLLOW(T) = FOLLOW(T') = {+, ), $}
  − FOLLOW(F) = {+, *, ), $}
• Generating the parsing table
  − A grammar whose parsing table has no multiply-defined entries is said to be LL(1). (Below, α is any string of grammar symbols and may be ε.)
  1. For each production A → α of the grammar, perform steps 2 and 3
  2. For each terminal a in FIRST(α), add A → α to M[A, a]
  3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A); if ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $]
    − Note: M[A, b] denotes the table cell whose row corresponds to the non-terminal A and whose column corresponds to the terminal b
  4. Make each remaining undefined entry of M an error
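Once the table is built, the predictive parser itself is just a stack and a loop that consults M[X, a]. Below is a compact C sketch for the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, with the table derived from the FIRST and FOLLOW sets above hard-coded into a function. Single-character tokens, with i standing for id and lower-case e and t standing for E' and T', are assumptions made to keep the sketch short:

    #include <stdio.h>
    #include <string.h>

    /* Returns the right-hand side chosen by M[nonterm, tok], "" for epsilon, NULL for error. */
    static const char *production(char nonterm, char tok) {
        switch (nonterm) {
        case 'E': return (tok=='i' || tok=='(') ? "Te"  : NULL;
        case 'e': return tok=='+' ? "+Te" : (tok==')' || tok=='$') ? "" : NULL;
        case 'T': return (tok=='i' || tok=='(') ? "Ft"  : NULL;
        case 't': return tok=='*' ? "*Ft" : (tok=='+' || tok==')' || tok=='$') ? "" : NULL;
        case 'F': return tok=='i' ? "i" : tok=='(' ? "(E)" : NULL;
        default:  return NULL;
        }
    }

    static int is_nonterminal(char c) { return strchr("EeTtF", c) != NULL; }

    static int parse(const char *input) {
        char stack[128];
        int top = 0;
        stack[top++] = '$';
        stack[top++] = 'E';                      /* start symbol on top of the stack */

        while (top > 0) {
            char X = stack[top - 1];
            char a = *input ? *input : '$';
            if (!is_nonterminal(X)) {            /* terminal or $: must match the input */
                if (X != a) return 0;
                top--;
                if (X == '$') return 1;          /* matched the end marker: accept */
                input++;
            } else {
                const char *rhs = production(X, a);
                if (!rhs) return 0;              /* empty table cell: error */
                top--;                           /* pop X ... */
                for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                    stack[top++] = rhs[k];       /* ... and push the RHS reversed */
            }
        }
        return 0;
    }

    int main(void) {
        const char *tests[] = { "i+i*i", "(i+i)*i", "i+*i", "i)" };
        for (int n = 0; n < 4; n++)
            printf("%-10s %s\n", tests[n], parse(tests[n]) ? "accepted" : "rejected");
        return 0;
    }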
LR Grammar
• LR: a left-to-right scan of the input producing a rightmost derivation in reverse
• The most powerful shift-reduce parsing technique
  − A non-backtracking shift-reduce method that detects a syntactic error as soon as possible
• Written LR(k), where k is the amount of look-ahead
• LR(1) means one symbol of look-ahead: only the next input element is consulted, nothing beyond it
• LR parsers can parse all grammars that can be parsed with predictive parsers such as LL(1)
• Types of LR parsers:
  − SLR: simple LR parser
  − LR: the most general LR parser
  − LALR: intermediate LR parser (look-ahead LR parser)
• All three types use the same algorithm, but with different parsing tables

LR Grammar
• LR parser configuration: (S0 X1 S1 … Xm Sm, ai ai+1 … an $), consisting of the stack contents and the remaining input
  − Xi is a grammar symbol, Si is a state, ai is an input symbol
  − Initially the stack contains just S0
  − (Figure 2.11. LR parsing: the driver uses an action table, indexed by state and terminal (or $), whose entries are one of four actions, and a goto table, indexed by state and non-terminal, whose entries are state numbers.)
• The parser chooses its action from Sm and ai:
  − shift s: shift the next input symbol ai and the state s onto the stack
    • (S0 X1 S1 … Xm Sm, ai ai+1 … an $) becomes (S0 X1 S1 … Xm Sm ai s, ai+1 … an $)
  − reduce A → β (written rn, where n is a production number):
    • pop r pairs of grammar symbols and states from the stack, where r is the length of β; this removes the right-hand side so that it can be replaced by the left-hand side of the production
    • then push A and the state s = goto[Sm-r, A]; the index m - r reflects that r items have been taken off the stack
    • (S0 X1 S1 … Xm Sm, ai ai+1 … an $) becomes (S0 X1 S1 … Xm-r Sm-r A s, ai … an $)
    • the output is the reducing production, reduce A → β
  − accept: parsing has completed successfully
  − error: the parser has detected an error; this occurs when the corresponding entry in the action table is empty
• GOTO takes a state and a grammar symbol as arguments and produces a state

Phases of LR Grammar Processing
• Closure: if I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items constructed from I by two rules:
  1. Initially, every LR(0) item in I is added to closure(I)
  2. If A → α·Bβ is in closure(I) and B → γ is a production of G, then B → ·γ is added to closure(I). Here B is a non-terminal; α and β may be empty
  − These rules are applied until no more LR(0) items can be added to closure(I)
  − For the grammar E' → E; E → E+T; E → T; T → T*F; T → F; F → (E); F → id
    (check for a non-terminal after the dot; if there is one, continue with its productions):
    • closure({E' → ·E}) = { E' → ·E, E → ·E+T, E → ·T, T → ·T*F, T → ·F, F → ·(E), F → ·id }
• GOTO: if I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then goto(I, X) is defined as follows:
  − If A → α·Xβ is in I, then every item in closure({A → αX·β}) is in goto(I, X)
  − Example: I = { E' → ·E, E → ·E+T, E → ·T, T → ·T*F, T → ·F, F → ·(E), F → ·id }
    • goto(I, E) = { E' → E·, E → E·+T }  (move the dot one step across E)
    • goto(I, T) = { E → T·, T → T·*F }  (move the dot one step across T)
    • goto(I, F) = { T → F· }  (move the dot one step across F)
    • goto(I, () = { F → (·E), E → ·E+T, E → ·T, T → ·T*F, T → ·F, F → ·(E), F → ·id }  (after moving the dot past (, a non-terminal follows the dot, so its closure is added)
    • goto(I, id) = { F → id· }  (move the dot one step across id)
• Canonical LR(0) collection (needed to create the SLR parsing table):
    C := { closure({S' → ·S}) }
    repeat until no more sets of LR(0) items can be added to C:
      for each I in C and each grammar symbol X:
        if goto(I, X) is not empty and not in C, add goto(I, X) to C
  − The goto function is a DFA on the sets in C. I1 is obtained from I0 by the transition on E; I2 and I3 are obtained by the transitions on T and F
  − For I4 the dot has been moved past the open bracket; since the dot is then followed by the non-terminal E, all the E-items from I0 (E → ·E+T and E → ·T) are added, and since non-terminals (T and F) still follow the dot, their items are added as well
  − I5 is obtained by the transition on id from I0; further transitions (for example on + from I1 and on * from I2) give I6, I7, and so on
  − (Figure 2.12. SLR transitions: the goto graph over the item sets I0 to I11.)

LR Grammar - Creating the SLR Parsing Table
1. Construct the canonical collection of sets of LR(0) items for G': C = {I0, …, In}
2. Create the parsing action table as follows:
  • If a is a terminal, A → α·aβ is in Ii and goto(Ii, a) = Ij, then action[i, a] is shift j
  • If A → α· is in Ii, then action[i, a] is reduce A → α for every a in FOLLOW(A), where A ≠ S'. The reduce entry is written using the sequence number of A → α in the grammar. (Note that nothing follows the dot; α may be empty)
  • If S' → S· is in Ii, then action[i, $] is accept. Here, with E as the start symbol S, the item E' → E· produces the accept entry
  • If these rules generate any conflicting actions, the grammar is not SLR(1)
3. Create the parsing goto table: for every non-terminal A, if goto(Ii, A) = Ij, then goto[i, A] = j
4. All entries not defined by steps (2) and (3) are errors
5. The initial state of the parser is the one containing S' → ·S
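The parsing algorithm described above is easy to express as a table-driven loop. The sketch below hard-codes the standard SLR table for the grammar 1) E → E+T, 2) E → T, 3) T → T*F, 4) T → F, 5) F → (E), 6) F → id; the two entries quoted just below (action[0, id] = s5 and action[1, +] = s6) agree with it, but the rest of the table is a reconstruction from the same construction and should be treated as an assumption. Tokens are single characters, with i standing for id:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    enum { T_ID, T_PLUS, T_STAR, T_LP, T_RP, T_END, NTOK };
    enum { NT_E, NT_T, NT_F, NNT };

    static const char *action_tbl[12][NTOK] = {
     /*        id     +      *      (      )      $   */
     /* 0*/ { "s5",  "",    "",    "s4",  "",    ""    },
     /* 1*/ { "",    "s6",  "",    "",    "",    "acc" },
     /* 2*/ { "",    "r2",  "s7",  "",    "r2",  "r2"  },
     /* 3*/ { "",    "r4",  "r4",  "",    "r4",  "r4"  },
     /* 4*/ { "s5",  "",    "",    "s4",  "",    ""    },
     /* 5*/ { "",    "r6",  "r6",  "",    "r6",  "r6"  },
     /* 6*/ { "s5",  "",    "",    "s4",  "",    ""    },
     /* 7*/ { "s5",  "",    "",    "s4",  "",    ""    },
     /* 8*/ { "",    "s6",  "",    "",    "s11", ""    },
     /* 9*/ { "",    "r1",  "s7",  "",    "r1",  "r1"  },
     /*10*/ { "",    "r3",  "r3",  "",    "r3",  "r3"  },
     /*11*/ { "",    "r5",  "r5",  "",    "r5",  "r5"  },
    };

    static const int goto_tbl[12][NNT] = {
     /*        E   T   F */
     /* 0*/ {  1,  2,  3 }, /* 1*/ { -1, -1, -1 }, /* 2*/ { -1, -1, -1 },
     /* 3*/ { -1, -1, -1 }, /* 4*/ {  8,  2,  3 }, /* 5*/ { -1, -1, -1 },
     /* 6*/ { -1,  9,  3 }, /* 7*/ { -1, -1, 10 }, /* 8*/ { -1, -1, -1 },
     /* 9*/ { -1, -1, -1 }, /*10*/ { -1, -1, -1 }, /*11*/ { -1, -1, -1 },
    };

    static const int rhs_len[7] = { 0, 3, 1, 3, 1, 3, 1 };   /* |rhs| of each rule   */
    static const int lhs_nt[7]  = { 0, NT_E, NT_E, NT_T, NT_T, NT_F, NT_F };

    static int tok_of(char c) {
        switch (c) {
        case 'i': return T_ID;  case '+': return T_PLUS; case '*': return T_STAR;
        case '(': return T_LP;  case ')': return T_RP;   default:  return T_END;
        }
    }

    static int parse(const char *input) {
        int stack[128], top = 0;
        stack[top++] = 0;                                  /* initial state */
        for (;;) {
            int tok = tok_of(*input ? *input : '$');
            const char *a = action_tbl[stack[top - 1]][tok];
            if (a[0] == 's') {                             /* shift: push the new state */
                stack[top++] = atoi(a + 1);
                input++;
            } else if (a[0] == 'r') {                      /* reduce by rule n */
                int n = atoi(a + 1);
                top -= rhs_len[n];                         /* pop |rhs| states */
                stack[top] = goto_tbl[stack[top - 1]][lhs_nt[n]];
                top++;
                printf("reduce by rule %d\n", n);
            } else if (strcmp(a, "acc") == 0) {
                return 1;                                  /* accept */
            } else {
                return 0;                                  /* empty entry: error */
            }
        }
    }

    int main(void) {
        printf("%s\n", parse("i*i+i") ? "accepted" : "rejected");
        return 0;
    }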
Phases of LR Grammar Processing
• The productions are numbered: 1) E → E+T  2) E → T  3) T → T*F  4) T → F  5) F → (E)  6) F → id
• (The SLR parsing table itself is shown as a figure, with the item sets 0 to 11 as rows, the terminals and $ in the action part, and the non-terminals in the goto part.)
• The entry s5 at (row, column) = (0, id) is there because, in Figure 2.12, I0 goes to I5 on id; so action[0, id] = shift 5
• The entry s6 at (1, +) is there because I1 goes to I6 on +; and so on

LR Grammar - Given an Input id * id + id
• (Worked example, shown as a figure: the input id * id + id is parsed step by step using the SLR table.)

SLR(1) Grammar
• An SLR(1) grammar is called an SLR grammar for short
• An SLR grammar is always unambiguous, but not every unambiguous grammar is an SLR grammar
• An SLR grammar does not possess either of these conflicts:
  − Shift/reduce conflict: a state in which the parser cannot decide whether to shift or to reduce on some terminal
  − Reduce/reduce conflict: a state in which the parser cannot decide whether to reduce using production i or production j on some terminal
• Canonical SLR(1) parsing table
  − In the SLR method, state i calls for a reduction by A → α on the current token a if A → α· is in Ii and a is in FOLLOW(A)
  − In some situations, however, when state i is on top of the stack, A cannot be followed by the terminal a in any right-sentential form; making the reduction in such a case would be incorrect
• LR(1) items
  − To avoid such invalid reductions, the states must carry more information; this information is added as a terminal symbol forming a second component of each item
  − An LR(1) item is written A → α·β, a where a is the look-ahead of the item (a terminal or the end-marker $)
  − When β is not empty (in A → α·β, a), the look-ahead has no effect
  − When β is empty (A → α·, a), the reduction by A → α is made only if the next input symbol is a (not for every terminal in FOLLOW(A))
  − A state will contain the items A → α·, a1, …, A → α·, an, where {a1, …, an} ⊆ FOLLOW(A)
• Canonical collection of LR(1) items: similar to the LR(0) construction, with small changes to closure and goto
  − closure(I), where I is a set of LR(1) items:
    • every LR(1) item in I is in closure(I)
    • if A → α·Bβ, a is in closure(I) and B → γ is a production of G, then B → ·γ, b is in closure(I) for each terminal b in FIRST(βa)
      − B is the symbol next to the dot; the rules of any non-terminal that follows the dot are included in the closure
      − the look-ahead b indicates what can follow B, namely FIRST(βa); α and β may be empty
  − goto(I, X), where I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal):
    • if A → α·Xβ, a is in I, then every item in closure({A → αX·β, a}) is in goto(I, X)
    • in other words, the dot is moved one step forward
• (The LR(1) item sets for the grammar S' → S; S → L=R | R; L → *R | id; R → L are shown as a figure.) The numbering of the rules starts with 1; the initial production S' → S is excluded from the numbering
SLR(1) Grammar (continued)
• In I0: in the item S' → ·S, $ the look-ahead $ is the symbol that follows S'. Because the dot is followed by the non-terminal S, its rules are added as well (S → ·L=R, $ and S → ·R, $)
  − S' → ·S, $ matches A → α·Bβ, a and S → ·L=R, $ matches B → ·γ, b; the look-ahead stays $ because β is empty (so FIRST(β) is empty) and FIRST(βa) = FIRST($) = {$}. The dot is then followed by L and R, so their rules are added too; the dot stays at the beginning of the right-hand side of every added rule
• In I0: the items L → ·*R, =/$ and L → ·id, =/$ obtain their look-aheads from FIRST(βa): matching S → ·L=R, $ against A → α·Bβ, a gives FIRST(=R $) = {=}, and matching R → ·L, $ gives {$}
• In I0: R → ·L, $ does not contain = as a look-ahead, because A → α·Bβ, a is matched to S → ·R, $, β is empty, and a is $
• Transitions are made by moving the dot across a terminal or a non-terminal; the transition from I0 to I1 is on S

LR(1) Parsing Table Construction
1. Construct the canonical collection of sets of LR(1) items for G': C = {I0, …, In}
2. Create the parsing action table as follows:
  • If a is a terminal, A → α·aβ, b is in Ii and goto(Ii, a) = Ij, then action[i, a] is shift j
  • If A → α·, a is in Ii, then action[i, a] is reduce A → α, where A ≠ S'
  • If S' → S·, $ is in Ii, then action[i, $] is accept
  • If any conflicting actions are generated by these rules, the grammar is not LR(1)
3. Create the parsing goto table: for all non-terminals A, if goto(Ii, A) = Ij, then goto[i, A] = j
4. All entries not defined by (2) and (3) are errors
5. The initial state of the parser is the one containing S' → ·S, $

LALR Grammar
• LALR stands for Look-Ahead LR
• LALR tables are smaller than canonical LR(1) parsing tables; the number of states is the same as in the SLR table
• An LALR parser is obtained by shrinking the canonical LR(1) parser; the shrinking process must not produce reduce/reduce conflicts
• The core of an LR(1) item is its first component, i.e. the item without its look-ahead
  − For example, in S → L·=R, $ the core part is S → L·=R
• If there is more than one set of LR(1) items with the same core, they are merged into a single state
• Creating the LALR parsing table
  − Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar
  − Find all sets that have the same core and replace them with a single set, their union:
    • C = {I0, …, In} becomes C' = {J1, …, Jm}, where m ≤ n
LALR Grammar (continued)
• Creating the LALR parsing table (continued)
  − Create the parsing tables (action and goto) in the same way as for the LR(1) parser
    • Note: if J = I1 ∪ … ∪ Ik and I1, …, Ik have the same cores, then the cores of goto(I1, X), …, goto(Ik, X) must also be the same
    • So goto(J, X) = K, where K is the union of all sets of items having the same core as goto(I1, X)
  − If no conflict is introduced, the grammar is an LALR(1) grammar
    • (The merging can introduce reduce/reduce conflicts, but not shift/reduce conflicts)
• Ambiguous grammars produce conflicts
  − Consider the ambiguous grammar E → E + E | E * E | (E) | id and produce its parsing table

Error Recovery in LR Grammar
• Errors are detected by consulting the parsing action table
  − The goto table is not used to detect errors
• A canonical LR (LR(1)) parser will not make any reduction before announcing an error, but SLR and LALR parsers may make several reductions before reporting one
• Panic-mode error recovery in an LR parser
  − When an error is found, remove entries from the stack until a state s is reached that has a goto on a particular non-terminal A
  − Discard zero or more input symbols until a symbol a is found that is in FOLLOW(A)
  − The parser can then push the non-terminal A and the state goto[s, A] and proceed with parsing
• Phrase-level error recovery in an LR parser
  − Each empty entry in the action table is associated with a specific error routine that reflects the most likely error at that point
  − The routine may insert symbols into, or delete symbols from, the stack or the input
  − This is useful for handling a missing operand, an unbalanced right parenthesis, and so on

Programming the Scanner and Parser
• For the scanner:
  − lex (A Lexical Analyzer Generator): generates C code
  − Variants of lex: flex, AT&T lex, Abraxas PCLEX, MKS lex, POSIX lex, JFlex, …
• For the parser:
  − yacc ("Yet Another Compiler-Compiler", with AT&T yacc, Berkeley yacc and GNU Bison as variants)
  − Accent (conflicts and ambiguities can be checked with Amber)
• Programming with lex/flex
  − File name: filename.l
  − lex does not generate an executable; it generates the C routine yylex()
  − A program that calls yylex() is needed to run the lexer
• Lex programs are divided into three sections: the definitions section, the rules section, and the user subroutines section
  − The beginning and end of the rules section are marked with "%%"
  − Only the user subroutines section is optional
• In the definitions section, the part enclosed by %{ and %} is copied verbatim into the generated C program
  − C comments can also be placed outside the %{ %} block, but such comments must be indented with whitespace
• Rules section
  − Maps patterns to actions
  − If more than one action statement is needed, the actions are grouped with braces
• User subroutines section
  − May contain any number of subroutines
  − The subroutine that calls yylex() is copied verbatim into the C program
• Internal variables and functions of lex/flex:
  − yylval: contains the value of the token
  − yyleng: contains the length of the string the lexer has recognized
  − yyin: determines where the lexer reads its input; by default yyin is stdin
  − yylex(): the function that runs the lexer
  − yywrap(): called by yylex to check for the end of the input file
  − input(), output() and unput(): input() reads characters directly from the input, and unput() pushes characters back
  − Start states: defined using %s in the definitions section
  − ECHO: a macro that writes the matched token to the current output file yyout; it is equivalent to writing fprintf(yyout, "%s", yytext);
  − REJECT: used as an action to put back the text matched by the pattern and search for the next best match
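Putting the three sections together, a complete (if tiny) flex specification might look like the sketch below; it recognizes numbers and identifiers and echoes everything else. The file contents, token codes and the main() wrapper are illustrative, not taken from the slides:

    %{
    /* definitions section: this block is copied verbatim into the generated C file */
    #include <stdio.h>
    #include <stdlib.h>
    #define T_NUMBER 1
    #define T_IDENT  2
    int yylval;                     /* value of the current token */
    %}

    %%
        /* rules section: pattern followed by action */
    [0-9]+                  { yylval = atoi(yytext); return T_NUMBER; }
    [a-zA-Z_][a-zA-Z0-9_]*  { return T_IDENT; }
    [ \t\n]+                { /* skip whitespace */ }
    .                       { ECHO; }
    %%

    /* user subroutines section */
    int yywrap(void) { return 1; }  /* no more input after end of file */

    int main(void) {
        int tok;
        while ((tok = yylex()) != 0)
            printf("token %d, text '%s', length %d\n", tok, yytext, yyleng);
        return 0;
    }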
Programming the Scanner and Parser
• Programming with Yacc/Bison
  − Yacc does the work of an LALR(1) parser
  − Being LALR(1), yacc looks only one token ahead, so ambiguity that would require more than one token of look-ahead produces conflicts (errors)
  − The program structure in yacc is similar to that of lex
  − Definitions section: definitions, C code, and associativity/precedence rules are specified here
  − Yacc calls the yylex routine repeatedly to get the next token and then applies the specified rules
  − Since lex returns tokens to yacc, the two programs need to agree on what the tokens are
    • Definitions section in yacc: %token NUMBER
    • In the lex program:
      extern int yylval;
      %%
      [0-9]+   { yylval = atoi(yytext); return NUMBER; }
  − In the yacc program, do the following:
    • Specify the value types:
      %union { int ival; double cost; }
    • Connect the types to the returned tokens:
      %token <ival> INDEX
      %token <cost> NUMBER
    • Specify the type for the non-terminals; say the non-terminal expression carries a cost value:
      %type <cost> expression
    • Associativity and precedence rules are specified in the definitions section of the yacc program:
      %left '-' '+'
      %left '*' '/'
      %nonassoc UMINUS
  − Rules with actions:
      expression : expression '+' NUMBER   { $$ = $1 + $3; }
                 | expression '-' NUMBER   { $$ = $1 - $3; }
                 ;
    • $1 refers to the value of the first symbol on the right-hand side, $2 to the operator, and $3 to the second value; the left-hand side is referred to as $$
  − Using %union and yylval, only a single value can be passed between the lexer and the parser; a symbol table is used to pass multiple values
  − Errors are reported through the yyerror() function
  − When compiling the C programs generated by lex and yacc, use the -ly option of the C compiler; the yacc library must provide main() and yyerror()
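A matching yacc/bison specification for a tiny calculator ties these pieces together ($$ and $1/$3, %token, %left, yyerror). It assumes a lexer like the rule shown above ([0-9]+ setting yylval and returning NUMBER) that also returns the operators, parentheses and '\n' as single-character tokens; the file layout and names are illustrative:

    %{
    #include <stdio.h>
    int yylex(void);
    int yyparse(void);
    void yyerror(const char *s) { fprintf(stderr, "error: %s\n", s); }
    %}

    %token NUMBER
    %left '+' '-'
    %left '*' '/'

    %%
    input   : /* empty */
            | input line
            ;
    line    : expr '\n'          { printf("= %d\n", $1); }
            ;
    expr    : expr '+' expr      { $$ = $1 + $3; }
            | expr '-' expr      { $$ = $1 - $3; }
            | expr '*' expr      { $$ = $1 * $3; }
            | expr '/' expr      { $$ = $1 / $3; }
            | '(' expr ')'       { $$ = $2; }
            | NUMBER             { $$ = $1; }
            ;
    %%

    int main(void) { return yyparse(); }

The %left declarations give + and - lower precedence than * and /, which resolves the shift/reduce conflicts of the deliberately ambiguous expr rules.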
• Compilation and execution on the Linux platform
  − Compile the lex program: lex filename.l
  − Compile the yacc program: yacc -d filename.y
  − Compile the generated C programs: gcc -o output y.tab.c lex.yy.c -ly -ll
  − Run the program: ./output
• Compilation and execution on the Windows platform
  − Make sure that flex (flex.exe), bison (bison.exe) and tcc (the Tiny C Compiler, or any C compiler) are installed
  − Compile the lex program: flex filename.l
  − Compile the yacc program:
    • bison -d filename.y
    • bison -d filename.y -b y
  − Compile the generated C programs (using the Tiny C Compiler, tcc):
    • tcc -o output.exe y.tab.c lex.yy.c yyerror.c libyywrap.c yyinit.c main.c yyaccpt.c
  − Run the program: output.exe
• Programming with Accent and Amber
  − After writing the lex program, we write the accent program
  − Rules have a left-hand side and a right-hand side separated by a colon
  − The first symbol provided in the grammar is the start symbol, and the grammar is context-free

Programming the Scanner and Parser
• Accent
  − Parameters can be specified as in parameters (inherited attributes) and out parameters (synthesized attributes), enclosed in "<" and ">"
  − Statements written within %prelude { … } are copied literally into the generated C program
  − Given the grammar (R stands for root, E for expression, T for term and F for factor; id is a terminal represented by the token NUMBER):
    • R → E
    • E → E + T | T
    • T → T * F | F
    • F → (E) | id
  − the accent program is:

      %token NUMBER;

      root : expression<n>                      { printf("Final = %d\n", n); } ;

      expression<n> : expression<x> '+' term<y> { *n = x + y; }
        | term<n> ;

      term<n> : term<x> '*' factor<y>           { *n = x * y; }
        | factor<n> ;

      factor<n> : '(' expression<n> ')'
        | NUMBER<n> ;

• Programming with Accent
  − Compilation and execution on Linux:
    • lex filename.l
    • accent filename.acc
    • gcc -o output yygrammar.c lex.yy.c entire.c
    • Check for ambiguity using Amber:
      − accent filename.acc
      − gcc -o output -O3 yygrammar.c amber.c
      − output examples 1000
  − Compilation and execution on Windows:
    • flex filename.l
    • accent filename.acc
    • tcc -o output.exe yygrammar.c lex.yy.c entire.c yyerror.c libyywrap.c main.c yyinit.c yyaccpt.c
    • Check for ambiguity using Amber:
      − accent filename.acc
      − tcc -o output.exe yygrammar.c amber.c
      − output examples 1000