CSE 305 Lecture – Parsing
Feb 11, 2016
CSE 305 / Jayaraman

Program Statements

    stmt   → assign | cond | loop | cmpd
    assign → var = expr ;
    expr   → aexp | bexp
    cond   → if ‘(’ expr ‘)’ stmt [ else stmt ]
    loop   → while ‘(’ expr ‘)’ stmt
    cmpd   → ‘{’ stmts ‘}’
    stmts  → { stmt }

Ambiguity in Grammars

A grammar G is said to be ambiguous if some sentence in L(G) has at least two parse trees.

If one grammar for a language L is ambiguous, it does not follow that every grammar for L is ambiguous. If every grammar for L were ambiguous, L would be called an inherently ambiguous language (these are not common in practice).

Unambiguous grammars are important for parsing.

Ambiguity Example

    if (exp1) while (exp2) if (exp3) stmt1 else stmt2

Preferred parse (the else is paired with the nearest if, i.e. if (exp3)):

    if (exp1)
        while (exp2)
            if (exp3) stmt1
            else stmt2

Other parse (the else is paired with if (exp1)):

    if (exp1)
        while (exp2)
            if (exp3) stmt1
    else stmt2

Read Sebesta on Ambiguity

The “dangling else” ambiguity is one of the classic examples of an ambiguous grammar. Sebesta has more on this example, including how to make the grammar unambiguous.

Unambiguous grammars are desirable for parsing, since the parse tree guides further analysis, including code generation.

Ambiguity in Expressions

    assign → var = expr
    expr   → aexp | bexp
    aexp   → …
    bexp   → …

This grammar is ambiguous, because both aexp and bexp generate id, (id), ((id)), etc.

We can merge the aexp and bexp grammars, as follows:

    assign → var = expr
    expr   → term | term op1 expr
    term   → fact | fact op2 term
    fact   → num | true | false | ‘(’ expr ‘)’
    op1    → + | - | ‘||’
    op2    → * | / | ‘&&’

Need for Attributes

The merged expression grammar is unambiguous, but it generates many incorrectly typed expressions, such as:

    10 && 20
    true * 101
    10*20 || false - 30
    …

We need to constrain the grammar through the use of attributes and semantic clauses, to avoid over-generation. The needed attribute here is type information for every expression and sub-expression.
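The kind of type constraint described above can be sketched in Java. This is only an illustration, not the lecture's own code: the class TypeRules and its method names are hypothetical, and the rule that both operands must have the same type as the operator is an assumption made for this sketch.

```java
// Sketch only: TypeRules and its methods are hypothetical names.
final class TypeRules {
    // Type attribute of an operator: "int" for arithmetic, "bool" for logical.
    static String typeOfOperator(String op) {
        switch (op) {
            case "+": case "-": case "*": case "/":
                return "int";
            case "||": case "&&":
                return "bool";
            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    // Type attribute of a literal: true and false are "bool", numbers are "int".
    static String typeOfLiteral(String literal) {
        if (literal.equals("true") || literal.equals("false")) {
            return "bool";
        }
        Integer.parseInt(literal); // throws NumberFormatException if not a number
        return "int";
    }

    // Assumed rule: both operand types must equal the operator's type,
    // and the result has that same type.
    static String typeOfBinary(String leftType, String op, String rightType) {
        String t = typeOfOperator(op);
        if (!leftType.equals(t) || !rightType.equals(t)) {
            throw new IllegalStateException(
                "type error: " + leftType + " " + op + " " + rightType);
        }
        return t;
    }

    public static void main(String[] args) {
        // 10 * 20 is well typed.
        System.out.println(typeOfBinary(typeOfLiteral("10"), "*", typeOfLiteral("20")));
        // true * 101 is generated by the merged grammar but rejected here.
        try {
            typeOfBinary(typeOfLiteral("true"), "*", typeOfLiteral("101"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check of this kind attached to each grammar rule, expressions such as true * 101 are still accepted syntactically but are rejected as ill-typed.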
The next two slides were not presented in class, but are included here for completeness, since reference was made to attributes and attribute grammars (inherited and synthesized attributes) during the lecture.

Using Attributes and Semantic Rules to Specify Type-Correctness

    SYNTAX RULE                          SEMANTICS RULE
    op1<t>  → + | -                      t = “int”
    op1<t>  → ‘||’                       t = “bool”
    op2<t>  → * | /                      t = “int”
    op2<t>  → ‘&&’                       t = “bool”
    fact<t> → num                        t = “int”
    fact<t> → true                       t = “bool”
    fact<t> → false                      t = “bool”
    fact<t> → ‘(’ expr<t2> ‘)’           t = t2

t is a synthesized attribute.

Inherited and Synthesized Attributes

    ATTRIBUTE GRAMMAR                          TYPE OF ATTRIBUTE
    program → decls<ST> stmts<ST>              ST is inherited by decls and stmts
    type<t> → int                              t is synthesized
    type<t> → real                             t is synthesized
    type<t> → bool                             t is synthesized
    decls<ST> → type<t> idlist<ST, t>          ST is inherited; t is synthesized by type and inherited by idlist
    idlist<ST, t> → id<i1> { , id<i2> } ;      ST and t are inherited; i1 and i2 are synthesized

Broader Context for Parsing: Compiler Phases

[Figure: Source Code → Compiler → Target Code. Inside the compiler, Analysis (lexical, syntactic, semantic) is followed by Synthesis (intermediate code generation, optimization, code generation).]

Compiler Structure

1. Lexical: translates a sequence of characters into a sequence of ‘tokens’
2. Syntactic: translates a sequence of tokens into a ‘parse tree’; also builds the symbol table
3. Semantic: traverses the parse tree and performs global checks, e.g. type-checking, actual-parameter correspondence
4. Intermediate CG: traverses the parse tree and generates ‘abstract machine code’, e.g. triples, quadruples
5. Optimization: performs control and data flow analysis; removes redundant ops, moves loop-invariant ops outside the loop
6. Code Generation: translates intermediate code to actual machine code

A Simple Example

Source Code → lexer → Lexical Tokens → parser → Parse Tree → code gen’r → Target Code

Source Code:

    // declarations
    // not shown
    f = 1;
    i = 1;
    while (i < n) {
        i = i + 1;
        f = f * i;
    }
    print(f);

Lexical Tokens:

    id1 op1 int1 p1 id2 op1 int1 p1 key1 …

Parse Tree:

[Figure: parse tree built from the tokens above (nodes include p1, op1, id1, int1, id2, key1, op2, id3, …).]

Target Code:

         LD  R1, #1
         ST  R1, Mf
         LD  R2, #1
         ST  R2, Mi
         LD  R3, Mn
    L:   CMP R2, R3
         JF  Out
         INC R2
         ST  R2, Mi
         MUL R1, R2
         ST  R1, Mf
         JMP L
    Out: Print Mf

Java Bytecodes

    public static int fact(int n) {
        // n >= 0;
        int f = 1;
        int i = 1;
        while (i < n) {
            i = i + 1;
            f = f * i;
        }
        return f;
    }

    cmd> javap -c Factorial

Lexical Analyzer (lex)

• Scans the input file character by character, skipping over comments and white space (except in Python, where indentation is important).
• Two main outputs: token and value
• Token is an integer code for each lexical class: identifiers, numbers, keywords, operators, punctuation
• Value is the actual instance:
  • for identifiers, it is the string;
  • for numbers, it is their numeric value;
  • for keywords, operators and punctuation, the token code = token value

Clarifying the Lexical-Syntax Analyzer Interaction

• Although the diagram shows the lexical analyzer feeding its output to the syntax analyzer, in practice the syntax analyzer calls the lexical analyzer repeatedly.
• At each call, the lexical analyzer prepares the next token for the syntax analyzer.
• The lexical analyzer does not need to create an explicit ‘Lexical Token’ table, as shown in the previous diagram, since the syntax analyzer only needs to work with one token at a time.

Design of a Simple Parser

• We will see how to design a top-down parser for a simple language.
• The next few slides give the structure of the lexical analyzer; some of the details and terminology are taken from Sebesta.
• Then we will see how to design the parser.
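The on-demand interaction between the syntax analyzer and the lexical analyzer can be sketched as follows. This is a minimal illustration under stated assumptions, not the lexer used in the course: the class MiniLexer, its token codes, and the field identValue are hypothetical, and it handles only identifiers, integers, and the symbols =, + and ; (no comments or keywords).

```java
// Sketch only: MiniLexer, its token codes and identValue are hypothetical.
final class MiniLexer {
    static final int ID = 0, INT_LIT = 1, ASSIGN_OP = 2, ADD_OP = 3, SEMICOLON = 4, END = 5;

    private final String input;
    private int pos = 0;

    int nextToken;      // token code of the most recently scanned token
    int intValue;       // value, meaningful only when nextToken == INT_LIT
    String identValue;  // value, meaningful only when nextToken == ID

    MiniLexer(String input) {
        this.input = input;
    }

    // Scans exactly one token per call, records it in nextToken, and returns its code.
    int lex() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++; // skip white space
        }
        if (pos >= input.length()) {
            return nextToken = END;
        }
        char c = input.charAt(pos);
        if (Character.isDigit(c)) {
            int start = pos;
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
            intValue = Integer.parseInt(input.substring(start, pos));
            return nextToken = INT_LIT;
        }
        if (Character.isLetter(c)) {
            int start = pos;
            while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
            identValue = input.substring(start, pos);
            return nextToken = ID;
        }
        pos++; // single-character operator or punctuation
        switch (c) {
            case '=': return nextToken = ASSIGN_OP;
            case '+': return nextToken = ADD_OP;
            case ';': return nextToken = SEMICOLON;
            default: throw new IllegalArgumentException("unexpected character: " + c);
        }
    }

    public static void main(String[] args) {
        MiniLexer lexer = new MiniLexer("i = i + 1;");
        // The caller pulls one token at a time; no token table is built.
        while (lexer.lex() != END) {
            System.out.println(lexer.nextToken);
        }
    }
}
```

The loop in main plays the role of the syntax analyzer: it asks for the next token only when it is ready for it, so only one token needs to exist at any time.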
Token Codes

    class Token {
        public static final int SEMICOLON   = 0;
        public static final int COMMA       = 1;
        public static final int NOT_EQ      = 2;
        public static final int ADD_OP      = 3;
        public static final int SUB_OP      = 4;
        public static final int MULT_OP     = 5;
        public static final int DIV_OP      = 6;
        public static final int ASSIGN_OP   = 7;
        public static final int GREATER_OP  = 8;
        public static final int LESSER_OP   = 9;
        public static final int LEFT_PAREN  = 10;
        public static final int RIGHT_PAREN = 11;
        public static final int LEFT_BRACE  = 12;
        public static final int RIGHT_BRACE = 13;
        public static final int ID          = 14;
        public static final int INT_LIT     = 15;
        public static final int KEY_IF      = 16;
        public static final int KEY_INT     = 17;
        public static final int KEY_ELSE    = 18;
        public static final int KEY_WHILE   = 19;
        public static final int KEY_END     = 20;
    }

Lexer: Lexical Analyzer

    public class Lexer {
        static private Buffer buffer = new Buffer(…);
        static public int nextToken; // code
        static public int intValue;  // value
        …
        public static int lex() {
            // … sets nextToken and intValue each time it is called …
        }
    }

Parsing Strategies

There are two broad strategies for parsing:
* top-down parsing (a.k.a. recursive-descent parsing)
* bottom-up parsing

Top-down parsing is less powerful than bottom-up parsing, but it is preferred when manually constructing a parser.

Tools such as YACC automatically construct a bottom-up parser from a grammar, but the generated bottom-up parser is hard to understand. (JavaCC, another widely used parser generator, produces top-down parsers.)

Top-down Parsing

Grammar*:

    E → E + T
    E → T
    T → T * F
    T → F
    F → id
    F → ( E )

Parse tree for a + b * c, built from the root E downward:

            E
          / | \
         E  +  T
         |   / | \
         T  T  *  F
         |  |     |
         F  F     c
         |  |
         a  b

Choosing the correct expansion at each step is the issue.

* This grammar is not suited for top-down parsing; this is discussed later.
Bottom-up Parsing

Grammar:

    E → E + T
    E → T
    T → T * F
    T → F
    F → id
    F → ( E )

Parse tree for a + b * c, built from the leaves a, b, c upward:

            E
          / | \
         E  +  T
         |   / | \
         T  T  *  F
         |  |     |
         F  F     c
         |  |
         a  b

Choosing whether to ‘shift’ or ‘reduce’ and, if the latter, choosing the correct reduction are the issues.

Deterministic Parsing

The term ‘deterministic parsing’ means that the parser can, at each step, correctly decide which rule to use without any guesswork. This requires some peeking into (or looking ahead in) the input. For example:

    stmt → assign | cond | loop | cmpd

For a top-down parser to decide which of the above four cases applies, it needs to look into the input to see which is the next symbol, or “token”, in the input: identifier, if, while, {

Constructing a Top-down Parser (one void procedure per nonterminal)

Case 1: Alternation on RHS of rule, e.g.,

    stmt → assign | cond | loop | cmpd

Parser code:

    void stmt() {
        // The next token determines which alternative applies.
        switch (Lexer.nextToken) {
            case Token.ID:         { assign(); break; }
            case Token.KEY_IF:     { cond();   break; }
            case Token.KEY_WHILE:  { loop();   break; }
            case Token.LEFT_BRACE: { cmpd();   break; }
            default: break;
        }
    }

Constructing a Top-down Parser (cont’d)

Case 2: Sequencing on RHS of a rule, e.g.,

    decl → type idlist

Parser code:

    void decl() {
        type();
        idlist();
    }

Constructing a Top-down Parser (cont’d)

Case 3: Terminal Symbols on RHS of a rule:

    factor → num | ‘(’ expr ‘)’

Parser code:

    void factor() {
        switch (Lexer.nextToken) {
            case Token.INT_LIT:
                int i = Lexer.intValue;  // value of the number just scanned
                Lexer.lex();             // move past the number
                break;
            case Token.LEFT_PAREN:
                Lexer.lex();             // move past '('
                expr();
                if (Lexer.nextToken == Token.RIGHT_PAREN)
                    Lexer.lex();         // move past ')'
                else
                    syntaxerror("missing ')'");
                break;
            default:
                break;
        }
    }

Constructing a Top-down Parser (cont’d)

Case 4: Left-Factoring the RHS of a rule:

    expr → term | term + expr

Parser code:

    void expr() {
        term();
        if (Lexer.nextToken == Token.ADD_OP) {
            Lexer.lex();
            expr();
        }
    }

Left-recursion is not compatible with Top-down Parsing

Left-recursive Rule:

    expr → term | expr + term

Problem: We cannot decide which alternative to use even with lookahead.

Reason: The recursion in ‘expr’ must eventually end in ‘term’, thus both alternatives have the same set of leading terminal symbols. In addition, a procedure expr() that followed the second alternative would call itself before consuming any input, and so would never terminate.

Recognizer vs Parser

Terminology:

A “recognizer” only outputs a yes/no answer indicating whether an input string belongs to L(G), the language defined by a grammar G.

A “parser” builds upon the basic structure provided by the recognizer, enhancing it with attributes and semantic actions so as to produce additional output.

Adding attributes to the Parser

In a top-down parser, the attribute information is incorporated as follows:
- Inherited attributes of a grammar rule become input parameters of the corresponding procedure.
- Synthesized attributes become output parameters of the procedure.

NOTE: Java does not have output parameters, hence we explain how synthesized attributes are represented in a Java setting.

Object-Oriented Top-down Recognizer

A top-down parser is basically a set of mutually-recursive procedures, one per nonterminal of the grammar.

In the object-oriented approach, each such procedure can be made the constructor of a class. Two benefits:
(i) Run-time object structure = parse tree.
(ii) When grammar rules are enhanced with attributes, the synthesized attributes become fields of the class. Thus, synthesized attributes are like “decorations” added to the nodes of the parse tree.

Demo of an OO Top-down Recognizer written in Java and run under JIVE: Java Interactive Visualization Environment
http://www.cse.buffalo.edu/jive
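As a rough illustration of the object-oriented approach, the sketch below makes each parsing procedure the constructor of a class and stores a synthesized attribute in a field. It is not the JIVE demo itself. The class names (Tokens, Expr, Term, Factor, OoParserDemo) are hypothetical, the rule term → factor | factor * term and the integer attribute value are assumptions chosen for this example, and the grammar is kept right-recursive so that no left recursion arises.

```java
// Sketch only: all class names here are hypothetical.
final class Tokens {
    static final int INT_LIT = 0, ADD_OP = 1, MULT_OP = 2, LEFT_PAREN = 3, RIGHT_PAREN = 4, END = 5;
    static String input;
    static int pos, nextToken, intValue;

    static void start(String text) { input = text; pos = 0; lex(); }

    // Scans one token and records it in nextToken (and intValue for numbers).
    static void lex() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) { nextToken = END; return; }
        char c = input.charAt(pos);
        if (Character.isDigit(c)) {
            intValue = 0;
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
                intValue = intValue * 10 + (input.charAt(pos++) - '0');
            }
            nextToken = INT_LIT;
            return;
        }
        pos++;
        switch (c) {
            case '+': nextToken = ADD_OP; break;
            case '*': nextToken = MULT_OP; break;
            case '(': nextToken = LEFT_PAREN; break;
            case ')': nextToken = RIGHT_PAREN; break;
            default: throw new IllegalStateException("unexpected character: " + c);
        }
    }
}

// expr -> term | term + expr
final class Expr {
    final Term term;
    final Expr rest;  // null when the alternative "term" was used
    final int value;  // synthesized attribute, stored as a field

    Expr() {
        term = new Term();
        if (Tokens.nextToken == Tokens.ADD_OP) {
            Tokens.lex();
            rest = new Expr();
            value = term.value + rest.value;
        } else {
            rest = null;
            value = term.value;
        }
    }
}

// term -> factor | factor * term (assumed rule)
final class Term {
    final Factor factor;
    final Term rest;
    final int value;

    Term() {
        factor = new Factor();
        if (Tokens.nextToken == Tokens.MULT_OP) {
            Tokens.lex();
            rest = new Term();
            value = factor.value * rest.value;
        } else {
            rest = null;
            value = factor.value;
        }
    }
}

// factor -> num | '(' expr ')'
final class Factor {
    final Expr inner;  // null when the alternative "num" was used
    final int value;

    Factor() {
        if (Tokens.nextToken == Tokens.INT_LIT) {
            inner = null;
            value = Tokens.intValue;
            Tokens.lex();
        } else if (Tokens.nextToken == Tokens.LEFT_PAREN) {
            Tokens.lex();
            inner = new Expr();
            if (Tokens.nextToken != Tokens.RIGHT_PAREN) throw new IllegalStateException("missing ')'");
            Tokens.lex();
            value = inner.value;
        } else {
            throw new IllegalStateException("expected a number or '('");
        }
    }
}

final class OoParserDemo {
    static int evaluate(String text) {
        Tokens.start(text);
        Expr root = new Expr();  // the object graph under root mirrors the parse tree
        if (Tokens.nextToken != Tokens.END) throw new IllegalStateException("trailing input");
        return root.value;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("2 + 3 * (4 + 1)")); // prints 17
    }
}
```

After new Expr() returns, the objects reachable from root form the parse tree, and each node's value field is its synthesized attribute. An inherited attribute would instead be passed as a constructor parameter.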