Chapter 4: Syntax Analysis Part 1: Grammar Concepts CSE4100 Prof. Steven A. Demurjian Computer Science & Engineering Department The University of Connecticut 371 Fairfield Way, Unit 2155 Storrs, CT 06269-3155 steve@engr.uconn.edu http://www.engr.uconn.edu/~steve (860) 486 - 4818 Material for course thanks to: Laurent Michel Aggelos Kiayias Robert LeBarre CH4p1.1 Syntax Analysis - Parsing CSE4100 An overview of parsing : Functions & Responsibilities Context Free Grammars: Concepts & Terminology Writing and Designing Grammars Resolving Grammar Problems / Difficulties Top-Down Parsing Recursive Descent & Predictive LL Bottom-Up Parsing LR & LALR Key Issues in Error Handling Concluding Remarks/Looking Ahead CH4p1.2 An Overview of Parsing Why are Grammars to formally describe Languages Important ? CSE4100 1. Precise, easy-to-understand representations 2. Compiler-writing tools can take grammar and generate a compiler 3. allow language to be evolved (new statements, changes to statements, etc.) Languages are not static, but are constantly upgraded to add new features or fix “old” ones ADA ADA9x, C++ Adds: Templates, exceptions, How do grammars relate to parsing process ? CH4p1.3 Parsing During Compilation regular expressions errors CSE4100 source program lexical analyzer token get next token • uses a grammar to check structure of tokens • produces a parse tree • syntactic errors and recovery • recognize correct syntax • report errors parser symbol table parse tree rest of front end interm repres • also technically part or parsing • includes augmenting info on tokens in source, type checking, semantic analysis CH4p1.4 Parsing Responsibilities Syntax Error Identification / Handling CSE4100 Recall typical error types: Lexical : Misspellings Syntactic : Omission, wrong order of tokens Semantic : Incompatible types Logical : Infinite loop / recursive call Majority of error processing occurs during syntax analysis NOTE: Not all errors are identifiable !! Which ones? CH4p1.5 Key issues in Error Handling Detection Actual Detection Finding position at which they occur Reporting CSE4100 Clear / accurate presentation Recovery How to skip over to continue and find later errors Cannot impact compilation of correct programs CH4p1.6 What are some Typical Errors ? #include<stdio.h> int f1(int v) { int i,j=0; As reported by MS VC++ for (i=1;i<5;i++) CSE4100 { j=v+f2(i) } return j; } int f2(int u) { 'f2' undefined; assuming extern returning int syntax error : missing ';' before ‘}‘ syntax error : missing ';' before ‘return‘ fatal error : unexpected end of file found int j; j=u+f1(u*u) return j; } int main() { int i,j=0; for (i=1;i<10;i++) { Which are “easy” to recover from? Which are “hard” ? j=j+i*I; printf(“%d\n”,i); printf("%d\n",f1(j)); return 0; } CH4p1.7 Error Recovery Strategies Panic Mode – Discard tokens until a “synchro” token is found ( end, “;”, “}”, etc. ) CSE4100 -- Decision of designer -- Problems: skip input miss declaration – causing more errors miss errors in skipped material -- Advantages: simple suited to 1 error per statement Phase Level – Local correction on input -- “,” ”;” – Delete “,” – insert “;” -- Also decision of designer -- Not suited to all situations -- Used in conjunction with panic mode to allow less input to be skipped CH4p1.8 Error Recovery Strategies – (2) Error Productions: -- Augment grammar with rules -- Augment grammar used for parser CSE4100 construction / generation -- Add a rule for := in C assignment statements Report error but continue compile -- Self correction + diagnostic messages -- What are other rules ? (supported by yacc) Global Correction: -- Adding / deleting / replacing symbols is chancy – may do many changes ! -- Algorithms available to minimize changes costly - key issues What do you think of each approach? What approach do compilers you’ve used take ? CH4p1.9 Motivating Grammars • Regular Expressions CSE4100 Basis of lexical analysis Represent regular languages • Context Free Grammars Basis of parsing Represent language constructs Characterize context free languages Reg. Lang. CFLs EXAMPLE: anbn , n 1 : Is it regular ? CH4p1.10 Context Free Languages Regular Languages Context Free Languages CSE4100 Token Structure Sentence Structure Scanner Generation Reg. Languag es Parser Generation Context Free Languages CH4p1.11 Context Free Grammars : Concepts & Terminology Definition: A Context Free Grammar, CFG, is described by T, NT, S, PR, where: CSE4100 T: Terminals / tokens of the language NT: Non-terminals to denote sets of strings generatable by the grammar & in the language S: Start symbol, SNT, which defines all strings of the language PR: Production rules to indicate how T and NT are combines to generate valid strings of the language. PR: NT (T | NT)* Like a Regular Expression / DFA / NFA, a Context Free Grammar is a mathematical model ! CH4p1.12 Context Free Grammars : A First Look assign_stmt id := expr ; expr term operator term CSE4100 term id term real term integer What do “BLUE” symbols represent? What do “BLACK” symbols represent? operator + operator - Derivation: A sequence of grammar rule applications and substitutions that transform a starting non-term into a collection of terminals / tokens. Simply stated: Grammars / production rules allow us to “rewrite” and “identify” correct syntax. CH4p1.13 How is Grammar Used ? Given the rules on the previous slide, suppose CSE4100 id := real + int; is input. Is it syntactically correct? How do we know? expr is represented as: expr term operator term Is this accurate / complete? expr expr operator term expr term How does this affect the derivations which are possible? CH4p1.14 Derivation and Parse Tree Let’s derive: id := id + real – integer ; CSE4100 What’s its parse tree ? CH4p1.15 Derivation Example Let’s CSE4100 derive: id := id + real – integer ; assign_stmt assign_stmt → id := expr ; → id := expr ; expr → expr operator term → id := expr operator term; expr → expr operator term → id := expr operator term operator term; expr → term → id := term operator term operator term; term → id → id := id operator term operator term; operator →+ → id := id + term operator term; term → real → id := id + real operator term; operator →- → id := id + real - term; term → integer → id := id + real - integer; assign_stmt → id := expr ; expr → expr operator term expr → term term → id term → real term → integer operator → + operator → CH4p1.16 Example Grammar CSE4100 expr expr op expr expr ( expr ) expr - expr expr id op + op op * op / op Black : NT Blue : T expr : S 9 Production rules To simplify / standardize notation, we offer a synopsis of terminology. CH4p1.17 Example Grammar - Terminology Terminals: a,b,c,+,-,punc,0,1,…,9, blue strings CSE4100 Non Terminals: A,B,C,S, black strings T or NT: X,Y,Z Strings of Terminals: u,v,…,z in T* Strings of T / NT: , in ( T NT)* Alternatives of production rules: A 1; A 2; …; A k; A 1 | 2 | … | 1 First NT on LHS of 1st production rule is designated as start symbol ! E E A E | ( E ) | -E | id A+|-|*| / | CH4p1.18 Grammar Concepts A step in a derivation is zero or one action that replaces a NT with the RHS of a production rule. CSE4100 EXAMPLE: E -E (the means “derives” in one step) using the production rule: E -E EXAMPLE: E E A E E * E E * ( E ) DEFINITION: derives in one step + derives in one step * derives in zero steps EXAMPLES: A if A is a production rule * * 1 2 … n 1 n ; for all * and then * If CH4p1.19 How does this relate to Languages? + Let G be a CFG with start symbol S. Then S W (where W has no non-terminals) represents the language + CSE4100 generated by G, denoted L(G). So WL(G) S W. W : is a sentence of G When S (and may have NTs) it is called a sentential form of G. EXAMPLE: id * id is a sentence Here’s the derivation: E E A E E * E id * E id * id Sentential forms * E id * id CH4p1.20 Leftmost and Rightmost Derivations Leftmost: Replace the leftmost non-terminal symbol CSE4100 E E A E id A E id * E id * id lm lm lm lm Rightmost: Replace the leftmost non-terminal symbol E rm EAE E A id E * id id * id rm rm rm Important Notes: A If A , what’s true about ? lm If A , what’s true about ? rm Derivations: Actions to parse input can be represented pictorially in a parse tree. CH4p1.21 Examples of LM / RM Derivations E E A E | ( E ) | -E | id CSE4100 A+|-|*| / | A leftmost derivation of : id + id * id A rightmost derivation of : id + id * id CH4p1.22 Derivations & Parse Tree E E EAE E CSE4100 A E E E*E E A E * E id * E id * id E A E id * E E A E id * id CH4p1.23 Parse Trees and Derivations Consider the expression grammar: E E+E | E*E | (E) | -E | id CSE4100 Leftmost derivations of id + id * id E E EE+E E + E + E id + E E E + E id E id + E id + E * E E + E id E * E CH4p1.24 Parse Tree & Derivations - continued E CSE4100 id + E * E id + id * E E + E id E * E id E id + id * E id + id * id E + E id E * id E id CH4p1.25 Alternative Parse Tree & Derivation CSE4100 EE*E E E+E*E id + E * E id + id * E E id E * E + E id id id + id * id WHAT’S THE ISSUE HERE ? CH4p1.26 Example Pascal Grammar in Yacc %term CSE4100 %binary %left %left %left %left T_AND T_CONST T_TO T_FOR T_IF T_MOD T_OR T_PROGRAM T_STRING T_UNTIL T_ARRAY T_BEGIN T_DIV T_DO T_ELSE T_END T_FUNCTION /* T_GOTO */ T_IN T_INTEGER T_NOT T_REAL /* T_PACKED */ T_NIL T_RECORD T_REPEAT T_THEN T_DOWNTO T_VAR T_WHILE T_CASE T_RANGE T_FILE T_IDENTIFIER T_LABEL T_OF T_PROCEDURE T_SET T_TYPE /* T_WITH */ T_COMMA T_LPAREN T_UPARROW T_PERIOD T_RPAREN T_SEMI T_COLON T_RBRACK T_LT T_EQ T_GT T_PLUS T_MINUS UNARYSIGN T_MULT T_RDIV T_DIV T_NOT T_ASSIGN T_LBRACK T_GE T_OR T_LE T_MOD T_AND T_NE T_IN %% CH4p1.27 Example Pascal Grammar in Yacc Start : Program_Header Declarations Block T_PERIOD Program_Header : T_PROGRAM T_IDENTIFIER T_LPAREN Id_List T_RPAREN T_SEMI CSE4100 Block : T_BEGIN Stt_List T_END Declarations : Const_Def_Part Type_Def_Part Var_Decl_Part Routine_Decl_Part Const_Def_Part : T_CONST Const_Def_List | /* epsilon */ ; Const_Def_List : Const_Def | Const_Def_List Const_Def ; Const_Def : T_IDENTIFIER T_EQ Constant T_SEMI Type_Def_Part : T_TYPE Type_Def_List | /* epsilon */ Type_Def_List : Type_Def | Type_Def_List Type_Def CH4p1.28 Example Pascal Grammar in Yacc Type_Def : T_IDENTIFIER T_EQ Type T_SEMI CSE4100 Var_Decl_Part : T_VAR Var_Decl_List | /* epsilon */ Var_Decl_List : Var_Decl_List Var_Decl | Var_Decl Var_Decl : Id_List T_COLON Type T_SEMI Routine_Decl_Part : Routine_Decl_Part Routine_Decl | /* epsilon */ Routine_Decl : Routine_Head T_IDENTIFIER T_SEMI | Routine_Head Declarations Block T_SEMI Routine_Head : T_PROCEDURE T_IDENTIFIER Params T_SEMI | T_FUNCTION T_IDENTIFIER Params Function_Type T_SEMI Params : T_LPAREN Params_List T_RPAREN | /* epsilon */ Param_Group : Id_List T_COLON Type | T_VAR Id_List T_COLON Type CH4p1.29 Example Pascal Grammar in Yacc Function_Type : T_COLON Type | /* epsilon */ CSE4100 Params_List : Param_Group | Params_List T_SEMI Param_Group Constant : T_STRING | Number | T_PLUS Number | T_MINUS Number Number : T_IDENTIFIER | T_INTEGER | T_REAL Const_List : Constant | Const_List T_COMMA Constant Type : Simple_Type | T_UPARROW T_IDENTIFIER | Structured_Type Simple_Type : T_IDENTIFIER | T_LPAREN Id_List T_RPAREN | Constant T_RANGE Constant Structured_Type : T_ARRAY T_LBRACK Simple_Type_List T_RBRACK T_OF Type | T_SET T_OF Simple_Type | T_RECORD Field_List T_END Simple_Type_List : Simple_Type | Simple_Type_List T_COMMA Simple_Type CH4p1.30 Example Pascal Grammar in Yacc Field_List : Fixed_Part /* Variant_part */ Fixed_Part : Field | Fixed_Part T_SEMI Field CSE4100 Field : /* epsilon */ | Id_List T_COLON Type Variant_part : / * epsilon * / | T_CASE T_IDENTIFIER T_OF Variant_List | T_CASE T_IDENTIFIER T_COLON T_IDENTIFIER T_OF Variant_List Variant_List : Variant | Variant_List T_SEMI Variant Variant : / * epsilon * / | Const_List T_COLON T_LPAREN Field_List T_RPAREN Stt_List : Statement | Stt_List T_SEMI Statement Case_Stt_List : Case_Statement | Case_Stt_List T_SEMI Case_Statement Case_Statement : Const_List T_COLON Statement | /* epsilon */ CH4p1.31 Example Pascal Grammar in Yacc CSE4100 Statement : /* epsilon */ | T_IDENTIFIER | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN | Variable T_ASSIGN Expression | T_BEGIN Stt_List T_END | T_CASE Expression T_OF Case_Stt_List T_END | T_WHILE Expression T_DO Statement | T_REPEAT Stt_List T_UNTIL Expression | T_FOR Variable T_ASSIGN Expression T_TO Expression T_DO Statement | T_FOR Variable T_ASSIGN Expression T_DOWNTO Expression T_DO Statement | T_IF Expression T_THEN Statement | T_IF Expression T_THEN Statement T_ELSE Statement Expression : Expression relop Expression | T_PLUS Expression | T_MINUS Expression | Expression addop Expression | Expression divop Expression | T_NIL | T_STRING | T_INTEGER | T_REAL | Variable | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN | T_LPAREN Expression T_RPAREN | negop Expression | T_LBRACK Element_List T_RBRACK | T_LBRACK T_RBRACK %prec %prec %prec %prec %prec T_LT UNARYSIGN UNARYSIGN T_PLUS T_MULT %prec T_NOT CH4p1.32 Example Pascal Grammar in Yacc Element_List : Element | Element_List T_COMMA Element CSE4100 Element : Expression | Expression T_RANGE Expression Variable : T_IDENTIFIER | Variable T_LBRACK Expr_List T_RBRACK | Variable T_PERIOD T_IDENTIFIER | Variable T_UPARROW Expr_List : Expression | Expr_List T_COMMA Expression relop : T_EQ | T_LT | T_GT | T_NE | T_LE | T_GE | T_IN addop : T_PLUS | T_MINUS | T_OR divop : T_MULT | T_RDIV | T_DIV | T_MOD | T_AND ; negop : T_NOT CH4p1.33 Resolving Grammar Problems/Difficulties Regular Expressions : Basis of Lexical Analysis CSE4100 Reg. Expr. generate/represent regular languages Reg. Languages smallest, most well defined class of languages Context Free Grammars: Basis of Parsing CFGs represent context free languages CFLs contain more powerful languages Reg. Lang. CFLs anbn – CFL that’s not regular! Try to write DFA/NFA for it ! CH4p1.34 Resolving Problems/Difficulties – (2) Since Reg. Lang. Context Free Lang., it is possible to go from reg. expr. to CFGs via NFA. CSE4100 Recall: (a | b)*abb a start a 0 1 b 2 b 3 b CH4p1.35 Resolving Problems/Difficulties – (3) Construct CFG as follows: 1. Each State I has non-terminal Ai : A0, A1, A2, A3 2. If CSE4100 3. If i a i b j j then Ai a Aj then Ai Aj : A0 aA0, A0 aA1 : A0 bA0, A1 bA2 4. : A2 bA3 If I is an accepting state, Ai : A3 5. If I is a starting state, Ai is the start symbol : A0 T={a,b}, NT={A0, A1, A2, A3}, S = A0 PR ={ A0 aA0 | aA1 | bA0 ; A1 bA2 ; A2 3A3 ; A3 } CH4p1.36 How Does This CFG Derive Strings ? a start CSE4100 a 0 1 b 2 b 3 b vs. A0 aA0, A0 aA1 A0 bA0, A1 bA2 A2 bA3, A3 How is abaabb derived in each ? CH4p1.37 Regular Expressions vs. CFGs CSE4100 Regular expressions for lexical syntax 1. CFGs are overkill, lexical rules are quite simple and straightforward 2. REs – concise / easy to understand 3. More efficient lexical analyzer can be constructed 4. RE for lexical analysis and CFGs for parsing promotes modularity, low coupling & high cohesion. Why? CFGs : Match tokens “(“ “)”, begin / end, if-then-else, whiles, proc/func calls, … Intended for structural associations between tokens ! Are tokens in correct order ? CH4p1.38 Resolving Grammar Difficulties : Motivation 1. Humans write / develop grammars 2. Different parsing approaches have different needs CSE4100 LL(k) Recursive LR(k) LALR(k) Top-Down vs. Bottom-Up • For: 1 remove “errors” • For: 2 put / redesign grammar - ambiguity - -moves Grammar - cycles Problems - left recursion - left factoring CH4p1.39 Resolving Problems: Ambiguous Grammars Consider the following grammar segment: stmt if expr then stmt CSE4100 | if expr then stmt else stmt | other (any other statement) What’s problem here ? Consider the Program: if e1 then if e2 then s1 else s2 Else must match to previous then. Structure indicates parse subtree for expression. CH4p1.40 Resulting Parse Tree CSE4100 Easy case Else must match to previous then. if e1 then s1 else if e2 then s2 else s3 CH4p1.41 Example : What Happens with this string? If E1 then if E2 then S1 else S2 CSE4100 How is this parsed ? if E1 then if E2 then S1 else S2 vs. if E1 then if E2 then S1 else S2 What’s the issue here ? CH4p1.42 Parse Trees for Example Form 1: CSE4100 if e1 then if e2 then s1 else s2 Form 2: What’s the issue here ? CH4p1.43 Removing Ambiguity Take Original Grammar: CSE4100 stmt if expr then stmt | if expr then stmt else stmt | other (any other statement) Revise to remove ambiguity: stmt matched_stmt | unmatched_stmt matched_stmt if expr then matched_stmt else matched_stmt | other unmatched_stmt if expr then stmt | if expr then matched_stmt else unmatched_stmt How does this grammar work ? CH4p1.44 Resolving Difficulties : Left Recursion A left recursive grammar has rules that support the + derivation : A A, for some . CSE4100 Top-Down parsing can’t reconcile this type of grammar, since it could consistently make choice which wouldn’t allow termination. A A A A … etc. A A | Take left recursive grammar: A A | To the following: A’ A’ A’ A’ | CH4p1.45 Why is Left Recursion a Problem ? CSE4100 Consider: EE+T | T TT*F | F F ( E ) | id Derive : id + id + id EE+T How can left recursion be removed ? EE+T | T What does this generate? EE+TT+T EE+TE+T+TT+T+T How does this build strings ? What does each string have to start with ? CH4p1.46 Resolving Difficulties : Left Recursion (2) Informal Discussion: CSE4100 Take all productions for A and order as: A A1 | A2 | … | Am | 1 | 2 | … | n Where no i begins with A. Now apply concepts of previous slide: A 1A’ | 2A’ | … | nA’ A’ 1A’ | 2A’| … | m A’ | For our example: EE+T | T TT*F | F F ( E ) | id E TE’ E’ + TE’ | F ( E ) | id T FT’ T’ * FT’ | CH4p1.47 Resolving Difficulties : Left Recursion (3) Problem: If left recursion is two-or-more levels deep, this isn’t enough S Aa | b CSE4100 A Ac | Sd | S Aa Sda Algorithm: Input: Grammar G with no cycles or -productions Output: An equivalent grammar with no left recursion 1. Arrange the non-terminals in some order A1,A2,…An 2. for i := 1 to n do begin for j := 1 to i – 1 do begin replace each production of the form Ai Aj by the productions Ai 1 | 2 | … | k where Aj 1|2|…|k are all current Aj productions; end eliminate the immediate left recursion among Ai productions end CH4p1.48 Using the Algorithm Apply the algorithm to: A1 A2a | b A2 A2c | A1d | CSE4100 i=1 For A1 there is no left recursion i=2 for j=1 to 1 do Take productions: A2 A1 and replace with A2 1 | 2 | … | k | where A1 1 | 2 | … | k are A1 productions in our case A2 A1d becomes A2 A2ad | bd What’s left: A1 A2a | b A2 A2 c | A2 ad | bd | Are we done ? CH4p1.49 Using the Algorithm (2) No ! We must still remove A2 left recursion ! CSE4100 A1 A2a | b A2 A2 c | A2 ad | bd | Recall: A A1 | A2 | … | Am | 1 | 2 | … | n A 1A’ | 2A’ | … | nA’ A’ 1A’ | 2A’| … | m A’ | Apply to above case. What do you get ? CH4p1.50 Removing Difficulties : -Moves Very Simple: A and B uAv implies add B uv to the grammar G. CSE4100 Why does this work ? Examples: E TE’ E’ + TE’ | T FT’ T’ * FT’ | F ( E ) | id A1 A2 a | b A2 bd A2’ | A2’ A2’ c A2’ | bd A2’ | CH4p1.51 Removing Difficulties : Cycles How would cycles be removed ? CSE4100 S SS | ( S ) | Has: S SS S S CH4p1.52 Removing Difficulties : Left Factoring Problem : Uncertain which of 2 rules to choose: CSE4100 stmt if expr then stmt else stmt | if expr then stmt When do you know which one is valid ? What’s the general form of stmt ? A 1 | 2 : if expr then stmt 1: else stmt 2 : Transform to: A A’ A’ 1 | 2 EXAMPLE: stmt if expr then stmt rest rest else stmt | CH4p1.53 Resolving Grammar Problems : Another Look • Forcing Matching Within Grammar CSE4100 - if-then-else else must go with most recent if stmt if expr then stmt | if expr then stmt else stmt | other Leads to: stmt matched_stmt | unmatched_stmt matched_stmt if expr then matched_stmt else matched_stmt | other unmatched_stmt if expr then stmt | if expr then matched_stmt else unmatched_stmt CH4p1.54 Resolving Grammar Problems : Another Look (2) • Left Factoring - Know correct choice in advance ! CSE4100 where_queries where are_does declarations of symbol | where are_does “search_option” appear Rules given in Assignment or where_queries where rest_where rest_where are declarations of symbol | does “search_option” appear CH4p1.55 Resolving Grammar Problems : Another Look (3) • Left Recursion - Avoid infinite loop when choosing rules CSE4100 A A | EXAMPLE: EE+T | T TT*F | F F ( E ) | id A A’ A’ A’ | E TE’ E’ + TE’ | F ( E ) | id T FT’ T’ * FT’ | If we use the original production rules on input: id + id (with leftmost): * T+T+T+…+T EE+TE+T+TE+T+T+T not assured If we use the transformed production rules on input: id + id (with leftmost): * id + TE’ * id + id E’ id + id E TE’ T + TE’ CH4p1.56 Resolving Grammar Problems : Another Look (4) • Removing -Moves CSE4100 E TE’ E’ + TE’ | ET E TE’ E’ + TE’ E’ + T Is there still a problem ? CH4p1.57 Resolving Grammar Problems Note: Not all aspects of a programming language can be represented by context free grammars / languages. CSE4100 Examples: 1. Declaring ID before its use 2. Valid typing within expressions (Type Rules) 3. Parameters in definition vs. in call 4. Method Overloading/hiding These features are call context-sensitive and define yet another language class, CSL. Reg. Lang. CFLs CSLs CH4p1.58 Context-Sensitive Languages - Examples Examples: CSE4100 L1 = { wcw | w is in (a | b)* } : Declare before use L2 = { an bm cn dm | n 1, m 1 } an bm : formal parameter cn dm : actual parameter What does it mean when L is context-sensitive ? CH4p1.59 How do you show a Language is a CFL? L3 = { w c wR | w is in (a | b)* } CSE4100 L4 = { an bm cm dn | n 1, m 1 } L5 = { an bn cm dm | n 1, m 1 } L6 = { an bn | n 1 } CH4p1.60 Solutions L3 = { w c wR | w is in (a | b)* } CSE4100 SaSa | bSb | c L4 = { an bm cm dn | n 1, m 1 } SaSd | aAd A b A c | bc L5 = { an bn cm dm | n 1, m 1 } S XY X a X b | ab Y c Y d | cd L6 = { an bn | n 1 } S a S b | ab CH4p1.61